Please don't delete comments that are referenced by follow-ups.

BR
Andreas Lehmkühler

On 16.12.2013 17:09, Tilman Hausherr (JIRA) wrote:

      [ https://issues.apache.org/jira/browse/PDFBOX-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-1769:
------------------------------------

     Comment: was deleted

(was: There's another problem in that parser:
I get this exception with the file amyuni2_05d__pdf1_3_acro4x.pdf (it was once 
part of the project and has since been removed, but it can still be found on the web):
java.io.IOException: Object (48:0) at offset 161333 does not end with 'endobj'.
     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1312)
     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1159)
     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseDictObjects(NonSequentialPDFParser.java:1133)
     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:470)
     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:731)
     at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
     at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1122)
     at pdfboxpageimageextraction.ExtractImages.doPdf(ExtractImages.java:134)
     at pdfboxpageimageextraction.ExtractImages.main(ExtractImages.java:78)

The message is accurate: the 'endobj' keyword is indeed missing in that file. 
However, the content of endObjectKey is "49 0 obj", i.e. the start of the next object.

So my suggestion is to change the segment at

{code}
if (!endObjectKey.startsWith("endobj"))
{
       throw new IOException("Object (" + readObjNr + ":" + readObjGen + ") at offset "
                     + offsetOrObjstmObNr + " does not end with 'endobj'.");
}
{code}

to
{code}
  if (!endObjectKey.startsWith("endobj"))
  {
      if (endObjectKey.endsWith(" obj"))
      {
          // the next object starts right away: tolerate the missing 'endobj'
          LOG.warn("Object (" + readObjNr + ":" + readObjGen + ") at offset "
              + offsetOrObjstmObNr + " does not end with 'endobj' but with '"
              + endObjectKey + "'");
      }
      else
      {
          throw new IOException("Object (" + readObjNr + ":" + readObjGen + ") at offset "
              + offsetOrObjstmObNr + " does not end with 'endobj' but with '"
              + endObjectKey + "'");
      }
  }
{code})
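
The decision in the suggested change can be tried outside the parser. The following is only a minimal sketch under stated assumptions, not the PDFBox code: the class EndObjCheck, the enum Result and the methods classify and describe are hypothetical names, and the warning and the IOException are replaced by a return value so that the three cases are easy to compare.

```java
// Hypothetical stand-alone helper; not part of PDFBox.
final class EndObjCheck {

    /** Outcome of inspecting the token that follows an object body. */
    enum Result { OK, TOLERATED, FATAL }

    /**
     * Classifies the token read where 'endobj' is expected.
     * OK:        the token starts with "endobj".
     * TOLERATED: the token looks like the header of the next object,
     *            e.g. "49 0 obj" (the parser would only log a warning).
     * FATAL:     anything else (the parser would throw an IOException).
     */
    static Result classify(String endObjectKey) {
        if (endObjectKey.startsWith("endobj")) {
            return Result.OK;
        }
        if (endObjectKey.endsWith(" obj")) {
            return Result.TOLERATED;
        }
        return Result.FATAL;
    }

    /** Builds the message used for both the warning and the exception. */
    static String describe(long objNr, int objGen, long offset, String endObjectKey) {
        return "Object (" + objNr + ":" + objGen + ") at offset " + offset
                + " does not end with 'endobj' but with '" + endObjectKey + "'";
    }
}
```

With the values from the stack trace above, classify("49 0 obj") falls into the tolerated case, while an unrelated token still counts as fatal.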

Fix crash on invalid xref
-------------------------

                 Key: PDFBOX-1769
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1769
             Project: PDFBox
          Issue Type: Wish
          Components: Parsing
    Affects Versions: 1.8.2
            Reporter: William Palmer
            Assignee: Andreas Lehmkühler
             Fix For: 1.8.4, 2.0.0


Need to search for a correct xref start address
Example file:
http://digitalcorpora.org/corp/nps/files/govdocs1/020/020747.pdf
Exception in thread "main" java.io.IOException: Error: Expected an integer type, actual='ref'
at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1622)
Using the code:
PDFTextStripper ts = new PDFTextStripper();
PrintWriter out = new PrintWriter(new FileWriter(new File (pFile+".txt")));
RandomAccess scratchFile = new RandomAccessFile(File.createTempFile("pdfbox-", ".tmp"), "rw");
PDDocument doc = PDDocument.loadNonSeq(new File(pFile), scratchFile);
ts.setForceParsing(true);
ts.writeText(doc, out);
Related: PDFBOX-1757
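
One possible way to "search for a correct xref start address" is to look for the xref keyword close to the offset given after startxref. This is only a sketch under stated assumptions and not necessarily what the fix in PDFBox does: the class XrefLocator and its methods are hypothetical, the whole file is assumed to be available as a byte array, and only classic cross-reference tables (keyword "xref") are handled, not cross-reference streams.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of PDFBox.
final class XrefLocator {
    private static final byte[] XREF = "xref".getBytes(StandardCharsets.ISO_8859_1);
    private static final byte[] START = "start".getBytes(StandardCharsets.ISO_8859_1);

    /** True if token occurs in data at position pos (bounds-checked). */
    static boolean matchesAt(byte[] data, int pos, byte[] token) {
        if (pos < 0 || pos + token.length > data.length) {
            return false;
        }
        for (int i = 0; i < token.length; i++) {
            if (data[pos + i] != token[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if "xref" starts at pos and is not the tail of "startxref". */
    static boolean isXrefKeyword(byte[] data, int pos) {
        return matchesAt(data, pos, XREF)
                && !matchesAt(data, pos - START.length, START);
    }

    /**
     * Returns the offset of the "xref" keyword nearest to claimedOffset,
     * looking at most window bytes in either direction, or -1 if none.
     */
    static int findXref(byte[] data, int claimedOffset, int window) {
        for (int d = 0; d <= window; d++) {
            if (isXrefKeyword(data, claimedOffset - d)) {
                return claimedOffset - d;
            }
            if (isXrefKeyword(data, claimedOffset + d)) {
                return claimedOffset + d;
            }
        }
        return -1;
    }
}
```

An offset that points one byte too far, so that the parser sees 'ref' instead of 'xref', would be corrected by findXref with a small window. The keyword inside startxref is skipped on purpose.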



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

