Tilman Hausherr created PDFBOX-2823:
---------------------------------------

             Summary: StringIndexOutOfBoundsException when doing DateConverter.parseDate()
                 Key: PDFBOX-2823
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2823
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 1.8.9, 1.8.10
            Reporter: Tilman Hausherr
            Assignee: Tilman Hausherr
             Fix For: 1.8.10


From Kevin J. on the user mailing list:

We are currently using Apache Solr / Tika to index documents for searching. The 
exact version that is being used is version 1.8.8 of PDFBox.

We came across a document that produced this stack trace (trimmed to the 
relevant part of PDFBox):
{code}
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 1
        at java.lang.String.charAt(String.java:658)
        at org.apache.pdfbox.util.DateConverter.parseDate(DateConverter.java:679)
        at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:808)
        at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:780)
        at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:753)
        at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:849)
        at org.apache.pdfbox.pdmodel.PDDocumentInformation.getCreationDate(PDDocumentInformation.java:212)
{code}

Inspection of the document's binary revealed that it contained a creationDate 
consisting of a single space character (ASCII 0x20), which is probably illegal. 
I managed to reproduce the error with a small snippet like so:
{code}
File file = new File("/path/to/document/bad.pdf");
InputStream stream = new FileInputStream(file);
PDFParser parser = new PDFParser(stream);
parser.parse();
PDDocumentInformation info = parser.getPDDocument().getDocumentInformation();
Calendar creationDate = info.getCreationDate();
System.out.println(creationDate.toString());
{code}
This produces the same stack trace. I verified this against the latest 1.8.9 
build from the site, and the behavior remains. This looks very similar to 
PDFBOX-1803, but that issue is marked as resolved in 1.8.5. So, my 
questions:

  *   Is the exception expected behavior? Ideally Tika would just index the 
document anyway; the creation date isn't important to us. Tika had an issue for 
this, TIKA-1233, which is marked as fixed by swallowing the exception, but 
judging by its comments, the try/catch was removed in r1593983 since the 
problem is marked as fixed here.
  *   Is this a regression, or somehow slightly different from PDFBOX-1803? 
Shall I create a new issue or get the existing PDFBOX-1803 re-opened?
  *   The PDF that reproduces the issue can be downloaded here: 
https://www.dropbox.com/s/tll5rscrlt95xuc/bad.pdf?dl=0
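
For illustration only, here is a minimal sketch of the kind of caller-side 
guard discussed above (tolerating the bad date instead of failing the whole 
document). It is not the PDFBox fix and not Tika's actual code: the class 
CreationDateGuard and the helpers isBlankDate and dateOrNull are hypothetical 
names, and the lambda merely stands in for info.getCreationDate() by indexing 
past the end of a one-character string, the same failure the stack trace 
shows in String.charAt.

```java
import java.util.Calendar;
import java.util.concurrent.Callable;

// Hypothetical helper class; the names are assumptions made for this sketch.
public class CreationDateGuard {

    /** True when a raw date string is null, empty, or only whitespace. */
    public static boolean isBlankDate(String raw) {
        return raw == null || raw.trim().isEmpty();
    }

    /**
     * Runs a date lookup and returns null instead of propagating a failure.
     * Catching Exception broadly keeps the sketch short; real code may want
     * to catch something narrower and log it.
     */
    public static Calendar dateOrNull(Callable<Calendar> lookup) {
        try {
            return lookup.call();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // A single ASCII 0x20, as found in the reported document.
        String raw = " ";
        System.out.println(isBlankDate(raw));

        // Stand-in for info.getCreationDate(): charAt(1) on a string of
        // length 1 throws StringIndexOutOfBoundsException.
        Calendar result = dateOrNull(() -> {
            raw.charAt(1);
            return Calendar.getInstance();
        });
        System.out.println(result);
    }
}
```

With a guard of this shape a caller would see a missing creation date rather 
than an aborted parse, which matches the "index the document anyway" behavior 
asked about in the first question.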



