Re: [jira] [Updated] (PDFBOX-1585) org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely

Florian Over Tue, 06 Aug 2013 05:03:03 -0700

Hi,
We are now running with the patch, and it has worked fine so far.

When building pdfbox I got the following failing tests:
Failed tests:   testDateConverter(org.apache.pdfbox.util.TestDateUtil):
null expected:<2008-11-0[4T00]:00:00+00:00> but
was:<2008-11-0[3T23]:00:00+00:00>
  testExtract(org.apache.pdfbox.util.TestDateUtil):
expected:<java.util.GregorianCalendar[time=1115848800000,areFieldsSet=true,areAllFieldsSet=true,lenient=true,zone=sun.util.calendar.ZoneInfo[id="UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=2,minimalDaysInFirstWeek=4,ERA=1,YEAR=2005,MONTH=4,WEEK_OF_YEAR=19,WEEK_OF_MONTH=2,DAY_OF_MONTH=11,DAY_OF_YEAR=131,DAY_OF_WEEK=4,DAY_OF_WEEK_IN_MONTH=2,AM_PM=1,HOUR=10,HOUR_OF_DAY=22,MINUTE=0,SECOND=0,MILLISECOND=0,ZONE_OFFSET=0,DST_OFFSET=0]>
but
was:<java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=2,minimalDaysInFirstWeek=4,ERA=?,YEAR=2005,MONTH=4,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=12,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=?,ZONE_OFFSET=?,DST_OFFSET=?]>
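
A possible reading of the testDateConverter failure, not confirmed anywhere in this thread: expected and actual differ by exactly one hour, which is what a date parsed as local midnight looks like when it is printed in UTC on a UTC+1 build machine. The sketch below only illustrates that effect. The class name, the helper and the Europe/Berlin zone are assumptions, and it does not use PDFBox's own date code.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class; not part of PDFBox or its test suite.
final class TimeZoneShiftDemo {

    /** Parses a day as midnight in the given zone and prints that instant in UTC. */
    static String localMidnightAsUtc(String day, String zoneId) throws ParseException {
        SimpleDateFormat local = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
        local.setTimeZone(TimeZone.getTimeZone(zoneId));
        Date instant = local.parse(day);

        SimpleDateFormat utc = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.US);
        utc.setTimeZone(TimeZone.getTimeZone("UTC"));
        return utc.format(instant);
    }

    public static void main(String[] args) throws ParseException {
        // Assumed build-machine zone (UTC+1 in November): prints 2008-11-03T23:00:00
        System.out.println(localMidnightAsUtc("2008-11-04", "Europe/Berlin"));
        // Same input interpreted in UTC: prints 2008-11-04T00:00:00
        System.out.println(localMidnightAsUtc("2008-11-04", "UTC"));
    }
}
```

If that is what happens here, the two failures would depend on the default time zone of the build machine rather than on the patch.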



Best regards Florian


2013/8/5 Andreas Lehmkuehler <[email protected]>

> Hi,
>
> Did you try applying Christian's patch?
>
> On 05.08.2013 14:04, Florian Over wrote:
>
>> Hi,
>> this is really hitting us hard in production.
>> Is anyone working on this already?
>>
>> Maybe we will try the timeout for now.
>>
>> Best regards
>> Florian Over
>>
>>
>> 2013/7/3 Christian Kohlschütter (JIRA) <[email protected]>
>>
>>
>>>      [ https://issues.apache.org/jira/browse/PDFBOX-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>
>>> Christian Kohlschütter updated PDFBOX-1585:
>>> -------------------------------------------
>>>
>>>      Attachment: PDFBOX-1585.patch
>>>
>>> We had a similar problem; thanks for providing the problematic PDF.
>>>
>>> With the help of your stack trace, it was pretty easy to figure out that
>>> pdfbox was hanging in an endless loop when reading from an InputStream
>>> that reached its end (EOF).
>>>
>>> A patch is attached.
>>>
>>> PS: There are some other places in pdfbox that also might loop because
>>> InputStream#read() is not checked for -1 (EOF), but this here probably is
>>> the most important one.
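
The attached patch itself is not quoted in this thread, so the following is only a minimal sketch of the kind of check being described, assuming a loop that scans for the closing '>' of a hex string. The class and method names are hypothetical and this is not the BaseParser code. Without the -1 test, read() keeps returning -1 at EOF, the terminator is never seen, and the loop spins forever.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration; not the code from PDFBOX-1585.patch.
final class EofSafeHexReader {

    /** Collects characters up to (not including) the closing '>'. */
    static String readUntilClose(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        // read() returns -1 at end of stream; stop there instead of looping on.
        while ((c = in.read()) != -1 && c != '>') {
            sb.append((char) c);
        }
        if (c == -1) {
            // Truncated input: fail loudly rather than hang.
            throw new IOException("Unexpected end of stream inside hex string");
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream ok = new ByteArrayInputStream("48656C6C6F>".getBytes("US-ASCII"));
        System.out.println(readUntilClose(ok)); // prints 48656C6C6F
    }
}
```

Whether a parser should throw on truncated input or return what it has read so far is a design choice; the sketch throws so the caller notices the broken file.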
>>>
>>>> org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely
>>>> -------------------------------------------------------------------------------------
>>>>
>>>>                  Key: PDFBOX-1585
>>>>                  URL: https://issues.apache.org/jira/browse/PDFBOX-1585
>>>>              Project: PDFBox
>>>>           Issue Type: Bug
>>>>           Components: PDFReader, Text extraction
>>>>     Affects Versions: 1.8.1
>>>>          Environment: Ubuntu Linux 10.04
>>>> Solaris 10
>>>> Java 1.6.0_34
>>>>             Reporter: Sascha Szott
>>>>          Attachments: PDFBOX-1585.patch
>>>>
>>>>
>>>> URL of the problematic pdf file is
>>>> http://www.redalyc.org/pdf/540/54017220.pdf
>>>> My program tries to extract the fulltext of the given pdf file in the
>>>> following manner:
>>>> {code}
>>>> String fileName = "/home/sascha/testfile.pdf";      // 1
>>>> PDDocument pdDoc = PDDocument.load(fileName, true); // 2
>>>> PDFTextStripper text = new PDFTextStripper();       // 3
>>>> String fullText = text.getText(pdDoc);              // 4
>>>> {code}
>>>> The call in line 4 causes the thread to block indefinitely (it has now been
>>>> running for more than two days without making any progress). The file is
>>>> stored in a local file system (no network interaction occurs).
>>>> jstack indicates that the thread is not deadlocked:
>>>> {code}
>>>> "main" prio=10 tid=0x000000004187d800 nid=0x6ed8 runnable [0x00007f9e28e56000]
>>>>    java.lang.Thread.State: RUNNABLE
>>>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>>         - locked <0x00000007d73a84a0> (a java.io.BufferedInputStream)
>>>>         at java.io.FilterInputStream.read(FilterInputStream.java:66)
>>>>         at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
>>>>         at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:91)
>>>>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1006)
>>>>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
>>>>         at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
>>>>         at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
>>>>         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)
>>>>         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>>>         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>>>         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>>>>         at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>>>>         at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>>>>         at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
>>>>         at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
>>>>         at de.kobv.ked.extraction.FulltextExtraction.getFulltext(FulltextExtraction.java:65)
>>>> {code}
>>>> Any idea or advice on how to fix that problem? Is it possible to set up
>>>> a timeout for the extraction operation?
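
On the timeout question: PDFTextStripper in this thread is not shown to offer a timeout of its own, so one generic workaround is to run the extraction in a separate thread and give up waiting after a deadline. The following is a sketch using only java.util.concurrent; the class name, method name and thread name are made up for illustration. Note the limitation: cancel(true) only interrupts the worker, and a thread spinning in a read loop like the one in the stack trace will ignore the interrupt and keep burning CPU. The worker is therefore a daemon thread so that it at least cannot keep the JVM alive.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; not part of PDFBox.
final class TimedExtraction {

    /** Runs the task on a daemon thread and stops waiting after the timeout. */
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "pdf-extraction");
                t.setDaemon(true); // a stuck worker must not block JVM shutdown
                return t;
            }
        });
        try {
            Future<T> future = executor.submit(task);
            try {
                return future.get(timeout, unit);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt only; a busy loop may ignore it
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in task; in the reporter's program this would wrap text.getText(pdDoc).
        String result = callWithTimeout(new Callable<String>() {
            @Override
            public String call() {
                return "extracted text";
            }
        }, 10, TimeUnit.SECONDS);
        System.out.println(result); // prints extracted text
    }
}
```

Because a hung worker cannot be stopped reliably from within the same JVM, running the extraction in a separate process that can be killed is the more robust variant of the same idea.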
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> If you think it was sent incorrectly, please contact your JIRA
>>> administrators
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>
>>
> BR
> Andreas Lehmkühler
>
