Hi,

we are now running with the patch, and it has been running fine so far.

When building pdfbox I got the following failing tests:

Failed tests:
  testDateConverter(org.apache.pdfbox.util.TestDateUtil): null expected:<2008-11-0[4T00]:00:00+00:00> but was:<2008-11-0[3T23]:00:00+00:00>
  testExtract(org.apache.pdfbox.util.TestDateUtil): expected:<java.util.GregorianCalendar[time=1115848800000,areFieldsSet=true,areAllFieldsSet=true,lenient=true,zone=sun.util.calendar.ZoneInfo[id="UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=2,minimalDaysInFirstWeek=4,ERA=1,YEAR=2005,MONTH=4,WEEK_OF_YEAR=19,WEEK_OF_MONTH=2,DAY_OF_MONTH=11,DAY_OF_YEAR=131,DAY_OF_WEEK=4,DAY_OF_WEEK_IN_MONTH=2,AM_PM=1,HOUR=10,HOUR_OF_DAY=22,MINUTE=0,SECOND=0,MILLISECOND=0,ZONE_OFFSET=0,DST_OFFSET=0]> but was:<java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=2,minimalDaysInFirstWeek=4,ERA=?,YEAR=2005,MONTH=4,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=12,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=?,ZONE_OFFSET=?,DST_OFFSET=?]>

Best regards
Florian

2013/8/5 Andreas Lehmkuehler <[email protected]>

> Hi,
>
> did you try to apply Christian's patch?
>
> On 05.08.2013 14:04, Florian Over wrote:
>
>> Hi,
>> this is really hitting us hard in production.
>> Is anyone working on this already?
>>
>> Maybe we will try the timeout for now.
>>
>> Best regards
>> Florian Over
>>
>> 2013/7/3 Christian Kohlschütter (JIRA) <[email protected]>
>>
>>> [ https://issues.apache.org/jira/browse/PDFBOX-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>
>>> Christian Kohlschütter updated PDFBOX-1585:
>>> -------------------------------------------
>>>
>>>     Attachment: PDFBOX-1585.patch
>>>
>>> We had a similar problem; thanks for providing the problematic PDF.
>>>
>>> With the help of your stack trace, it was pretty easy to figure out that
>>> pdfbox was hanging in an endless loop when reading from an InputStream
>>> that reached its end (EOF).
>>>
>>> A patch is attached.
>>>
>>> PS: There are some other places in pdfbox that also might loop because
>>> InputStream#read() is not checked for -1 (EOF), but this here probably is
>>> the most important one.
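The endless loop described in the PS above can be shown in isolation. The following is only a minimal sketch of the general pattern, not the attached patch and not pdfbox code: the class HexStringReader and the method readHexString are hypothetical names made up for illustration. It reads hex digits up to a closing '>' and treats a return value of -1 from InputStream#read() as end of stream instead of waiting for a delimiter that will never arrive.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of pdfbox.
final class HexStringReader {

    /**
     * Collects hex digits until the closing '>' is seen.
     * Throws an IOException if the stream ends before that.
     */
    static String readHexString(InputStream in) throws IOException {
        StringBuilder hex = new StringBuilder();
        while (true) {
            int c = in.read();
            if (c == -1) {
                // Without this check, a loop that only waits for '>' never exits,
                // because read() keeps returning -1 once the stream is exhausted.
                throw new IOException("Unexpected EOF inside hex string");
            }
            if (c == '>') {
                return hex.toString();
            }
            if (Character.digit(c, 16) >= 0) {
                hex.append((char) c);
            }
            // Any other byte (whitespace, garbage) is skipped in this sketch.
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] complete = "48 65 6C>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readHexString(new ByteArrayInputStream(complete)));

        byte[] truncated = "48 65".getBytes(StandardCharsets.US_ASCII);
        try {
            readHexString(new ByteArrayInputStream(truncated));
        } catch (IOException e) {
            System.out.println("truncated input: " + e.getMessage());
        }
    }
}
```

Whether a truncated string should raise an exception or be returned as-is is a design choice; the sketch throws so that the caller notices the damaged input.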
>>>
>>>> org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely
>>>> -------------------------------------------------------------------------------------
>>>>
>>>>                 Key: PDFBOX-1585
>>>>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1585
>>>>             Project: PDFBox
>>>>          Issue Type: Bug
>>>>          Components: PDFReader, Text extraction
>>>>    Affects Versions: 1.8.1
>>>>         Environment: Ubuntu Linux 10.04
>>>>                      Solaris 10
>>>>                      Java 1.6.0_34
>>>>            Reporter: Sascha Szott
>>>>         Attachments: PDFBOX-1585.patch
>>>>
>>>>
>>>> URL of the problematic pdf file is http://www.redalyc.org/pdf/540/54017220.pdf
>>>> My program tries to extract the fulltext of the given pdf file in the following manner:
>>>> {code}
>>>> String fileName = "/home/sascha/testfile.pdf";          // 1
>>>> PDDocument pdDoc = PDDocument.load(fileName, true);     // 2
>>>> PDFTextStripper text = new PDFTextStripper();           // 3
>>>> String fullText = text.getText(pdDoc);                  // 4
>>>> {code}
>>>> The call in line 4 causes the thread to block indefinitely (runs now for more than two days without making any progress). The file is stored in a local file system (no network interaction occurs).
>>>> jstack indicates that the thread is not deadlocked:
>>>> {code}
>>>> "main" prio=10 tid=0x000000004187d800 nid=0x6ed8 runnable [0x00007f9e28e56000]
>>>>    java.lang.Thread.State: RUNNABLE
>>>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>>         - locked <0x00000007d73a84a0> (a java.io.BufferedInputStream)
>>>>         at java.io.FilterInputStream.read(FilterInputStream.java:66)
>>>>         at java.io.PushbackInputStream.read(PushbackInputStream.java:122)
>>>>         at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:91)
>>>>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1006)
>>>>         at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808)
>>>>         at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260)
>>>>         at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
>>>>         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182)
>>>>         at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>>>         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>>>         at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>>>         at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>>>>         at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>>>>         at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>>>>         at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
>>>>         at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
>>>>         at de.kobv.ked.extraction.FulltextExtraction.getFulltext(FulltextExtraction.java:65)
>>>> {code}
>>>> Any idea or advice on how to fix that problem? Is it possible to set up a timeout for the extraction operation?
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> If you think it was sent incorrectly, please contact your JIRA administrators
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>
> BR
> Andreas Lehmkühler
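
On the timeout question raised in the issue: one possible workaround, sketched here with plain java.util.concurrent and no pdfbox-specific API, is to run the extraction on a separate thread and stop waiting after a deadline. The names TimedExtraction and runWithTimeout are hypothetical, and the busy loop in main is only a stand-in for a call such as text.getText(pdDoc) that never returns. Note the limitation: cancelling the Future only sets the interrupt flag, so a thread spinning in a loop that never checks it keeps consuming CPU. The sketch therefore uses a daemon thread so that a stuck worker at least cannot keep the JVM alive.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; not part of pdfbox.
final class TimedExtraction {

    /** Runs the task on a daemon thread and stops waiting after the timeout. */
    static <T> T runWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "extraction-worker");
                t.setDaemon(true); // a stuck worker must not keep the JVM alive
                return t;
            }
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Only sets the interrupt flag; a loop that never checks it keeps running.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an extraction call that never returns.
        Callable<String> stuck = new Callable<String>() {
            @Override
            public String call() {
                while (true) {
                    // busy loop that ignores interruption
                }
            }
        };
        try {
            runWithTimeout(stuck, 200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("extraction timed out");
        }
    }
}
```

Because the stuck thread is abandoned rather than stopped, this only contains the symptom. For a long-running service, isolating the extraction in a separate process that can be killed is the more robust variant of the same idea.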
