Sascha Szott created PDFBOX-1585: ------------------------------------ Summary: org.apache.pdfbox.util.PDFTextStripper.getText() causes thread to block indefinitely Key: PDFBOX-1585 URL: https://issues.apache.org/jira/browse/PDFBOX-1585 Project: PDFBox Issue Type: Bug Components: PDFReader, Text extraction Affects Versions: 1.8.1 Environment: Ubuntu Linux 10.04 Solaris 10 Java 1.6.0_34 Reporter: Sascha Szott
URL of the problematic pdf file is http://www.redalyc.org/pdf/540/54017220.pdf My program tries to extract the fulltext of the given pdf file in the following manner: {code} String fileName = "/home/sascha/testfile.pdf" // 1 PDDocument pdDoc = PDDocument.load(fileName, true); // 2 PDFTextStripper text = new PDFTextStripper(); // 3 String fullText = text.getText(pdDoc); // 4 {code} The call in line 4 causes the thread to block indefinitely (runs now for more than two days without making any progress). The file is stored in a local file system (no network interaction occurs). jstack indicates that the thread is not deadlocked: {code} "main" prio=10 tid=0x000000004187d800 nid=0x6ed8 runnable [0x00007f9e28e56000] java.lang.Thread.State: RUNNABLE at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - locked <0x00000007d73a84a0> (a java.io.BufferedInputStream) at java.io.FilterInputStream.read(FilterInputStream.java:66) at java.io.PushbackInputStream.read(PushbackInputStream.java:122) at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:91) at org.apache.pdfbox.pdfparser.BaseParser.parseCOSHexString(BaseParser.java:1006) at org.apache.pdfbox.pdfparser.BaseParser.parseCOSString(BaseParser.java:808) at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:260) at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46) at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:182) at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:194) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67) at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:67) at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215) at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455) at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379) at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335) at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254) at de.kobv.ked.extraction.FulltextExtraction.getFulltext(FulltextExtraction.java:65) {code} Any idea or advice on how to fix that problem? Is it possible to set up a timeout for the extraction operation? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira