[ https://issues.apache.org/jira/browse/SOLR-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191811#comment-13191811 ]
Jan Høydahl commented on SOLR-1786:
-----------------------------------

Tested the linked PDF file with tika-app-1.1-SNAPSHOT.jar and it does not parse, even though I gave it 2 GB of RAM:

{noformat}
java -jar target/tika-app-1.1-SNAPSHOT.jar http://cdsweb.cern.ch/record/702585/files/sl-note-2000-019.pdf -m
[...]
<p>ERROR - Stop reading corrupt stream
WARN - java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.pdfbox.util.operator.Concatenate.process(Concatenate.java:47)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:63)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:105)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
WARN - java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
[...]
WARN - Bad Dictionary Declaration org.apache.pdfbox.io.PushBackInputStream@7433b121
WARN - Invalid dictionary, found: '�' but expected: '/'
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@6db22920
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
Caused by: java.lang.NullPointerException
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:368)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:175)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:63)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:105)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}

Trying to extract the text with PDFBox 1.7 also failed:

{noformat}
java -Xmx3G -jar pdfbox-app-1.7.0-SNAPSHOT.jar ExtractText -debug sl-note-2000-019.pdf
[...]
ExtractText failed with the following exception:
java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:115)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
{noformat}

So you should probably pursue this on the PDFBOX mailing list/JIRA, and then let a possible fix bubble up through TIKA to Solr.

> Solr (trunk rev. 912116) suffers from PDFBOX-537 [Endless loop in
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary()] fixed in PDFbox
> 1.0?
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1786
>                 URL: https://issues.apache.org/jira/browse/SOLR-1786
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 1.5
>         Environment: Ubuntu 9.10, 32bit
>            Reporter: Jan Iwaszkiewicz
>            Priority: Critical
>              Labels: PDFbox
>             Fix For: 3.6, 4.0
>
>
> I tried indexing several thousand PDF documents but could not finish as Solr
> was falling into an endless loop for some of them, for instance:
> http://cdsweb.cern.ch/record/702585/files/sl-note-2000-019.pdf (the PDF seems
> OK).
> Can Solr start using PDFbox 1.0?

--
This message is automatically generated by JIRA.