[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046728#comment-14046728 ]
Tilman Hausherr commented on TIKA-1300:
---------------------------------------

I had a look at most of the files. This resulted in PDFBOX-2163 (7 files, will be fixed in 1.8 this weekend) and PDFBOX-2167 (1 file). The rest are really broken, some of them so badly that even Acrobat can't open them. Many have incorrect xref tables. One has a broken LZW stream, so that even Acrobat displays just a part of the text. One I believe I've seen before (I think brought up by William Palmer): it has a PDF stream that had two threads writing to it at the same time.

Yes, TIKA-1205 should be done.

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the
> NonSequentialParser. We added a parameter to use the NonSequentialParser in
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?

--
This message was sent by Atlassian JIRA
(v6.2#6252)