[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046288#comment-14046288 ]
Tilman Hausherr commented on TIKA-1300:
---------------------------------------

I don't do much with text extraction, but what we could use (and sorry if this is what you already do) is a diff between versions, i.e. the extraction results would be compared with a "current gold standard". This could be done _with the snapshot versions_ of PDFBox and the other components you use. That way you would quickly notice whether you get worse or better results, and you wouldn't have to wait for a release to discover a regression.

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the
> NonSequentialParser. We added a parameter to use the NonSequentialParser in
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
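
The comparison against a "current gold standard" suggested in the comment could look roughly like the sketch below. It is only an illustration: the class name `GoldStandardCheck` and its methods are hypothetical, the token-set overlap is a deliberately simple stand-in for a real diff, and the two strings are assumed to be text already extracted by the stored baseline run and by a snapshot build respectively.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: compares freshly extracted text with a stored gold standard.
public class GoldStandardCheck {

    // Split extracted text into a set of whitespace-delimited tokens.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard overlap of the two token sets; 1.0 means the sets are identical.
    static double overlap(String gold, String current) {
        Set<String> g = tokens(gold);
        Set<String> c = tokens(current);
        if (g.isEmpty() && c.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(g);
        intersection.retainAll(c);
        Set<String> union = new HashSet<>(g);
        union.addAll(c);
        return (double) intersection.size() / union.size();
    }

    // Tokens that the gold standard contains but the current output lost (sorted).
    static Set<String> missing(String gold, String current) {
        Set<String> m = new TreeSet<>(tokens(gold));
        m.removeAll(tokens(current));
        return m;
    }

    public static void main(String[] args) {
        String gold = "Hello PDF world";   // assumed stored baseline text
        String current = "Hello PDF";      // assumed output of a snapshot build
        System.out.println("overlap = " + overlap(gold, current));
        System.out.println("missing = " + missing(gold, current));
    }
}
```

Run per document against each snapshot build, a drop in the overlap score or a non-empty `missing` set would flag a possible regression before a release; a rise would suggest an improvement and a reason to update the gold standard.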