[ https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047556#comment-14047556 ]
Tim Allison commented on TIKA-1300:
-----------------------------------

[~tilman], I'm sorry for not responding to your earlier question.

First, thank you so much for taking a look at the files. I didn't intend for you to take action immediately. I was only trying to see what the difference between the two parsers might be on a publicly available set. <tongue-in-cheek degree="mild" gratitude="high">Now that you're going to fix all of these things, we'll have to go find another test set</tongue-in-cheek>. Thank you!

As for the question about what we intend to do: yes, absolutely, the goal of TIKA-1302 is to include a run against a corpus as part of continuous integration. I wouldn't necessarily call it comparing to a "gold standard." I'd reserve that term for human-adjudicated "best effort" text extraction; i.e., we'd compare the output with a Save As or some other human-generated or human-judged version of the best possible expected text extraction. But your point about testing SNAPSHOT to see if there is a regression is exactly the goal.

See (and contribute/respond to?) TIKA-1332 for a discussion of possible statistics/outputs. The idea is to collect statistics about an individual run or (as I did here) to compare two runs and look for differences. At the Tika level, we'd probably want to keep comparing to the last two or three releases, at least to start; but the more interesting comparison would be between the last release and the latest SNAPSHOT.

My proposal/invitation is either for PDFBox to run their own version of this process or for PDFBox and Tika to collaborate so that Tika is running the tests against PDFBox's SNAPSHOT. This methodology might also be useful to compare PDFBox 1.8.x against trunk or to compare PDFBox to other PDF extractors (heresy, my apologies!).

There will _always_ be room for improvement in the corpus, and we will always be on the lookout to improve automated indicators of type 2 and type 3 problems (see TIKA-1332).

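
To make the "compare two runs and look for differences" idea concrete, here is a minimal sketch in Python. It is not the TIKA-1302 or TIKA-1332 tooling. The run format (a dict from file name to extracted text, with None meaning extraction failed), the function names, and the token-overlap statistic are all assumptions chosen for illustration.

```python
from collections import Counter


def token_counts(text):
    """Count whitespace-delimited tokens, lowercased."""
    return Counter(text.lower().split())


def compare_runs(run_a, run_b):
    """Compare two extraction runs over the same corpus.

    Hypothetical format: each run maps a file name to its extracted
    text, or to None if extraction failed. A file that is missing from
    a run is treated the same as a failure in that run.
    """
    report = {}
    for name in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(name), run_b.get(name)
        if a is None and b is None:
            report[name] = {"status": "failed in both"}
        elif a is None:
            report[name] = {"status": "fixed in B"}
        elif b is None:
            report[name] = {"status": "regression in B"}
        else:
            ca, cb = token_counts(a), token_counts(b)
            # Multiset intersection: tokens present in both outputs.
            shared = sum((ca & cb).values())
            total = max(sum(ca.values()), sum(cb.values()))
            report[name] = {
                "status": "same" if a == b else "changed",
                "tokens_a": sum(ca.values()),
                "tokens_b": sum(cb.values()),
                # 1.0 means identical token multisets; lower values
                # flag files worth inspecting by hand.
                "overlap": shared / total if total else 1.0,
            }
    return report
```

Run A could be the last release and run B the current SNAPSHOT; files reported as "regression in B" or with a low overlap would be the first candidates for manual review.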
> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
> Key: TIKA-1300
> URL: https://issues.apache.org/jira/browse/TIKA-1300
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.7
>
> Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the
> NonSequentialParser. We added a parameter to use the NonSequentialParser in
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?

--
This message was sent by Atlassian JIRA
(v6.2#6252)