[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

Tim Allison (JIRA) Mon, 30 Jun 2014 04:06:07 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047556#comment-14047556
 ]


Tim Allison commented on TIKA-1300:
-----------------------------------

[~tilman], I'm sorry for not responding to your earlier question.  

First, thank you so much for taking a look at the files.  I didn't intend for 
you to take action immediately.  I was only trying to see what the difference 
between the two parsers might be on a publicly available set.  
<tongue-in-cheek degree="mild" gratitude="high">Now that you're going to fix 
all of these things, we'll have to go find another test set</tongue-in-cheek>.  
Thank you!

As for the question about what we intend to do: yes, absolutely, the goal of 
TIKA-1302 is to include a run against a corpus as part of continuous 
integration.  I wouldn't necessarily call it comparing to a "gold standard."  
I'd reserve that term for human-adjudicated "best effort" text extraction; i.e., 
we'd compare the output with a Save As or some other human-generated or 
human-judged version of the best possible expected text extraction.  But your point about 
testing SNAPSHOT to see if there is a regression is exactly the goal. See (and 
contribute/respond to?) TIKA-1332 for a discussion of possible 
statistics/outputs.

The idea is to collect statistics about an individual run, or (as I did here), 
compare two runs and look for differences.
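
As a rough illustration of the two-run comparison, here is a minimal sketch. 
It assumes each run has already been reduced to a mapping from file name to 
extracted text. The names compare_runs and token_counts, and the choice of 
statistics (length delta, token overlap), are hypothetical and only for 
illustration; they are not part of Tika or PDFBox, and TIKA-1332 is where the 
real list of statistics is being discussed.

```python
from collections import Counter


def token_counts(text):
    """Count whitespace-delimited tokens, ignoring case."""
    return Counter(text.lower().split())


def compare_runs(run_a, run_b):
    """Compare two runs, each a dict of file name -> extracted text.

    Returns a dict of file name -> stats, covering only the files whose
    output differs between the runs or is missing from one of them.
    """
    report = {}
    for name in sorted(set(run_a) | set(run_b)):
        text_a = run_a.get(name)
        text_b = run_b.get(name)
        if text_a is None or text_b is None:
            # Only one run produced output (e.g. the other hit an exception).
            status = "missing_in_a" if text_a is None else "missing_in_b"
            report[name] = {"status": status}
            continue
        if text_a == text_b:
            continue  # identical output, nothing to report
        counts_a, counts_b = token_counts(text_a), token_counts(text_b)
        shared = sum((counts_a & counts_b).values())  # multiset intersection
        total = sum((counts_a | counts_b).values())   # multiset union
        report[name] = {
            "status": "changed",
            "length_delta": len(text_b) - len(text_a),
            "token_overlap": shared / total if total else 1.0,
        }
    return report
```

A low token_overlap or a large negative length_delta would flag a file worth 
opening by hand; files with identical output drop out of the report entirely.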

At the Tika level, we'd probably want to keep comparing to the last two or three 
releases at least to start; but the more interesting comparison would be with 
the last release and the last SNAPSHOT.

My proposal/invitation is either for PDFBox to run their own version of this 
process or for PDFBox and Tika to collaborate so that Tika is running the tests 
against PDFBox's SNAPSHOT. 

This methodology might also be useful to compare PDFBox 1.8.x against trunk or 
to compare PDFBox to other PDF extractors (heresy, my apologies!).

There will _always_ be room for improvement in the corpus, and we will always 
be on the lookout to improve automated indicators of type 2 and type 3 problems 
(see TIKA-1332).

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
