[ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046288#comment-14046288
 ] 

Tilman Hausherr commented on TIKA-1300:
---------------------------------------

I don't do much with text extraction myself, but what we could use (and sorry 
if that is what you already do) is a diff between versions, i.e. a comparison 
of the extraction results against a "current gold standard". This could 
be done _with the snapshot versions_ of PDFBox and the other components you 
use. That way you would quickly notice whether results get worse or better, 
and you would not have to wait for a release to discover a regression.
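
To make the idea concrete, here is a minimal sketch of such a comparison. It 
is not existing Tika or PDFBox code: the class GoldStandardDiff and its methods 
are hypothetical names, and it assumes the extracted text for each document is 
already available as a string (in practice it would come from running the 
extraction against a snapshot build, a step left out here). Whitespace is 
normalized so that layout-only differences are not flagged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: compares fresh extraction output with stored gold-standard text. */
final class GoldStandardDiff {

    /** Collapse whitespace runs so layout-only changes are not reported. */
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    /**
     * Returns, in sorted order, the names of documents whose current extraction
     * differs from the gold standard. A document that is missing from
     * {@code current} counts as changed.
     */
    static List<String> changedDocuments(Map<String, String> gold,
                                         Map<String, String> current) {
        List<String> changed = new ArrayList<>();
        // TreeMap gives a stable, alphabetical report order.
        for (Map.Entry<String, String> e : new TreeMap<>(gold).entrySet()) {
            String now = current.get(e.getKey());
            if (now == null || !normalize(e.getValue()).equals(normalize(now))) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        Map<String, String> gold =
                Map.of("a.pdf", "Hello world", "b.pdf", "Second page");
        Map<String, String> current =
                Map.of("a.pdf", "Hello   world\n", "b.pdf", "Secnd page");
        System.out.println("Changed since gold standard: "
                + changedDocuments(gold, current));
    }
}
```

A real check would probably report whether each change is an improvement or a 
regression, which an exact comparison like this cannot tell; it only shows 
which documents need a closer look.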

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
