[ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046728#comment-14046728
 ] 

Tilman Hausherr commented on TIKA-1300:
---------------------------------------

I had a look at most of the files. This resulted in PDFBOX-2163 (7 files, will 
be fixed in 1.8 this weekend) and PDFBOX-2167 (1 file). The rest are really 
broken, some of them so badly that even Acrobat can't open them. Many have 
incorrect xref tables. One has a broken LZW stream, so even Acrobat displays 
only part of the text. One I believe I've seen before (I think it was brought 
up by William Palmer): it has a PDF stream that two threads were writing to at 
the same time.

Yes, TIKA-1205 should be done.

> Switch default PDFBox parser to NonSequentialParser
> ---------------------------------------------------
>
>                 Key: TIKA-1300
>                 URL: https://issues.apache.org/jira/browse/TIKA-1300
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
