[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1575:
------------------------------
    Attachment: reports_1_8_9_multithread_vs_single.zip

I ran 1.8.9 single-threaded and compared the output with the multithreaded 
1.8.9 run; same tika-app.jar, same OS.

If you look at the content diffs, 005937 and 524276 are flagged (again).  

But what's really weird is that the language id differs for 491 files.  
Language id works on the full extracted string, whereas my content diff code 
works only on the tokens identified by Lucene's StandardAnalyzer.  This 
suggests that a fairly large difference in the non-word characters, which the 
token-based diff cannot see, is causing language id to differ.
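
To illustrate how two extractions can pass a token-level diff while still 
differing as full strings, here is a minimal sketch.  It is not the actual 
comparison code: the regex tokenizer is only a crude stand-in for Lucene's 
StandardAnalyzer, and the two sample strings and the helper names 
(word_tokens, non_word_profile) are made up for the example.

```python
import re
from collections import Counter


def word_tokens(text):
    """Hypothetical stand-in for a word tokenizer: lowercase runs of word chars."""
    return re.findall(r"\w+", text.lower())


def non_word_profile(text):
    """Count every character that is not a word character (spaces, punctuation)."""
    return Counter(re.findall(r"\W", text))


# Made-up sample extractions: same words, different spacing and punctuation.
single = "The quick brown fox jumps over the lazy dog."
multi = "The  quick\u00a0brown ... fox -- jumps over the lazy dog."

# A token-level diff sees no difference between the two strings...
same_tokens = word_tokens(single) == word_tokens(multi)

# ...but anything that consumes the full string receives different input.
same_string = single == multi

# Non-word characters present in one extraction but not in the other.
extra = non_word_profile(multi) - non_word_profile(single)

print("tokens equal:", same_tokens)
print("strings equal:", same_string)
print("extra non-word chars:", dict(extra))
```

Comparing non-word character counts per file in this way would be one 
possible way to check whether the 491 lang id differences line up with 
differences in whitespace or punctuation.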

Fortunately, all else remains the same: number of attachments, number of 
metadata values, number of exceptions.

> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
>                 Key: TIKA-1575
>                 URL: https://issues.apache.org/jira/browse/TIKA-1575
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
> 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx, 
> reports_1_8_9_multithread_vs_single.zip
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
