[ 
https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830063#comment-17830063
 ] 

Tim Allison commented on TIKA-4218:
-----------------------------------

https://corpora.tika.apache.org/base/reports/tika-2.9.2-pre-rc1-reports.tgz

Initial negative observations that require investigation:
1) some pptx are now being identified as tika-ooxml: 
commoncrawl3_refetched/HD/HDUTGEMEAGSGCJOTXREK77GYQKM3W5H3
2) some pdfs have less text: govdocs1/876/876503.pdf (this could be Tika's 
fault, not PDFBox's -- it could also be an improvement!)
3) epub+zip have far fewer "common tokens" -- this is caused by 
EncryptedExceptions being thrown in 2.9.2: 
commoncrawl3/47/47WOSBEUHE6CRMVDFBOOHUD36FEQAZ6T
4) it looks like a number of formats are now being identified (incorrectly) as 
x-tar, leading to exceptions: 646 appledouble, 289 microsoft icon, etc. A 
small handful of files that used to be identified as mp4 are now 
being correctly handled as x-tar...
5) There are several regressions in x-xz handling: 
commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH
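
For anyone wanting to reproduce the kind of tally in item 4, here is a minimal 
sketch of counting which previously detected types moved to a given new type 
between two runs. The row layout (path, old type, new type), the function 
name mime_transitions, and the sample rows are assumptions made for 
illustration; they are not the actual layout or contents of the reports 
tarball.

```python
from collections import Counter


def mime_transitions(rows, new_type):
    """Tally the old detected types of files that newly moved to new_type.

    rows: iterable of (path, old_mime, new_mime) tuples (hypothetical layout).
    Returns a list of (old_mime, count) pairs, most frequent first.
    """
    counts = Counter()
    for _path, old_mime, new_mime in rows:
        # Only count files whose detected type actually changed to new_type.
        if new_mime == new_type and old_mime != new_type:
            counts[old_mime] += 1
    return counts.most_common()


# Made-up sample rows, not taken from the real reports.
rows = [
    ("a/1", "multipart/appledouble", "application/x-tar"),
    ("a/2", "multipart/appledouble", "application/x-tar"),
    ("a/3", "image/vnd.microsoft.icon", "application/x-tar"),
    ("a/4", "application/x-tar", "application/x-tar"),
    ("a/5", "application/pdf", "application/pdf"),
]

result = mime_transitions(rows, "application/x-tar")
for old_mime, count in result:
    print(count, old_mime)
```

Files that kept the same type in both runs are skipped, so the output lists 
only the types that changed to x-tar.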

Initial positive observations:
1) some rfc822 have less junk, especially Persian-language emails
2) some pdfs are much better
3) application/vnd.ms-htmlhelp files look to be better

> Run regression tests to support 2.9.2 release
> ---------------------------------------------
>
>                 Key: TIKA-4218
>                 URL: https://issues.apache.org/jira/browse/TIKA-4218
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
