[ https://issues.apache.org/jira/browse/TIKA-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830063#comment-17830063 ]
Tim Allison commented on TIKA-4218:
-----------------------------------
https://corpora.tika.apache.org/base/reports/tika-2.9.2-pre-rc1-reports.tgz
Initial negative observations that require investigation:
1) some pptx are now being identified as tika-ooxml:
commoncrawl3_refetched/HD/HDUTGEMEAGSGCJOTXREK77GYQKM3W5H3
2) some pdfs have less text: govdocs1/876/876503.pdf (this could be Tika's
fault, not PDFBox's -- it could also be an improvement!)
3) epub+zip files have far fewer "common tokens" -- this is caused by
EncryptedExceptions being thrown in 2.9.2:
commoncrawl3/47/47WOSBEUHE6CRMVDFBOOHUD36FEQAZ6T
4) it looks like a number of formats are now being identified (incorrectly) as
x-tar, leading to exceptions: 646 appledouble, 289 microsoft icon, etc. There
is also a small handful of files that used to be identified as mp4 and are now
being correctly handled as x-tar.
5) There are several regressions in x-xz handling:
commoncrawl3/YE/YEPTQ2CBI7BJ26PPVBTKZIALFSUQFDZH
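For triaging the x-tar detections in 4), one rough check is whether a file
carries the "ustar" magic at byte offset 257 of its first 512-byte header
block, which is where POSIX and GNU tar headers store it. The sketch below is
a standalone helper, not Tika's or commons-compress's detection logic, and the
function name is made up for illustration. Old v7-style tar files have no
magic at all, so a negative result only flags a file for a closer look.

```python
USTAR_OFFSET = 257       # offset of the magic field in a POSIX tar header
USTAR_MAGIC = b"ustar"   # common prefix of the POSIX and GNU magic values
TAR_BLOCK_SIZE = 512     # tar headers occupy one 512-byte block


def has_ustar_magic(stream) -> bool:
    """Return True if the first header block carries the ustar magic.

    `stream` is any binary file-like object positioned at the start of
    the file. Hypothetical helper for triage only.
    """
    header = stream.read(TAR_BLOCK_SIZE)
    if len(header) < TAR_BLOCK_SIZE:
        # Too short to hold even a single tar header block.
        return False
    end = USTAR_OFFSET + len(USTAR_MAGIC)
    return header[USTAR_OFFSET:end] == USTAR_MAGIC
```

Usage would be along the lines of opening each file from the report in binary
mode ("rb") and passing the handle to has_ustar_magic(); files reported as
x-tar that return False are candidates for misidentification.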
Initial positive observations:
1) some rfc822 files have less junk, especially Persian-language emails
2) some pdfs are much better
3) application/vnd.ms-htmlhelp files look to be better
> Run regression tests to support 2.9.2 release
> ---------------------------------------------
>
> Key: TIKA-4218
> URL: https://issues.apache.org/jira/browse/TIKA-4218
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>