Tim Allison created TIKA-4337:
---------------------------------
Summary: Improvements to recent xps mods
Key: TIKA-4337
URL: https://issues.apache.org/jira/browse/TIKA-4337
Project: Tika
Issue Type: Task
Reporter: Tim Allison
Attachments: xps-reports.tgz
I pulled 249 XPS files out of the latest Common Crawl crawl and compared
3.0.1-SNAPSHOT with 3.0.0. There are some new exceptions, one NPE, and a few
NumberFormatExceptions where a comma-delimited string is parsed as if it were
a single integer.
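
To illustrate the number format problem, here is a minimal sketch of one way to parse such a value defensively. It is not the Tika XPS parser code: the class name, the parseInts helper, and the sample inputs are all hypothetical, and the actual attribute involved is only identified in the attached reports.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika.
final class CommaDelimitedInts {

    private CommaDelimitedInts() {
    }

    // Splits the raw value on commas and parses each token separately.
    // Calling Integer.parseInt on the whole string (e.g. "12,34") throws
    // NumberFormatException; parsing token by token avoids that.
    static List<Integer> parseInts(String raw) {
        List<Integer> values = new ArrayList<>();
        if (raw == null) {
            return values; // nothing to parse
        }
        for (String token : raw.split(",")) {
            String trimmed = token.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty tokens such as in "1,,2"
            }
            try {
                values.add(Integer.parseInt(trimmed));
            } catch (NumberFormatException e) {
                // Assumption: skipping a bad token is preferable to
                // failing the whole parse; a real fix might log it.
            }
        }
        return values;
    }
}
```

With this sketch, a value such as "12,34" yields the two integers 12 and 34 instead of an exception, and a malformed token is skipped rather than aborting the parse. Whether skipping is the right policy for the XPS parser is a design question for the actual fix.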
Reports are attached. See especially new_exceptions_in_b_details.xlsx and
content_diffs_no_exceptions.xlsx.
The source files are available here:
https://corpora.tika.apache.org/base/share/xps.tgz