[
https://issues.apache.org/jira/browse/TIKA-4373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17921740#comment-17921740
]
Tim Allison edited comment on TIKA-4373 at 1/28/25 1:54 PM:
------------------------------------------------------------
A couple of observations.
1) LibreOffice 24.2 is complaining about all the xlsx reports now. It is able
to repair them. This was widely discussed on the POI lists.
2) We've lost quite a few "common words" in files that used to be detected as
colon-delimited "csv" files.
3) PDF extraction has seen quite good improvements.
4) Zip extraction has improved in several handfuls of documents -- more
attachments.
5) We're getting a bunch more files identified as json.
6) A handful of new exceptions in RTF (zip bomb?!) and xps.
7) Improved text extraction in xps.
I want to manually sample some files for 2), 5) and 6) to see if these are
serious problems.
We updated commons-codec after running these regression tests. I propose that,
unless problems are identified in the report, we move forward with a
3.1.0-rc1 vote and concurrently rerun the regression tests to pick up any
surprises with the updated commons-codec.
Let me know if you find anything.
Many, many thanks again to [~msahyoun] for his ongoing support of the
regression server.
> Regression tests for 3.1.0 release
> ----------------------------------
>
> Key: TIKA-4373
> URL: https://issues.apache.org/jira/browse/TIKA-4373
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: reports_tika-3.0-vs-3.1.tgz
>
>