[ 
https://issues.apache.org/jira/browse/TIKA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230869#comment-15230869
 ] 

Tim Allison edited comment on TIKA-1939 at 4/7/16 7:27 PM:
-----------------------------------------------------------

I ran the first pre-pre-release regression test yesterday.  Results are on our 
Rackspace [vm|http://162.242.228.174/reports].

I still need to make some modifications to the eval code and the report writer.

I've only quickly looked at the reports.  I don't think there are any big 
surprises.  It will be good to rerun with the upgraded POI.

We've had some changes in mime-detection that appear to have caused regressions:
 text/html; charset=windows-1252 -> text/plain; charset=windows-1252
and a few other files that are no longer identified as html.  If you look at the 
content diffs for those, they tend to have lots of extra markup terms (table, 
td, p, etc.), which is what you'd expect if the html is now being parsed as 
plain text.

I also noticed one case of a pict now being identified as a pdf because there 
is a pdf embedded inside the pict: govdocs1/333171.doc.  I think I remember 
that we extended the search distance for %pdf.
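
As a rough illustration of why a longer search distance could produce this kind 
of misidentification, here is a minimal sketch.  It is not Tika's detector code: 
the function name, the window sizes, and the sample bytes are all hypothetical, 
and it only assumes that detection looks for the "%PDF-" magic within some 
number of bytes from the start of the file.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, search_distance: int) -> bool:
    """Return True if the PDF magic starts at any offset in [0, search_distance].

    Hypothetical helper for illustration only; not part of Tika.
    """
    # Trim the buffer so the magic must *start* no later than search_distance.
    window = data[: search_distance + len(PDF_MAGIC)]
    return PDF_MAGIC in window


# Hypothetical container: 512 bytes of non-PDF header, then an embedded PDF.
container = b"\x00" * 512 + b"%PDF-1.4\n..."

# With a search distance of 0 the magic must sit at the very start of the file.
print(looks_like_pdf(container, 0))
# A wider window reaches the embedded PDF, so the container matches too.
print(looks_like_pdf(container, 1024))
```

Under those assumptions, the wider the window, the more likely it is that a 
container with an embedded pdf matches the pdf magic before its own type is 
considered.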

We've also had some clear improvements in mime-detection.

We're getting quite a bit more info out of ppt.
There are many more exceptions in PDFs because of all of our truncated files 
and the differences between PDFBox 2.0.0 and 1.8.11.




was (Author: [email protected]):
I ran the first pre-pre-release regression test yesterday.  Results are on our 
Rackspace [vm|http://162.242.228.174/reports].

Still need to make some mods to the eval code and the report writer.

I've only quickly looked at the reports.  I don't think there are any big 
surprises.  It will be good to rerun with the upgraded POI.

We've had some changes in mime-detection that appear to have caused regressions:
1) pict -> pdf (doesn't appear in mime-diffs...not sure why), but see 
govdocs1/333171.doc
2) text/html; charset=windows-1252 -> text/plain; charset=windows-1252

We've also had some clear improvements in mime-detection.


> Preparation for Tika 1.13 release
> ---------------------------------
>
>                 Key: TIKA-1939
>                 URL: https://issues.apache.org/jira/browse/TIKA-1939
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>             Fix For: 1.13
>
>
> Let's use this to track tasks/discussion/links for release of Tika 1.13.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
