RE: [DISCUSS] 1.9 Tika release?

Allison, Timothy B. Fri, 05 Jun 2015 04:17:58 -0700

Thank you, Nick!

-----Original Message-----
From: Nick Burch [mailto:[email protected]] 
Sent: Friday, June 05, 2015 6:15 AM
To: [email protected]
Subject: RE: [DISCUSS] 1.9 Tika release?



>> text/dif+xml->application/dif+xml

>Expected and fine

Agreed on the mime type, but is there a reason we're losing text?  Or was that 
incorrect duplication earlier?


>> text/plain; charset=windows-1252->application/pdf
>> text/plain; charset=windows-1255->application/pdf

>These are (hopefully!) PDFs with junk on the front, so good

Agreed.

>> text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1

>Not sure if this is correct or not, maybe double check these by hand?


>> It looks like the change of the magic range for pdfs was a good move 
>> (for govdocs1, at least).  However, we’re now losing content from those 
>> files that are now identified as bibtex.

>The *tex formats have an application/ mimetype but no parser, so now that 
>we correctly detect them, they have stopped going through the text parser 
>as a fallback. I've hopefully fixed that in r1683702, by marking their 
>mimetypes as descending from text, so the text parser can claim them if 
>nothing else can

Thank you!
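
For anyone following along, here is a rough sketch of the fallback idea 
described above: if no parser claims a mimetype directly, walk up to its 
parent type and try again. This is not Tika's actual code or API. The 
function name, the table contents, and the parser names are all 
hypothetical and only meant to illustrate why marking a type as 
descending from text lets the text parser pick it up.

```python
# Hypothetical child -> parent mimetype map (illustrative entries only,
# not a copy of Tika's real mimetype registry).
SUPERTYPES = {
    "application/x-tex": "text/plain",
    "application/x-bibtex-text-file": "text/plain",
}

# Hypothetical mimetype -> parser name table.
PARSERS = {
    "text/plain": "TextParser",
    "application/pdf": "PdfParser",
}


def find_parser(mimetype):
    """Return the first parser that claims the type or one of its ancestors."""
    seen = set()
    current = mimetype
    # Walk up the parent chain; 'seen' guards against accidental cycles.
    while current is not None and current not in seen:
        if current in PARSERS:
            return PARSERS[current]
        seen.add(current)
        current = SUPERTYPES.get(current)
    return None  # nothing in the chain has a parser


print(find_parser("application/x-tex"))      # falls back to the text parser
print(find_parser("application/x-unknown"))  # no parent declared, no parser
```

Without the entry in the parent map, the lookup for a *tex type would 
stop at the application/ type and return nothing, which matches the lost 
content seen in the comparison.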

>> For govdocs1, we’re now at 6,653 “caught” exceptions for container 
>> documents (out of 979,143=0.7%), but we have roughly 33k exceptions for 
>> embedded documents (out of 1,364,552=2.4%).  As before, I need to confirm 
>> that something didn’t go wrong with my code; it could also be the case 
>> that the files are being mis-id’d as Excel… For now, though, it looks 
>> like that high # is driven by embedded Excel files.

>Maybe best to raise one new jira issue per main area, and upload a single 
>sample file from govdocs that shows the problem, and we can tackle them in 
>turn in 1.10/1.11?

Yes. I ran out of steam last night.
