Thank you, Nick!

-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Friday, June 05, 2015 6:15 AM
To: [email protected]
Subject: RE: [DISCUSS] 1.9 Tika release?

>> text/dif+xml->application/dif+xml
>Expected and fine

Agreed on the mime type, but is there a reason we're losing text? Or was that incorrect duplication earlier?

>> text/plain; charset=windows-1252->application/pdf
>> text/plain; charset=windows-1255->application/pdf
>These are (hopefully!) PDFs with junk on the front, so good

Agreed.

>> text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1
>Not sure if this is correct or not, maybe double check these by hand?

>> It looks like the change of the magic range for pdfs was a good move
>> (for govdocs1, at least). However, we’re now losing content from those
>> files that are now identified as bibtex.
>The *tex formats have an application/ mimetype but no parser, so now we
>correctly detect them they stopped going through the text parser as a
>fallback. I've hopefully fixed that in r1683702, by marking their
>mimetypes as descending from text, so the text parser can claim them if
>nothing else can

Thank you!

>> For govdocs1, we’re now at 6,653 “caught” exceptions for container
>> documents (out of 979,143=0.7%), but we have roughly 33k exceptions for
>> embedded documents out of 1,364,552=2.4%). As before, I need to confirm
>> that something didn’t go wrong with my code; it could also be the case
>> that the files are being mis-id’d as Excel… For now, though, it looks
>> like that high # is driven by embedded Excel files.
>Maybe best to raise one new jira issue per main area, and upload a single
>sample file from govdocs that shows the problem, and we can tackle them in
>turn in 1.10/1.11?

Y. Ran out of steam last night.
