RE: [DISCUSS] 1.9 Tika release?

Nick Burch Fri, 05 Jun 2015 03:16:41 -0700

On Fri, 5 Jun 2015, Allison, Timothy B. wrote:

Changes in mime detection for "main" files:


text/plain; charset=ISO-8859-1->application/x-bibtex-text-file
text/plain; charset=windows-1252->application/x-bibtex-text-file
text/html; charset=ISO-8859-1->application/x-bibtex-text-file


I think these are expected and good

text/dif+xml->application/dif+xml


Expected and fine

text/plain; charset=windows-1252->application/pdf
text/plain; charset=windows-1255->application/pdf


These are (hopefully!) PDFs with junk on the front, so good

text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1


Not sure if this is correct or not, maybe double check these by hand?

It looks like the change of the magic range for pdfs was a good move (for govdocs1, at least). However, we’re now losing content from those files that are now identified as bibtex.

The *tex formats have an application/ mimetype but no parser, so now that we correctly detect them, they have stopped going through the text parser as a fallback. I've hopefully fixed that in r1683702, by marking their mimetypes as descending from text, so the text parser can claim them if nothing else can.
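To illustrate the fallback idea, here is a rough toy model in Python. It is not Tika's code or API: the SUPERTYPES and PARSERS tables and the pick_parser function are all made up for illustration. It only shows why declaring a type as descending from text lets the text parser pick it up when no dedicated parser exists.

```python
# Toy model only: names and tables here are hypothetical, not Tika's API.

# Each detected type points at its declared parent type (if any).
SUPERTYPES = {
    "application/x-bibtex-text-file": "text/plain",  # the "descends from text" marking
    "text/plain": "application/octet-stream",
    "application/pdf": "application/octet-stream",
}

# Types that have a parser registered directly against them.
PARSERS = {
    "text/plain": "text-parser",
    "application/pdf": "pdf-parser",
}


def pick_parser(mime_type):
    """Walk up the supertype chain until some parser claims the type."""
    current = mime_type
    while current is not None:
        if current in PARSERS:
            return PARSERS[current]
        current = SUPERTYPES.get(current)  # None once the chain runs out
    return None


if __name__ == "__main__":
    # No bibtex parser exists, so the lookup falls back to the text parser.
    print(pick_parser("application/x-bibtex-text-file"))
```

Without the first entry in SUPERTYPES, the bibtex lookup would run off the end of the chain and return None, which is the "losing content" case described above.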

For govdocs1, we’re now at 6,653 “caught” exceptions for container documents (out of 979,143=0.7%), but we have roughly 33k exceptions for embedded documents (out of 1,364,552=2.4%). As before, I need to confirm that something didn’t go wrong with my code; it could also be the case that the files are being mis-id’d as Excel… For now, though, it looks like that high # is driven by embedded Excel files.
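As a quick sanity check on those percentages, the arithmetic can be reproduced as below. The embedded exception count is only given as "roughly 33k", so 33,000 is an assumed stand-in, and the helper name is hypothetical.

```python
def exception_rate(exceptions, total):
    """Return exceptions as a percentage of total documents."""
    return 100.0 * exceptions / total


# Container documents: exact counts from the message above.
container_pct = exception_rate(6653, 979143)

# Embedded documents: 33000 is an assumption standing in for "roughly 33k".
embedded_pct = exception_rate(33000, 1364552)

if __name__ == "__main__":
    print(round(container_pct, 1), round(embedded_pct, 1))
```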

Maybe best to raise one new jira issue per main area, and upload a single sample file from govdocs that shows the problem, and we can tackle them in turn in 1.10/1.11?


Nick
