On Fri, 5 Jun 2015, Allison, Timothy B. wrote:
Changes in mime detection for "main" files:

text/plain; charset=ISO-8859-1->application/x-bibtex-text-file
text/plain; charset=windows-1252->application/x-bibtex-text-file
text/html; charset=ISO-8859-1->application/x-bibtex-text-file

I think these are expected and good

text/dif+xml->application/dif+xml

Expected and fine

text/plain; charset=windows-1252->application/pdf
text/plain; charset=windows-1255->application/pdf

These are (hopefully!) PDFs with junk on the front, so good
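
As a rough illustration of why widening the magic range catches these files, here is a minimal sketch that looks for the PDF header anywhere in a leading window rather than only at offset 0. The function name and the 1024-byte default are hypothetical choices for the example, not the values Tika actually uses.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, max_offset: int = 1024) -> bool:
    """Return True if the PDF header starts at any offset in [0, max_offset].

    max_offset is an assumed window size for illustration only.
    """
    # Keep just enough bytes to hold a header that starts at max_offset.
    window = data[: max_offset + len(PDF_MAGIC)]
    return PDF_MAGIC in window


if __name__ == "__main__":
    # A PDF with a few junk bytes in front is still recognised.
    print(looks_like_pdf(b"junk\r\n%PDF-1.4\n..."))
    # With a strict offset-0 match the same file would be missed.
    print(looks_like_pdf(b"junk\r\n%PDF-1.4\n...", max_offset=0))
```

The trade-off is the usual one: a wider window rescues PDFs with leading junk, but raises the chance of matching a text file that merely mentions the header string.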

text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1

Not sure if this is correct or not, maybe double check these by hand?


It looks like the change of the magic range for PDFs was a good move (for govdocs1, at least). However, we’re now losing content from the files that are now identified as BibTeX.

The *tex formats have an application/ mimetype but no parser, so now that we correctly detect them, they have stopped going through the text parser as a fallback. I've hopefully fixed that in r1683702 by marking their mimetypes as descending from text, so the text parser can claim them if nothing else can.
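
To make the fallback idea concrete, here is a toy model of parser lookup that walks up a type hierarchy until some parser claims the type. It is only a sketch of the concept, not Tika's code: the dictionaries, the parser name and `find_parser` are all made up for the example.

```python
from typing import Optional

# Hypothetical hierarchy: child type -> declared parent type.
SUPERTYPES = {
    "application/x-bibtex-text-file": "text/plain",
    "text/plain": "application/octet-stream",
}

# Hypothetical parser registry: only text/plain has a parser here.
PARSERS = {"text/plain": "TextParser"}


def find_parser(mime_type: str) -> Optional[str]:
    """Return the parser for mime_type, or for its nearest ancestor that has one."""
    current: Optional[str] = mime_type
    seen = set()  # guards against an accidental cycle in the hierarchy
    while current is not None and current not in seen:
        if current in PARSERS:
            return PARSERS[current]
        seen.add(current)
        current = SUPERTYPES.get(current)
    return None


if __name__ == "__main__":
    # BibTeX has no parser of its own, so its text/plain parent claims it.
    print(find_parser("application/x-bibtex-text-file"))
    # A type with no parser anywhere in its ancestry gets nothing.
    print(find_parser("application/x-unknown"))
```

Without the first entry in the hierarchy, the BibTeX lookup would return nothing, which is the content loss described above.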


For govdocs1, we’re now at 6,653 “caught” exceptions for container documents (out of 979,143, or 0.7%), but we have roughly 33k exceptions for embedded documents (out of 1,364,552, or 2.4%). As before, I need to confirm that something didn’t go wrong with my code; it could also be the case that the files are being mis-id’d as Excel. For now, though, it looks like that high number is driven by embedded Excel files.
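
For anyone wanting to re-derive the two percentages, a quick check using the counts quoted above (33,000 stands in for "roughly 33k", so the embedded rate is approximate):

```python
def exception_rate(exceptions: int, total: int) -> float:
    """Return exceptions as a percentage of total."""
    return 100.0 * exceptions / total


# Counts taken from the message; 33_000 is the rounded "roughly 33k" figure.
container = exception_rate(6_653, 979_143)
embedded = exception_rate(33_000, 1_364_552)

if __name__ == "__main__":
    print(f"container: {container:.1f}%  embedded: {embedded:.1f}%")
```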

Maybe best to raise one new JIRA issue per main area, and upload a single sample file from govdocs that shows the problem, so we can tackle them in turn in 1.10/1.11?

Nick
