On Fri, 5 Jun 2015, Allison, Timothy B. wrote:
Changes in mime detection for "main" files:
text/plain; charset=ISO-8859-1->application/x-bibtex-text-file
text/plain; charset=windows-1252->application/x-bibtex-text-file
text/html; charset=ISO-8859-1->application/x-bibtex-text-file
I think these are expected and good
text/dif+xml->application/dif+xml
Expected and fine
text/plain; charset=windows-1252->application/pdf
text/plain; charset=windows-1255->application/pdf
These are (hopefully!) PDFs with junk on the front, so good
text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1
Not sure whether this is correct; maybe double-check these by hand?
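
For anyone following along, here is a rough sketch of why widening the magic range picks up the "PDFs with junk on the front" cases: a check pinned to offset 0 misses them, while a check over a window of offsets does not. The function name and the 1024-byte window are assumptions for illustration only, not the values Tika actually uses.

```python
PDF_MAGIC = b"%PDF-"

def looks_like_pdf(data: bytes, max_offset: int = 1024) -> bool:
    """Return True if the PDF header starts at any offset in [0, max_offset].

    max_offset is a hypothetical window size, not Tika's real magic range.
    """
    # bytes.find(sub, start, end) only reports matches that fit entirely
    # before `end`, so extend the end by the length of the magic itself.
    return data.find(PDF_MAGIC, 0, max_offset + len(PDF_MAGIC)) != -1

clean = b"%PDF-1.4\n..."
junk_prefixed = b"some stray header bytes\r\n%PDF-1.4\n..."

print(looks_like_pdf(clean, max_offset=0))          # header at offset 0 only
print(looks_like_pdf(junk_prefixed, max_offset=0))  # strict check misses it
print(looks_like_pdf(junk_prefixed))                # windowed check finds it
```

The trade-off is the usual one: a wider window catches more damaged files but also raises the chance of a false match on text that merely mentions the header string.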
It looks like the change of the magic range for pdfs was a good move
(for govdocs1, at least). However, we’re now losing content from those
files that are now identified as bibtex.
The *tex formats have an application/ mimetype but no parser, so now that
we correctly detect them, they have stopped going through the text parser
as a fallback. I've hopefully fixed that in r1683702 by marking their
mimetypes as descending from text, so the text parser can claim them if
nothing else can.
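
A minimal sketch of the fallback idea, in case it helps: if a type has no parser of its own, walk up its declared parent types until one does. The two tables and the parser names below are hypothetical stand-ins for illustration, not Tika's actual registry or API.

```python
from typing import Optional

# Hypothetical parent declarations: media type -> the type it descends from.
SUPERTYPES = {
    "application/x-bibtex-text-file": "text/plain",
}

# Hypothetical parser table: media type -> name of the parser that claims it.
PARSERS = {
    "text/plain": "TextParser",
    "application/pdf": "PDFParser",
}

def pick_parser(media_type: str) -> Optional[str]:
    """Return the parser for media_type or its nearest ancestor, else None."""
    seen = set()
    current: Optional[str] = media_type
    # Walk up the parent chain; `seen` guards against accidental cycles.
    while current is not None and current not in seen:
        if current in PARSERS:
            return PARSERS[current]
        seen.add(current)
        current = SUPERTYPES.get(current)
    return None

print(pick_parser("application/x-bibtex-text-file"))   # falls back via text/plain
print(pick_parser("application/x-no-parent-example"))  # no parser, no parent
```

Without the parent entry for the bibtex type, the lookup returns nothing, which matches the content loss described above.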
For govdocs1, we’re now at 6,653 “caught” exceptions for container
documents (out of 979,143 = 0.7%), but we have roughly 33k exceptions for
embedded documents (out of 1,364,552 = 2.4%). As before, I need to confirm
that something didn’t go wrong with my code; it could also be the case
that the files are being mis-id’d as Excel… For now, though, it looks
like that high # is driven by embedded Excel files.
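
As a quick sanity check on the two rates quoted above, the arithmetic can be reproduced as follows. The embedded count is only given as "roughly 33k", so 33,000 is an assumed stand-in and that rate is approximate.

```python
def rate_percent(exceptions: int, total: int) -> float:
    """Exception rate as a percentage, rounded to one decimal place."""
    return round(100.0 * exceptions / total, 1)

container = rate_percent(6_653, 979_143)
# "roughly 33k" in the text; 33_000 is an assumed stand-in for the real count.
embedded = rate_percent(33_000, 1_364_552)

print(f"container: {container}%  embedded: {embedded}%")
```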
Maybe best to raise one new JIRA issue per main area, and upload a single
sample file from govdocs that shows the problem, and we can tackle them in
turn in 1.10/1.11?
Nick