I think files that moved from text/plain to pdf should also be checked by
hand, since we have quite new low-priority magic for PDFs ("%PDF-1." and
"%PDF-2." in the first 0.5kB of the stream).

--
Best regards,
Konstantin Gribov

Fri, 5 June 2015 at 13:16, Nick Burch <[email protected]>:

> On Fri, 5 Jun 2015, Allison, Timothy B. wrote:
> > Changes in mime detection for "main" files:
> >
> > text/plain; charset=ISO-8859-1->application/x-bibtex-text-file
> > text/plain; charset=windows-1252->application/x-bibtex-text-file
> > text/html; charset=ISO-8859-1->application/x-bibtex-text-file
>
> I think these are expected and good
>
> > text/dif+xml->application/dif+xml
>
> Expected and fine
>
> > text/plain; charset=windows-1252->application/pdf
> > text/plain; charset=windows-1255->application/pdf
>
> These are (hopefully!) PDFs with junk on the front, so good
>
> > text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1
>
> Not sure if this is correct or not, maybe double check these by hand?
>
>
> > It looks like the change of the magic range for pdfs was a good move
> > (for govdocs1, at least). However, we’re now losing content from those
> > files that are now identified as bibtex.
>
> The *tex formats have an application/ mimetype but no parser, so now we
> correctly detect them they stopped going through the text parser as a
> fallback. I've hopefully fixed that in r1683702, by marking their
> mimetypes as descending from text, so the text parser can claim them if
> nothing else can
>
>
> > For govdocs1, we’re now at 6,653 “caught” exceptions for container
> > documents (out of 979,143=0.7%), but we have roughly 33k exceptions for
> > embedded documents out of 1,364,552=2.4%). As before, I need to confirm
> > that something didn’t go wrong with my code; it could also be the case
> > that the files are being mis-id’d as Excel… For now, though, it looks
> > like that high # is driven by embedded Excel files.
>
> Maybe best to raise one new jira issue per main area, and upload a single
> sample file from govdocs that shows the problem, and we can tackle them in
> turn in 1.10/1.11?
>
> Nick

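To illustrate why the text/plain to pdf moves deserve a manual look, here is a minimal Python sketch of a windowed magic check. It is not Tika's detector. It assumes "first 0.5kB" means the header may start at any offset below 512 bytes, and the names `PDF_MAGICS`, `SCAN_LIMIT` and `looks_like_pdf` are invented for this example.

```python
# Toy model of an offset-range magic rule; not Tika's actual implementation.
PDF_MAGICS = (b"%PDF-1.", b"%PDF-2.")
SCAN_LIMIT = 512  # assumed reading of "first 0.5kB of stream"


def looks_like_pdf(data: bytes, scan_limit: int = SCAN_LIMIT) -> bool:
    """Return True if a PDF header starts within the first scan_limit bytes."""
    for magic in PDF_MAGICS:
        # bytes.find only reports matches that lie fully inside [0, end),
        # so extend the end to let the magic start at offset scan_limit - 1.
        end = scan_limit + len(magic) - 1
        if data.find(magic, 0, end) != -1:
            return True
    return False


# A PDF with junk on the front is caught, which is the intended effect.
junk_prefixed_pdf = b"\r\n" * 10 + b"%PDF-1.4\n"
# A plain-text file that merely mentions the header also matches.
text_mentioning_pdf = b"PDF files start with %PDF-1.4 in the header.\n"

print(looks_like_pdf(junk_prefixed_pdf))
print(looks_like_pdf(text_mentioning_pdf))
```

Both inputs match under this rule, so a plain-text file that quotes a PDF header early on is indistinguishable from a real PDF with leading junk. That is the kind of case a hand check would catch.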