I think files that moved from text/plain to application/pdf should also be
checked by hand, since our low-priority magic for PDFs ("%PDF-1." and
"%PDF-2." anywhere in the first 0.5kB of the stream) is still quite new.

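For anyone checking these by hand, here is a rough sketch of the kind of
check involved, written in Python rather than as a Tika magic definition.
The names looks_like_pdf and SCAN_WINDOW are made up for illustration, the
512-byte window is my reading of "0.5kB", and magic priority is ignored
entirely, so treat it as an approximation and not as what Tika does.

```python
# Hypothetical helper, not part of Tika: flags data whose first bytes
# contain a PDF header, even if junk precedes it.
PDF_MAGICS = (b"%PDF-1.", b"%PDF-2.")
SCAN_WINDOW = 512  # assumed meaning of "first 0.5kB of stream"


def looks_like_pdf(data: bytes, window: int = SCAN_WINDOW) -> bool:
    """Return True if a PDF header starts within the first `window` bytes."""
    # Keep a few extra bytes so a header starting at the very end of the
    # window can still be matched in full.
    head = data[: window + max(len(m) for m in PDF_MAGICS) - 1]
    return any(0 <= head.find(magic) < window for magic in PDF_MAGICS)
```

Running it over the files that changed type should show quickly whether the
header really is there behind leading junk, or whether the match looks
accidental.
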
-- 
Best regards,
Konstantin Gribov

On Fri, 5 June 2015 at 13:16, Nick Burch <[email protected]> wrote:

> On Fri, 5 Jun 2015, Allison, Timothy B. wrote:
> > Changes in mime detection for "main" files:
> >
> > text/plain; charset=ISO-8859-1->application/x-bibtex-text-file
> > text/plain; charset=windows-1252->application/x-bibtex-text-file
> > text/html; charset=ISO-8859-1->application/x-bibtex-text-file
>
> I think these are expected and good
>
> > text/dif+xml->application/dif+xml
>
> Expected and fine
>
> > text/plain; charset=windows-1252->application/pdf
> > text/plain; charset=windows-1255->application/pdf
>
> These are (hopefully!) PDFs with junk on the front, so good
>
> > text/html; charset=ISO-8859-1->text/plain; charset=ISO-8859-1
>
> Not sure if this is correct or not, maybe double check these by hand?
>
>
> > It looks like the change of the magic range for pdfs was a good move
> > (for govdocs1, at least).  However, we’re now losing content from those
> > files that are now identified as bibtex.
>
> The *tex formats have an application/ mimetype but no parser, so now that
> we correctly detect them, they have stopped going through the text parser
> as a fallback. I've hopefully fixed that in r1683702 by marking their
> mimetypes as descending from text, so the text parser can claim them if
> nothing else can.
>
>
> > For govdocs1, we’re now at 6,653 “caught” exceptions for container
> > documents (out of 979,143 = 0.7%), but we have roughly 33k exceptions for
> > embedded documents (out of 1,364,552 = 2.4%).  As before, I need to confirm
> > that something didn’t go wrong with my code; it could also be the case
> > that the files are being mis-id’d as Excel… For now, though, it looks
> > like that high # is driven by embedded Excel files.
>
> Maybe best to raise one new jira issue per main area, and upload a single
> sample file from govdocs that shows the problem, and we can tackle them in
> turn in 1.10/1.11?
>
> Nick
