[
https://issues.apache.org/jira/browse/TIKA-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898794#comment-17898794
]
Sebastian Nagel commented on TIKA-4351:
---------------------------------------
No, I'm not working on anything. But I'm happy to help with writing regular
expressions or with testing any PR or draft against a long list of Content-Type
headers.
I'm also not sure to what extent stricter validation is possible. I
mean, all MIME types in the IANA registry follow a clear pattern. Then there are
`x-` types, but even most of these seem to follow, to a great extent, the
pattern used in the registry (except for the x- prefix). However, it's a broad
field and there are many non-standardized file formats.
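To make the "clear pattern" more concrete, here is a minimal sketch of a syntax
check based on the restricted-name grammar in RFC 6838, section 4.2 (first
character a letter or digit, then up to 126 characters from a small set). The
class and method names are made up for illustration and are not part of Tika or
Nutch. It only looks at the bare `type/subtype` string, so parameters such as
`; charset=...` would have to be stripped beforehand, and it says nothing about
whether a type is actually registered.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Tika: syntax-only check per RFC 6838, section 4.2. */
public class MimeTypeSyntaxCheck {

    // restricted-name: one letter or digit, followed by at most 126 characters
    // out of letters, digits and ! # $ & - ^ _ . +
    private static final String RESTRICTED_NAME =
            "[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]{0,126}";

    private static final Pattern TYPE_SLASH_SUBTYPE =
            Pattern.compile(RESTRICTED_NAME + "/" + RESTRICTED_NAME);

    /** Returns true if the string is a syntactically valid "type/subtype" pair. */
    public static boolean isValid(String mimeType) {
        return mimeType != null && TYPE_SLASH_SUBTYPE.matcher(mimeType).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/html", "application/vnd.ms-excel", "foo/bär", "text/-html"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

A check like this would reject non-ASCII names and names starting with a
punctuation character, but it would still accept any unregistered type that
merely looks well-formed.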
> More restrictive MIME type validation
> -------------------------------------
>
> Key: TIKA-4351
> URL: https://issues.apache.org/jira/browse/TIKA-4351
> Project: Tika
> Issue Type: Improvement
> Components: core, mime
> Affects Versions: 3.0.0
> Reporter: Sebastian Nagel
> Priority: Major
>
> Background:
> - [~tallison] started a [discussion on the Common Crawl user
> group|https://groups.google.com/g/common-crawl/c/0FANtRcJOts/m/q5KtncIcBgAJ]
> about strange and obviously erroneous "identified" MIME types in Common Crawl
> data which were identified in Nutch using Tika's magic detector. See
> [o.a.nutch.util.MimeUtil#autoResolveContentType|https://github.com/apache/nutch/blob/e1b8dbe909b0f8c181dcb5ee0e7e072f27f82cbb/src/java/org/apache/nutch/util/MimeUtil.java#L153]
> for the source code.
> - the issue is tracked on Nutch's side in NUTCH-3089
> - however, implementing a complex MIME type validation seems out of Nutch's
> scope and is probably better done and maintained by Tika
> While looking at more examples, digging deeper and trying to improve the
> detection code in Nutch, I came up with the following points regarding the
> validation of the MIME type in
> [MimeTypes#forName|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#forName(java.lang.String)].
> The method is used by both Nutch and Tika (in
> [MimeTypes#detect(...)|https://tika.apache.org/3.0.0/api/org/apache/tika/mime/MimeTypes.html#detect(java.io.InputStream,org.apache.tika.metadata.Metadata)]):
> - "forName" accepts non-ASCII Unicode characters as part of the MIME type
> ({{foo/bär}}) - not covered by [RFC
> 2045|https://datatracker.ietf.org/doc/html/rfc2045#section-5.1], which allows
> only US-ASCII characters. Of course, one might argue that the HTTP header
> parser should already filter out such headers, but ...
> - the grammar in RFC 2045 is interpreted leniently, that is, a type or subtype
> may include the allowed characters in any order
> -- (sub)types not registered at IANA are accepted even if they do not start
> with "x-" / "X-" / "x."
> -- [RFC 6838|https://datatracker.ietf.org/doc/html/rfc6838#section-4.2] is
> more restrictive, e.g.,
> --- (sub)types are required to start with a letter or number
> --- fewer non-letter/number characters are allowed
> - Nutch passes the Content-Type HTTP header value and the URL as metadata
> hints to MimeTypes.detect(inputstream, metadata). This helped to improve the
> detection especially for types which are subclasses of application/zip. At
> least in the past, this was necessary to handle various Office document
> types.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)