[ https://issues.apache.org/jira/browse/TIKA-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074872#comment-16074872 ]
Hudson commented on TIKA-2419:
------------------------------

FAILURE: Integrated in Jenkins build Tika-trunk #1308 (See [https://builds.apache.org/job/Tika-trunk/1308/])
TIKA-2419 Do all 4 html doctype varients for the same text range (nick: [https://github.com/apache/tika/commit/d98bec077bbeabe095d9200f6b729b465e51368c])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
TIKA-2419 If we detect XML but the XML is broken, try the HTML magics (nick: [https://github.com/apache/tika/commit/383015235d4fc855c16d8d65c0c3cae96488951d])
* (edit) tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java


> Try HTML mime magic on broken XML files
> ---------------------------------------
>
>                 Key: TIKA-2419
>                 URL: https://issues.apache.org/jira/browse/TIKA-2419
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.15
>            Reporter: Nick Burch
>
> As noticed from the latest Common Crawl work, some URL-hosted HTML files are
> being detected as text/plain and then specialised according to the
> programming-language file extension in their URL.
> This is caused by broken XML in the HTML, and by us having dropped the magic
> priority of HTML to 40 (below that of XML) to avoid it matching other
> HTML-containing types such as emails. Because these files have broken XML
> (e.g. an empty encoding on the xml tag), the XML root extractor doesn't run,
> and they get downmixed to text/plain and then specialised by filename.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
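
The fallback described in the issue (detect XML, find it is broken, then try the HTML magics before giving up and returning text/plain) can be sketched roughly as follows. This is a minimal illustration in Python, not the code from MimeTypes.java: the function names, the `HTML_MAGICS` list, and the 8192-byte window are all assumptions made for the example, and Tika's real magic patterns live in tika-mimetypes.xml.

```python
import xml.parsers.expat

# Hypothetical HTML magic markers, chosen for illustration only.
# Tika's actual patterns and priorities are defined in tika-mimetypes.xml.
HTML_MAGICS = (b"<!doctype html", b"<html", b"<head", b"<body", b"<title")


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if expat can parse the whole document without error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
        return True
    except xml.parsers.expat.ExpatError:
        return False


def matches_html_magic(data: bytes, window: int = 8192) -> bool:
    """Case-insensitive search for any HTML marker near the start of the data."""
    head = data[:window].lower()
    return any(magic in head for magic in HTML_MAGICS)


def detect(data: bytes) -> str:
    """Sketch of 'XML first, HTML magic as a fallback when the XML is broken'."""
    stripped = data.lstrip()
    if stripped.startswith(b"<?xml"):
        if is_well_formed_xml(stripped):
            # A real detector would go on to inspect the root element here.
            return "application/xml"
        if matches_html_magic(stripped):
            # Broken XML that still looks like HTML: do not fall to text/plain.
            return "text/html"
    elif matches_html_magic(stripped):
        return "text/html"
    return "text/plain"
```

In this sketch, a page that starts with an XML declaration but fails to parse is still reported as text/html as long as an HTML marker is present, so it never reaches the text/plain stage where filename-based specialisation would take over.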