[ 
https://issues.apache.org/jira/browse/TIKA-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074872#comment-16074872
 ] 

Hudson commented on TIKA-2419:
------------------------------

FAILURE: Integrated in Jenkins build Tika-trunk #1308 (See 
[https://builds.apache.org/job/Tika-trunk/1308/])
TIKA-2419 Do all 4 html doctype varients for the same text range (nick: 
[https://github.com/apache/tika/commit/d98bec077bbeabe095d9200f6b729b465e51368c])
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
TIKA-2419 If we detect XML but the XML is broken, try the HTML magics (nick: 
[https://github.com/apache/tika/commit/383015235d4fc855c16d8d65c0c3cae96488951d])
* (edit) tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java


> Try HTML mime magic on broken XML files
> ---------------------------------------
>
>                 Key: TIKA-2419
>                 URL: https://issues.apache.org/jira/browse/TIKA-2419
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.15
>            Reporter: Nick Burch
>
> As noticed from the latest Common Crawl work, some URL-hosted HTML files are 
> being detected as text/plain and then specialised by their URL's file 
> extension to a programming-language type.
> This is caused by broken XML in the HTML, together with our having dropped 
> the magic priority of HTML to 40 (below that of XML) to avoid it matching 
> other HTML-containing types such as emails. Because these files have broken 
> XML (e.g. an empty encoding on the XML declaration), the XML root extractor 
> doesn't run, so they fall back to text/plain and are then specialised by 
> filename.
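
The flow described above, and the fallback named in the second commit message, can be modelled roughly as follows. This is a minimal, self-contained sketch and not Tika's actual code: the class BrokenXmlDetectionSketch, its methods, the 8192-byte prefix window, the substring-based HTML check and the one-entry extension table are all hypothetical stand-ins. Only the idea (XML magic wins first, root extraction fails on broken XML, then HTML magics are tried before falling back to text/plain plus filename) comes from the issue.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical, simplified model of the detection flow; not Tika's implementation. */
class BrokenXmlDetectionSketch {

    /** Returns the root element's local name, or null if parsing fails before the root. */
    static String extractRoot(byte[] data) {
        final String[] root = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(data), new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes atts) throws SAXException {
                    root[0] = local;
                    // Only the root is needed, so abort the parse here.
                    throw new SAXException("stop after root element");
                }
            });
        } catch (Exception e) {
            // Either our own stop signal or a genuine parse failure;
            // root[0] tells the two cases apart.
        }
        return root[0];
    }

    /** Crude stand-in for the HTML magics: look for an html tag or doctype in the prefix. */
    static boolean looksLikeHtml(String head) {
        String lower = head.toLowerCase(Locale.ROOT);
        return lower.contains("<html") || lower.contains("<!doctype html");
    }

    /** Illustrative filename specialisation; the single entry is an assumption. */
    static String specialiseByName(String type, String fileName) {
        if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".php")) {
            return "text/x-php";
        }
        return type;
    }

    /**
     * @param htmlFallback false models the behaviour reported in the issue,
     *                     true models the fallback to the HTML magics
     */
    static String detect(byte[] data, String fileName, boolean htmlFallback) {
        String head = new String(data, 0, Math.min(data.length, 8192),
                StandardCharsets.ISO_8859_1);
        if (head.startsWith("<?xml")) {          // XML magic outranks HTML magic
            String root = extractRoot(data);
            if (root != null) {
                return root.equals("html") ? "application/xhtml+xml" : "application/xml";
            }
            if (htmlFallback && looksLikeHtml(head)) {
                return "text/html";              // broken XML, but HTML magic matches
            }
            return specialiseByName("text/plain", fileName);
        }
        if (looksLikeHtml(head)) {
            return "text/html";
        }
        return specialiseByName("text/plain", fileName);
    }
}
```

With a file named index.php whose content starts with an XML declaration carrying an empty encoding followed by an html element, detect(..., false) ends up at the filename-specialised type, while detect(..., true) reports text/html. The sketch relies on the JDK's bundled SAX parser rejecting the empty encoding name, which is the failure mode the issue describes.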



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)