[ 
https://issues.apache.org/jira/browse/TIKA-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063699#comment-14063699
 ] 

Matthias Krueger commented on TIKA-1365:
----------------------------------------

Some more observations:
* The lower priority of the text/html magics relative to the application/xml 
magics was introduced in 
https://github.com/apache/tika/commit/88b52d66f0f70ca1edc89dc5387c6f74306d4c99 
as part of TIKA-560. The goal was to distinguish text/html from 
application/x-foxmail, which was achieved by lowering the priority of the 
text/html magics below the default of 50.
* It would probably have been better to raise the priority of the rather 
distinctive application/x-foxmail magic above 50 instead.

I would suggest the following:
* Give the application/x-foxmail magic priority 60.
* Set the text/html magic priority back to 50.
* Split the application/xml magics into two and give the "file starts with a 
comment" magic a priority below 50, since a leading comment is not a strong 
indicator of XML.
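
To illustrate the effect of these priorities, here is a minimal Java sketch of 
priority-ordered magic matching. It is a simplified model, not Tika's actual 
detector code: the class and method names, the byte patterns, the offset 
ranges, and the value 40 used for "lower than 50" are all assumptions made for 
the example. The application/x-foxmail magic is left out because its pattern 
is not given here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Simplified model of priority-ordered magic matching; NOT Tika's real implementation. */
final class MagicPrioritySketch {

    /** One magic rule: a pattern that must start at an offset in [0, maxOffset]. */
    static final class Magic {
        final String type;
        final int priority;
        final String pattern;
        final int maxOffset;

        Magic(String type, int priority, String pattern, int maxOffset) {
            this.type = type;
            this.priority = priority;
            this.pattern = pattern;
            this.maxOffset = maxOffset;
        }

        boolean matches(String prefix) {
            int idx = prefix.indexOf(pattern);
            return idx >= 0 && idx <= maxOffset;
        }
    }

    /** Returns the type of the highest-priority magic that matches the prefix. */
    static String detect(List<Magic> magics, String prefix) {
        List<Magic> sorted = new ArrayList<>(magics);
        // Stable sort: rules with equal priority keep their declaration order.
        sorted.sort(Comparator.comparingInt((Magic m) -> m.priority).reversed());
        for (Magic m : sorted) {
            if (m.matches(prefix)) {
                return m.type;
            }
        }
        return "application/octet-stream";
    }

    /** Assumed model of the current state: text/html sits below the XML magics. */
    static List<Magic> currentRules() {
        List<Magic> rules = new ArrayList<>();
        rules.add(new Magic("application/xml", 50, "<?xml", 0));
        rules.add(new Magic("application/xml", 50, "<!--", 0));
        rules.add(new Magic("text/html", 40, "<html", 64));
        return rules;
    }

    /** Assumed model of the suggestion: text/html back at 50, comment magic lowered. */
    static List<Magic> proposedRules() {
        List<Magic> rules = new ArrayList<>();
        rules.add(new Magic("application/xml", 50, "<?xml", 0));
        rules.add(new Magic("text/html", 50, "<html", 64));
        rules.add(new Magic("application/xml", 40, "<!--", 0)); // 40 is an assumed value
        return rules;
    }

    public static void main(String[] args) {
        // An HTML page that begins with a comment, as a stand-in for discussion.html.
        String page = "<!-- license header -->\n<!DOCTYPE html>\n<html>";
        System.out.println("current:  " + detect(currentRules(), page));
        System.out.println("proposed: " + detect(proposedRules(), page));
    }
}
```

Under these assumed rules, an HTML page that starts with a comment is matched 
by the comment magic first in the current ordering and reported as 
application/xml, while the proposed ordering lets the text/html magic win. A 
document that starts with a comment and contains no HTML marker still falls 
through to application/xml.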

I can provide a patch if desired.


> Incorrectly MimeType detection for Apache Lucene web site
> ---------------------------------------------------------
>
>                 Key: TIKA-1365
>                 URL: https://issues.apache.org/jira/browse/TIKA-1365
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.5
>            Reporter: Tien Nguyen Manh
>         Attachments: discussion.html
>
>
> Tika 1.5 detects many pages from the Apache Lucene web site as XML, for 
> example this page:
> http://lucene.apache.org/core/discussion.html
> Here is the error log; parsing failed because the XML parser was used:
> Apache Tika was unable to parse the document
> at http://lucene.apache.org/core/discussion.html.
> The full exception stack trace is included below:
> org.apache.tika.exception.TikaException: XML parse error
>       at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
>       at org.apache.tika.gui.TikaGUI.openURL(TikaGUI.java:293)
>       at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:247)
>       at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
