[
https://issues.apache.org/jira/browse/TIKA-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063617#comment-14063617
]
Matthias Krueger commented on TIKA-1365:
----------------------------------------
{code}
System.out.println(new DefaultDetector().detect(
    new BufferedInputStream(new FileInputStream("discussion.html")),
    new Metadata()));
{code}
with the attached file returns
{code}
application/xml
{code}
which is expected considering that the file starts with a comment and the
current magic detection for application/xml includes
{code}
<match value="<!--" type="string" offset="0"/>
{code}
with priority 50 while the next best matching magic configured for text/html is
{code}
<match value="<head" type="string" offset="0:64"/>
{code}
with priority 40.
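To make the priority effect concrete, here is a minimal, self-contained sketch. It is not Tika's implementation: the class MagicPrioritySketch, its Rule type, and the sample input are hypothetical, and it assumes that an offset such as "0:64" means the match may start anywhere from byte 0 to byte 64. Only the two rules quoted above are modelled.

```java
import java.nio.charset.StandardCharsets;

// Simplified, hypothetical model of priority-ordered magic matching.
// It only illustrates the two rules quoted above; it is not Tika's code.
class MagicPrioritySketch {

    // One magic rule: a literal string whose start offset must lie in [from, to].
    static final class Rule {
        final String type;
        final int priority;
        final String value;
        final int from;
        final int to;

        Rule(String type, int priority, String value, int from, int to) {
            this.type = type;
            this.priority = priority;
            this.value = value;
            this.from = from;
            this.to = to;
        }

        // True if the literal starts at any allowed offset of the given prefix.
        boolean matches(String head) {
            for (int i = from; i <= to; i++) {
                if (head.startsWith(value, i)) {
                    return true;
                }
            }
            return false;
        }
    }

    static final Rule[] RULES = {
        new Rule("application/xml", 50, "<!--", 0, 0),
        new Rule("text/html", 40, "<head", 0, 64),
    };

    // Returns the type of the highest-priority matching rule,
    // or a generic fallback when no rule matches.
    static String detect(byte[] prefix) {
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        Rule best = null;
        for (Rule rule : RULES) {
            if (rule.matches(head) && (best == null || rule.priority > best.priority)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.type;
    }

    public static void main(String[] args) {
        // Made-up input: an HTML page that begins with a comment.
        byte[] page = "<!-- license header -->\n<html><head>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(page));
    }
}
```

Both rules match the sample input, so the rule with the higher priority decides the result. That is the same situation as with the attached file.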
For content type detection based on web requests, it might help to first check
java.net.URLConnection#getContentType() (if available) and supply its value as
Metadata.CONTENT_TYPE.
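As a rough illustration of "check the declared type first", the sketch below uses a deliberately simpler policy than passing a hint to the detector: if the server declared a type, use it, otherwise fall back to the sniffed type. The class DeclaredTypeFirst and its choose method are hypothetical names. The sketch does not model how Tika itself weighs a Metadata.CONTENT_TYPE hint against a magic match.

```java
import java.util.Locale;

// Hypothetical helper: prefer the type the server declared over the sniffed one.
class DeclaredTypeFirst {

    // declared: value of URLConnection#getContentType(), may be null.
    // sniffed:  result of content-based detection, used as the fallback.
    static String choose(String declared, String sniffed) {
        if (declared == null) {
            return sniffed;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        int semi = declared.indexOf(';');
        String base = (semi >= 0 ? declared.substring(0, semi) : declared)
                .trim()
                .toLowerCase(Locale.ROOT);
        return base.isEmpty() ? sniffed : base;
    }

    public static void main(String[] args) {
        System.out.println(choose("text/html; charset=UTF-8", "application/xml"));
        System.out.println(choose(null, "application/xml"));
    }
}
```

A caller would pass the header value from the open connection as the first argument and the detector's result as the second. Whether to trust the server's header over the content is a design choice that depends on how reliable the crawled servers are.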
> Incorrectly MimeType detection for Apache Lucene web site
> ---------------------------------------------------------
>
> Key: TIKA-1365
> URL: https://issues.apache.org/jira/browse/TIKA-1365
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.5
> Reporter: Tien Nguyen Manh
> Attachments: discussion.html
>
>
> Tika 1.5 detects many pages from the Apache Lucene web site as XML, for
> example this page:
> http://lucene.apache.org/core/discussion.html
> Here is the error log. Parsing failed because the XML parser was used:
> Apache Tika was unable to parse the document
> at http://lucene.apache.org/core/discussion.html.
> The full exception stack trace is included below:
> org.apache.tika.exception.TikaException: XML parse error
> at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
> at org.apache.tika.gui.TikaGUI.openURL(TikaGUI.java:293)
> at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:247)
> at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018)