Not reading mime-type correctly
-------------------------------
Key: NUTCH-1041
URL: https://issues.apache.org/jira/browse/NUTCH-1041
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.4
Reporter: Markus Jelsma
Fix For: 1.4, 2.0
Another issue with mime-types and test URLs. Below are two log lines from
MimeUtil. The mime-type is still correct at the start of the
autoResolveContentType method:
{code}
Jul 11, 2011 6:46:15 PM org.apache.nutch.util.MimeUtil autoResolveContentType
INFO: Type: text/html; charset=ISO-8859-1 from: http://www.taxipoll.nl/taxipol.htm
Jul 11, 2011 6:46:15 PM org.apache.nutch.util.MimeUtil autoResolveContentType
INFO: Type: text/html from: http://archief.hoofdklassehockey.nl/hschema2009.html
{code}
Mime-type correctness has been confirmed with curl. The documents, however, do
not end up in the index with the correct mime-type. Here is the output from
IndexingFiltersChecker; ParserChecker does output the correct Content-Type.
{code}
http://www.taxipoll.nl/taxipol.htm --> taxipoll/htm
http://archief.hoofdklassehockey.nl/hschema2009.html --> tet/html
{code}
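When comparing the logged header values with what the checkers print, it may help to reduce each Content-Type value to its bare media type first, so that a trailing charset parameter does not count as a mismatch. The following is a minimal sketch for that comparison only. It is not taken from Nutch, and the class and method names (ContentTypeCheck, baseType) are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: normalizes a Content-Type value
// so that two values can be compared by media type alone.
public class ContentTypeCheck {

    // Returns the media type without parameters (e.g. charset), trimmed and
    // lower-cased. Returns null when the input is null.
    public static String baseType(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        int semi = headerValue.indexOf(';');
        String type = semi >= 0 ? headerValue.substring(0, semi) : headerValue;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The two values logged by MimeUtil in this report.
        System.out.println(baseType("text/html; charset=ISO-8859-1"));
        System.out.println(baseType("text/html"));
    }
}
```

Both logged values reduce to the same media type under this helper, so any difference in the indexed value would have to be introduced after autoResolveContentType has run.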
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira