Not reading mime-type correctly
-------------------------------

                 Key: NUTCH-1041
                 URL: https://issues.apache.org/jira/browse/NUTCH-1041
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.4
            Reporter: Markus Jelsma
             Fix For: 1.4, 2.0


Another issue with mime-types and test URLs. Below are two log lines from
MimeUtil. The mime-type is still correct at the start of the
autoResolveContentType method:

{code}
Jul 11, 2011 6:46:15 PM org.apache.nutch.util.MimeUtil autoResolveContentType
INFO: Type: text/html; charset=ISO-8859-1 from: http://www.taxipoll.nl/taxipol.htm
Jul 11, 2011 6:46:15 PM org.apache.nutch.util.MimeUtil autoResolveContentType
INFO: Type: text/html from: http://archief.hoofdklassehockey.nl/hschema2009.html
{code}

Mime-type correctness has been confirmed with curl. The documents, however, do
not end up in the index with the correct mime-type. Here is the output from
IndexingFiltersChecker (ParserChecker does output the correct Content-Type):

{code}
http://www.taxipoll.nl/taxipol.htm   -->  taxipoll/htm
http://archief.hoofdklassehockey.nl/hschema2009.html  --> tet/html
{code}
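
When comparing the logged Content-Type values with what ends up in the index, it may help to reduce each value to its bare media type first, since one of the two logged values carries a charset parameter. The following is only a minimal sketch: the class ContentTypeCheck and the method baseType are hypothetical names made up for illustration, and they are not part of Nutch or MimeUtil.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
final class ContentTypeCheck {

    // Returns the media type without parameters such as "; charset=...",
    // trimmed and lower-cased. Returns null for a null input.
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The two values logged by MimeUtil in this report.
        System.out.println(baseType("text/html; charset=ISO-8859-1"));
        System.out.println(baseType("text/html"));
    }
}
```

Both logged values reduce to the same media type under this sketch, so neither matches the values shown by IndexingFiltersChecker.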

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
