[jira] [Commented] (NUTCH-1041) Not reading mime-type correctly

Markus Jelsma (JIRA) Mon, 11 Jul 2011 10:13:26 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063438#comment-13063438
 ]


Markus Jelsma commented on NUTCH-1041:
--------------------------------------

More strange behaviour: Nutch trunk ParserChecker outputs the wrong Content-Type, 
tet/html; charset=iso-8859-1, while Nutch 1.4 ParserChecker is doing fine.
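
For comparing the values reported by the different checkers, a minimal sketch 
may help. It is standalone Python, not Nutch code; the function names 
media_type and same_media_type are hypothetical, and it only assumes that a 
Content-Type value is a media type optionally followed by ;-separated 
parameters. The two sample values are the ones quoted in this issue.

```python
def media_type(content_type: str) -> str:
    """Return the bare media type of a Content-Type value, lower-cased.

    Parameters such as "charset=..." after the first ';' are dropped.
    """
    return content_type.split(";", 1)[0].strip().lower()


def same_media_type(expected: str, actual: str) -> bool:
    """Compare two Content-Type values by media type only."""
    return media_type(expected) == media_type(actual)


# Values quoted in this issue.
logged = "text/html; charset=ISO-8859-1"    # MimeUtil log line
reported = "tet/html; charset=iso-8859-1"   # trunk ParserChecker output

print(media_type(logged))
print(media_type(reported))
print(same_media_type(logged, reported))
```

Comparing on the bare media type keeps the differing charset case out of the 
picture, so only the type itself (text/html versus tet/html) is checked.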


> Not reading mime-type correctly
> -------------------------------
>
>                 Key: NUTCH-1041
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1041
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>
> Another issue with mime-types and test URLs. Below are two log lines from 
> MimeUtil. The mime-type is still ok at the start of the autoResolveContentType 
> method:
> {code}
> Jul 11, 2011 6:46:15 PM org.apache.nutch.util.MimeUtil autoResolveContentType
> INFO: Type: text/html; charset=ISO-8859-1 from: 
> http://www.taxipoll.nl/taxipol.htm
> Jul 11, 2011 6:46:15 PM org.apache.nutch.util.MimeUtil autoResolveContentType
> INFO: Type: text/html from: 
> http://archief.hoofdklassehockey.nl/hschema2009.html
> {code}
> Mime-type correctness has been confirmed with curl. The documents, however, 
> do not end up in the index with the correct mime-type; here's output from 
> IndexingFiltersChecker. ParserChecker does output the correct Content-Type.
> {code}
> http://www.taxipoll.nl/taxipol.htm   -->  taxipoll/htm
> http://archief.hoofdklassehockey.nl/hschema2009.html  --> tet/html
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
