[ 
https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542154
 ] 

david euler commented on NUTCH-540:
-----------------------------------

Hi Renaud Richardet, when Nutch gets a null encoding from the metadata:

String encoding = (String) metaData.get("CharEncodingForConversion"); 

it constructs the content String from the bytes using the platform default 
charset. When the server's default charset differs from the cached page's 
charset, wrongly decoded characters are displayed. In fact, in most cases we 
can find the correct charset of a web page from its meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

But I don't know why, for some pages, guessing the encoding from the meta tag 
fails even though the meta info does exist.
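
To make the fallback idea concrete, here is a minimal sketch of what I mean. It is not the actual Nutch code: the class CachedContentDecoder, its methods, and the regex are hypothetical names made up for illustration, and falling back to UTF-8 instead of the platform default is my own assumption. It uses the encoding from the metadata when it is valid, otherwise looks for a charset in the page's meta tag, and only then falls back to a fixed charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
final class CachedContentDecoder {

    // Finds charset=... inside a <meta> tag. A simple pattern, not a full HTML parser.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // How many leading bytes to scan for the meta tag (an arbitrary choice).
    private static final int SNIFF_LIMIT = 4096;

    // Picks a charset: declared encoding first, then the meta tag, then UTF-8.
    static Charset resolveCharset(String declared, byte[] content) {
        Charset cs = lookup(declared);
        if (cs != null) {
            return cs;
        }
        // Decode the head as ISO-8859-1 so every byte maps to exactly one char;
        // the ASCII meta tag is readable this way in most common encodings.
        int len = Math.min(content.length, SNIFF_LIMIT);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            cs = lookup(m.group(1));
            if (cs != null) {
                return cs;
            }
        }
        // Assumption: a fixed fallback is safer than the platform default.
        return StandardCharsets.UTF_8;
    }

    // Returns null for a missing, malformed, or unsupported charset name.
    private static Charset lookup(String name) {
        if (name == null || name.trim().isEmpty()) {
            return null;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return null;
        }
    }

    // Builds the content String with an explicit charset, never the platform default.
    static String decode(String declared, byte[] content) {
        return new String(content, resolveCharset(declared, content));
    }
}
```

With something like this, the cached page would be built with decode(encoding, bytes), where encoding is the value read from "CharEncodingForConversion" above (it may be null), so the result no longer depends on the server's default charset.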

> some problem about the Nutch cache
> ----------------------------------
>
>                 Key: NUTCH-540
>                 URL: https://issues.apache.org/jira/browse/NUTCH-540
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 0.9.0
>         Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
>            Reporter: crossany
>             Fix For: 0.9.0
>
>         Attachments: 1.gif, 1186733525.jpg
>
>
> I am Chinese.
> I am testing Chinese-word search in Nutch. I installed Nutch 0.9 in Tomcat 5 
> on Linux. The Tomcat charset is UTF-8, and I used Nutch to crawl a Chinese 
> website whose charset is also UTF-8. When I search for a Chinese word with 
> Nutch on Tomcat, the title and description of the search results are 
> displayed correctly. But when I click the cached link, the cached page is 
> displayed with the wrong charset, although I can see that the cached page's 
> charset is also UTF-8. I found a website that uses Nutch, 
> http://www.synoo.com:8080/zh/, and tested a Chinese-word search there. It 
> shows the same error.
> When I use Luke to look at the segments, it can display the Chinese words, so 
> I think this may be a bug.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
