[ 
https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542153
 ] 

david euler commented on NUTCH-540:
-----------------------------------

Hello crossany, I ran into this problem too, and finally fixed it by replacing
the following line in cached.jsp:

content = new String(bean.getContent(details));

with:
content = new String(bean.getContent(details), "UTF-8");

The error is caused by new String(byte[]). When a String is constructed from a
byte array without specifying a charset, the bytes are decoded using the
platform's default charset. On Windows XP (Chinese Edition), that default is GBK.
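
As a rough illustration of the difference, here is a minimal sketch. It is not
the actual cached.jsp code: the class name CharsetDemo, its helper methods and
the sample string are made up for the example, and StandardCharsets assumes
Java 7 or later. GBK stands in for a non-UTF-8 platform default, with
ISO-8859-1 as a fallback in case the JVM does not ship GBK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Nutch.
class CharsetDemo {

    // Stand-in for a non-UTF-8 platform default: GBK if this JVM provides it,
    // otherwise ISO-8859-1, which every JVM is required to support.
    static Charset wrongCharset() {
        return Charset.isSupported("GBK")
                ? Charset.forName("GBK")
                : StandardCharsets.ISO_8859_1;
    }

    // Decodes the bytes with an explicitly chosen charset instead of
    // relying on the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        String good = decode(utf8Bytes, StandardCharsets.UTF_8);
        String bad = decode(utf8Bytes, wrongCharset());

        System.out.println("platform default charset: " + Charset.defaultCharset());
        System.out.println("decoded as UTF-8 matches original: " + original.equals(good));
        System.out.println("decoded with wrong charset matches: " + original.equals(bad));
    }
}
```

The same bytes only round-trip when the decoding charset matches the one used
to encode them, which is why naming "UTF-8" explicitly removes the dependency
on the machine's locale.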

Hope it helps. See the JDK reference:
java.lang.String.String(byte[] bytes)

Constructs a new String by decoding the specified array of bytes using the
platform's default charset. The length of the new String is a function of the
charset, and hence may not be equal to the length of the byte array.
The behavior of this constructor when the given bytes are not valid in the
default charset is unspecified. The java.nio.charset.CharsetDecoder class
should be used when more control over the decoding process is required.
Parameters:
        bytes the bytes to be decoded into characters
Since:
        JDK1.1
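
For the CharsetDecoder route mentioned in the reference above, a minimal sketch
might look like the following. The class name StrictDecode and both method
names are hypothetical, and Java 7 or later is assumed for StandardCharsets.
The decoder is configured to report malformed input instead of silently
substituting replacement characters, so bad bytes surface as an exception.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of Nutch.
class StrictDecode {

    // Decodes bytes as UTF-8 and throws if the input is not valid UTF-8.
    static String decodeUtf8Strict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Convenience check: true if the bytes decode cleanly as UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeUtf8Strict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = {(byte) 0xC3, (byte) 0x28}; // lead byte without a continuation byte

        System.out.println("valid input accepted: " + isValidUtf8(valid));
        System.out.println("invalid input accepted: " + isValidUtf8(invalid));
    }
}
```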

> some problem about the Nutch cache
> ----------------------------------
>
>                 Key: NUTCH-540
>                 URL: https://issues.apache.org/jira/browse/NUTCH-540
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 0.9.0
>         Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
>            Reporter: crossany
>             Fix For: 0.9.0
>
>         Attachments: 1.gif, 1186733525.jpg
>
>
> I am Chinese.
> I was testing searches for Chinese words in Nutch. I installed Nutch 0.9 in
> Tomcat 5 on Linux, and the Tomcat charset is UTF-8. I used Nutch to crawl a
> Chinese website whose charset is also UTF-8. When I search for a Chinese word
> through Nutch on Tomcat, the title and description of each search result are
> displayed correctly. But when I click the cache link, the cached page is
> displayed in the wrong character encoding, even though the cached page's
> charset is also UTF-8. I found another website that uses Nutch,
> http://www.synoo.com:8080/zh/, and tried searching for a Chinese word there.
> It shows the same error.
> When I use Luke to look at the segments, the Chinese words are displayed
> correctly, so I think this may be a bug.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
