Many Chinese HTML pages use UTF-8, so your method would turn the summaries
of those pages into garbage in your search results, which looks very
ugly...

The encoding of an HTML page is detected by HtmlParser. First, HtmlParser
tries to find charset meta information in the page head. If that
information doesn't exist, HtmlParser falls back to the default encoding,
which can be set in nutch-site.xml. I suggest you don't rely on the default
encoding; just discard the pages whose encoding can't be determined.
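
To make the "meta charset or discard" idea concrete, here is a minimal
standalone sketch. It is not the HtmlParser code; the class name
MetaCharsetSniffer, the method fromMeta and the 2048-byte limit are all
made up for illustration. It looks for a charset declaration in a <meta>
tag near the top of the page and returns an empty result (meaning "discard
the page") instead of falling back to a default encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class MetaCharsetSniffer {
    // Only the start of the page is inspected; 2048 bytes is an arbitrary choice.
    private static final int SNIFF_LIMIT = 2048;

    // Matches charset=... inside a <meta> tag, for example
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:\\-]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset, or empty if none is declared or it is unknown. */
    static Optional<Charset> fromMeta(byte[] page) {
        int len = Math.min(page.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives intact.
        String head = new String(page, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat it as "not determined".
            return Optional.empty();
        }
    }
}
```

A caller would skip the page when fromMeta returns an empty Optional, and
otherwise decode the bytes with new String(page, charset).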

You can also use "jchardet" to detect the encoding of HTML pages. If the
encoding can't be determined from either the charset meta data or jchardet,
just discard the page.
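
If you want to try the "detect or discard" fallback without pulling in
jchardet first, a rough stand-in is a strict decoding trial using only the
JDK. To be clear, this is not how jchardet works (jchardet makes a
statistical guess); the sketch below only checks whether the bytes are
valid in each candidate charset. The class and method names are invented
for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: a JDK-only stand-in for a real detector such as jchardet.
class StrictCharsetGuess {
    /** True if the whole byte array decodes in the given charset without errors. */
    static boolean decodesCleanly(byte[] data, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** First candidate that decodes cleanly, or empty, meaning "discard the page". */
    static Optional<Charset> guess(byte[] data, List<Charset> candidates) {
        for (Charset cs : candidates) {
            if (decodesCleanly(data, cs)) {
                return Optional.of(cs);
            }
        }
        return Optional.empty();
    }
}
```

The order of the candidates matters: put UTF-8 first, because it rejects
most non-UTF-8 input, and leave out single-byte charsets such as
ISO-8859-1, which accept any byte sequence and would never lead to a
discard.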


On Nov 8, 2007 4:09 PM, crossafire <[EMAIL PROTECTED]> wrote:

>
> I just crawled some Chinese websites that use GB2312 as the meta charset.
> Crawling and searching work fine, but when I view the cached page, its
> encoding is wrong.
> So I looked at cached.jsp in my Tomcat and tried editing it:
>
> if (encoding != null) {
>   try {
>     content = new String(bean.getContent(details), encoding);
>   }
>   catch (UnsupportedEncodingException e) {
>     // fallback to windows-1252
>     content = new String(bean.getContent(details), "windows-1252");
>   }
> }
> else
>   content = new String(bean.getContent(details), "gb2312");
> }
>
> With that change the cached page displays fine, but it only works for
> pages that use GB2312, so it's not a good idea for me.
> I want to get the encoding of the cached page, so I tried to debug
> cached.jsp like this:
> String encoding = (String) metaData.get("CharEncodingForConversion");
> System.out.print(encoding);
> The debug output shows that encoding is null.
>
> Metadata metaData = bean.getParseData(details).getContentMeta();
> String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
> System.out.print(contentType);
>
> This only shows that contentType is text/html.
>
> I hope somebody knows how to get the encoding of the cached page.
>
> Thanks
>
>
>
> --
> View this message in context:
> http://www.nabble.com/How-can-I-know-the-Cached-Web-Charset-tf4769632.html#a13642889
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
