crossafire wrote:
>
> I just crawled some Chinese websites that use GB2312 as the charset in
> their HTML meta tag. Crawling and searching work fine, but when I open the
> cached page, its encoding is wrong.
> So I looked at cached.jsp in my Tomcat and tried editing it like this:
>
> if (encoding != null) {
>   try {
>     content = new String(bean.getContent(details), encoding);
>   }
>   catch (UnsupportedEncodingException e) {
>     // fallback to windows-1252
>     content = new String(bean.getContent(details), "windows-1252");
>   }
> }
> else {
>   // no encoding found in the metadata: assume GB2312
>   content = new String(bean.getContent(details), "gb2312");
> }
>
> With that change the cached page displays correctly, but it only works for
> pages that use GB2312,
> so it is not a good solution for me.
> I want to get the actual encoding of the cached page,
> so I tried debugging cached.jsp like this:
> String encoding = (String) metaData.get("CharEncodingForConversion");
> System.out.print(encoding);
> It prints null for the encoding.
>
> Metadata metaData = bean.getParseData(details).getContentMeta();
> String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
> System.out.print(contentType);
>
> It only prints text/html for contentType, with no charset.
>
> I hope somebody knows how to get the encoding of the cached page.
>
> Thanks
>
Thank you.
But I need to know the HTML charset, because many Chinese websites use
GB2312 for their HTML pages.
I think I will try jchardet. Thank you very much.
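
For reference, one way to find the charset without an extra library might be to read it from the meta tag in the raw page bytes. This is only a minimal sketch using the plain JDK, not jchardet and not anything from Nutch: the class CharsetSniffer and its methods sniff and decode are hypothetical names, and it assumes the meta tag appears within the first 4096 bytes of the page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): reads the charset declared in a
// <meta> tag and falls back to a caller-supplied default.
class CharsetSniffer {
    // Matches charset=... inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name if the JVM supports it, else fallback.
    public static String sniff(byte[] raw, String fallback) {
        // Assumption: the meta tag sits in the first 4096 bytes.
        int len = Math.min(raw.length, 4096);
        // ISO-8859-1 maps every byte to one char, so the ASCII markup
        // stays readable whatever the real encoding is.
        String head = new String(raw, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name in the page: use the fallback.
            }
        }
        return fallback;
    }

    // Decodes the page with the sniffed charset (or the fallback).
    public static String decode(byte[] raw, String fallback) {
        return new String(raw, Charset.forName(sniff(raw, fallback)));
    }
}
```

In cached.jsp this could be tried as content = CharsetSniffer.decode(bean.getContent(details), "windows-1252"); when the metadata gives no encoding. Pages that declare no charset at all would still need a detector such as jchardet.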