RE: char encoding

Fadzi Ushewokunze Fri, 30 Oct 2009 00:27:11 -0700

interesting - i will try this and let you know because it was set to
windows encoding (why on earth!?)




On Thu, 2009-10-29 at 20:35 -0400, Fuad Efendi wrote:
> > > i dont have any special requirement for any special characters, i am
> > > happy with usual utf-8
> > >
> > > any suggestion on the best way to configure this correctly; everything
> > > seems quite ok looking at the code not sure whats missing.
> 
> 
> Try setting UTF-8 in the configuration file:
> parser.character.encoding.default = UTF-8
> 
> 
> 
> > -----Original Message-----
> > From: Fuad Efendi [mailto:f...@efendi.ca]
> > Sent: October-29-09 8:19 PM
> > To: nutch-user@lucene.apache.org; fa...@butterflycluster.net
> > Subject: RE: char encoding
> > 
> > Is it "?" or "¿" (Inverted Question Mark)?
> > 
> > Because ¿ is a replacement for character codes that have no representation
> > in a specific encoding scheme; you may get it, for instance, if the byte
> > stream is UTF-8 encoded and Nutch treats it as Windows-1252. Any byte(s)
> > not having a representation in windows-1252 will be represented as "¿".
> > 
> > Nutch makes a best effort; however, it can't dedicate CPU to this the way
> > browsers can.... I agree with Ken. Browsers may fully ignore headers/meta
> > and sniff and analyze the byte array to find the correct encoding (for
> > instance, when the byte stream is UTF-8 but http/meta says windows-1252).
> > Nutch can't do that (it requires a lot of CPU).
> > 
> > Windows-1252 is the default scheme for html-parser in case Nutch can't find
> > the correct HTTP/META...
> > 
> > 
> > From HtmlParser API:
> >    * We need to do something similar to what's done by mozilla
> >    * (http://lxr.mozilla.org/seamonkey/source/parser/htmlparser/src/nsParser.cpp#1993).
> >    * See also http://www.w3.org/TR/REC-xml/#sec-guessing
> > 
> > 
> > private static String sniffCharacterEncoding(byte[] content) {...}
> > 
> > - it doesn't currently use HTTP Headers.
> > - it tries to find META tag in first 2000 bytes.
> > 
> > 
> > So, for instance, some weird sites (such as AJAX/Portals) may have a lot of
> > generated JavaScript before the META tag; 2000 bytes could be too small.
> > 
> > Then, EncodingDetector is called:
> >       detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
> > 
> > - but it doesn't make sense...
> > 
> > 
> >   public String guessEncoding(Content content, String defaultValue) {
> >     /*
> >      * This algorithm could be replaced by something more sophisticated;
> >      * ideally we would gather a bunch of data on where various clues
> >      * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each
> >      * with the correct answer, and use machine learning/some statistical
> >      * method to generate a better heuristic.
> >      */
> > 
> > 
> > 
> > TODO list... as a workaround, please check whether, for this site, the META
> > tag can be found in the first 2000 bytes...
> > 
> > 
> > 
> > -Fuad
> > http://www.linkedin.com/in/liferay
> > 
> > 
> > > -----Original Message-----
> > > From: Fadzi Ushewokunze [mailto:fa...@butterflycluster.net]
> > > Sent: October-29-09 7:05 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: char encoding
> > >
> > > hi there,
> > >
> > > i am having issues with the HTMLParser failing to detect the char
> > > encoding. so lots of non alpha-numeric chars end up as "?" ;
> > >
> > > i dont have any special requirement for any special characters, i am
> > > happy with usual utf-8
> > >
> > > any suggestion on the best way to configure this correctly; everything
> > > seems quite ok looking at the code not sure whats missing.
> > >
> > > thanks.
> > >
> > >
> > 
> > 
> 
> 
>
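
For anyone who wants to reproduce the symptom discussed in the thread, here is a minimal Java sketch of what a charset mismatch between UTF-8 and windows-1252 does to text. The class and method names are made up for illustration and are not part of Nutch, and it assumes the JDK ships the windows-1252 charset. It shows two separate effects: UTF-8 bytes read as windows-1252 come out as garbled multi-character sequences, while characters that windows-1252 cannot represent are replaced with "?" when encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
final class CharsetMismatchDemo {

    // Assumption: this charset is available in the running JDK.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    /** Encodes text as UTF-8, then decodes those bytes with a different (wrong) charset. */
    static String utf8ReadAs(String text, Charset wrong) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrong);
    }

    /**
     * Encodes text into the target charset and decodes it again.
     * String.getBytes substitutes the charset's replacement byte
     * for every character the charset cannot represent.
     */
    static String roundTrip(String text, Charset target) {
        return new String(text.getBytes(target), target);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the e.
        System.out.println(utf8ReadAs("caf\u00e9", WINDOWS_1252));
        // Two CJK characters that windows-1252 cannot represent.
        System.out.println(roundTrip("\u65e5\u672c", WINDOWS_1252));
    }
}
```

In this sketch the first call returns the string "caf" followed by U+00C3 and U+00A9 (two characters where one was intended), and the second returns "??". Which of the two effects a given crawl shows depends on where in the pipeline the wrong charset is applied.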

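The point about the META tag having to appear in the first 2000 bytes can be illustrated with a simplified sniffer. This is only a sketch under stated assumptions: it is not Nutch's actual sniffCharacterEncoding, the regex is a rough approximation of META parsing, and the class, method, and samplePage helper names are hypothetical. Only the 2000-byte window and the windows-1252 fallback come from the thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch's implementation.
final class MetaCharsetSniffer {

    // Window size mentioned in the thread.
    static final int CHUNK_SIZE = 2000;

    // Simplified pattern: a <meta ...> tag containing charset=NAME.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([a-z0-9_:.-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a META tag within the first `limit` bytes, or null. */
    static String sniff(byte[] content, int limit) {
        int length = Math.min(content.length, limit);
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup
        // survives regardless of the page's real encoding.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** Falls back to the configured default when nothing is found in the window. */
    static String sniffOrDefault(byte[] content, String defaultEncoding) {
        String sniffed = sniff(content, CHUNK_SIZE);
        return sniffed != null ? sniffed : defaultEncoding;
    }

    /** Builds a test page with `scriptBytes` bytes of script ahead of the META tag. */
    static byte[] samplePage(int scriptBytes) {
        StringBuilder sb = new StringBuilder("<html><head><script>");
        for (int i = 0; i < scriptBytes; i++) {
            sb.append('x');
        }
        sb.append("</script><meta http-equiv=\"Content-Type\" ")
          .append("content=\"text/html; charset=UTF-8\"></head></html>");
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(sniffOrDefault(samplePage(10), "windows-1252"));
        System.out.println(sniffOrDefault(samplePage(3000), "windows-1252"));
    }
}
```

With 10 bytes of script the META tag falls inside the window and the sketch reports UTF-8; with 3000 bytes of script it is pushed past the window and the fallback windows-1252 is returned, which matches the situation described for sites that emit a lot of JavaScript before the META tag.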