Hi byte
You can use the EncodingDetector utility to detect the character encoding,
and then use TagSoup or NekoHTML to parse the HTML. Have a look at the source
code of the parse-html plugin, which contains code like this:
=====================
// Raw bytes of the fetched page
byte[] contentInOctets = content.getContent();
InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

// Collect encoding clues from the fetched content, then add the charset
// sniffed from the document itself as one more clue
EncodingDetector detector = new EncodingDetector(conf);
detector.autoDetectClues(content, true);
detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");

// Pick the best guess, falling back to the default encoding
String encoding = detector.guessEncoding(content, defaultCharEncoding);
metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);
input.setEncoding(encoding);

if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }
// parse() hands the input to TagSoup or NekoHTML
root = parse(input);
....
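
The snippet above calls sniffCharacterEncoding() without showing it. As a rough idea of what such a helper can do, here is a minimal JDK-only sketch that looks for a charset declaration in a <meta> tag near the start of the page. The class name CharsetSniffer, the method name sniff, the 2000-byte limit and the regular expression are all assumptions made for illustration; this is not the plugin's actual code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: sniffs a charset name from raw HTML bytes.
final class CharsetSniffer {
    // Only the head of the document is inspected; 2000 bytes is an assumed limit.
    private static final int CHUNK_SIZE = 2000;

    // Matches charset=... inside a <meta ...> tag. Covers both
    // <meta http-equiv="Content-Type" content="text/html; charset=X">
    // and <meta charset="X">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([a-z0-9][a-z0-9_:.\\-]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is found or the
    // JVM does not support it.
    static String sniff(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);
        // The tag itself is ASCII, so decoding the chunk as ASCII is enough;
        // other bytes become replacement characters and never match.
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return null;
        }
        String name = m.group(1);
        try {
            return Charset.isSupported(name) ? name : null;
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=ISO-8859-1\"></head></html>")
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(page));
    }
}
```

A value found this way is only one clue among several. In the snippet above it is handed to detector.addClue(..., "sniffed"), and guessEncoding() makes the final decision.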
--
Don't Grow Old, Grow Up... :-)