Re: Converting HTML text in org.apache.nutch.protocol.Content to String

feng lu Wed, 14 Aug 2013 08:06:02 -0700

Hi byte,

You can use the EncodingDetector utility to detect the character encoding,
and then use TagSoup or NekoHTML to parse the HTML. See the source code of
the parse-html plugin for a full example. The relevant code looks like this:


=====================

      // Raw bytes of the fetched page
      byte[] contentInOctets = content.getContent();
      InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

      // Collect encoding clues, then pick the most likely encoding
      EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(content, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(content, defaultCharEncoding);

      metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
      metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

      // Tell the parser which encoding to use before parsing
      input.setEncoding(encoding);
      if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }
      root = parse(input);
....
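
If you only need the page as a String rather than a parsed DOM, the guessed
encoding can be passed to the String constructor directly. Below is a minimal
sketch using only the JDK. ContentDecoder and decode are hypothetical names
(they are not part of Nutch), and falling back to UTF-8 when the encoding name
is missing or unknown is my assumption, not something the plugin does.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of Nutch: turns raw page bytes into a String
// once an encoding name has been guessed.
final class ContentDecoder {

    private ContentDecoder() { }

    static String decode(byte[] bytes, String encoding) {
        Charset charset;
        try {
            // Assumption: use UTF-8 when no encoding name is available.
            charset = (encoding == null)
                    ? StandardCharsets.UTF_8
                    : Charset.forName(encoding);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Assumption: also use UTF-8 when the JVM does not know the name.
            charset = StandardCharsets.UTF_8;
        }
        return new String(bytes, charset);
    }
}
```

With the variables from the snippet above, the call would be something like
ContentDecoder.decode(contentInOctets, encoding).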

-- 
Don't Grow Old, Grow Up... :-)
