Re: Text extraction from HTML

Giovanni Novelli Fri, 29 Jul 2005 01:22:41 -0700

I have tried both HtmlParser v1.5 and NekoHTML. With the former, my
implementation does not work: for example, it also extracts the text
of embedded JavaScript. I followed the hint from
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/visitors/TextExtractingVisitor.html


The following is my NOT working implementation, based on HtmlParser v1.5:

import org.htmlparser.visitors.TextExtractingVisitor;
import org.htmlparser.*;
import org.htmlparser.util.*;

public class HtmlFilter {
        public static String getText(String html) {
                Parser parser = Parser.createParser(html, "UTF-8");
                TextExtractingVisitor visitor = new TextExtractingVisitor();
                try {
                        parser.visitAllNodesWith(visitor);
                } catch (ParserException e) {
                        e.printStackTrace();
                }
                String textInPage = visitor.getExtractedText();
                return textInPage;
        }
}
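
One possible workaround, given only as a rough sketch: remove the
<script> and <style> elements from the string before it is handed to
Parser.createParser. The class and method names below (ScriptStripper,
stripScriptAndStyle) are hypothetical and not part of HtmlParser, and the
approach is a plain regex pre-filter using only the JDK, not a real HTML
parser, so it may fail on unusual markup (for example a literal
"</script>" inside a JavaScript string).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlParser: drops <script> and <style>
// elements with a regex before the HTML reaches any text extractor.
final class ScriptStripper {
        // (?is): case-insensitive, and '.' also matches line breaks.
        // \1 forces the closing tag to match the opening tag name.
        private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
                        "(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

        static String stripScriptAndStyle(String html) {
                if (html == null) {
                        return "";
                }
                // Replace each removed element with a space so that the
                // surrounding words do not run together.
                return SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        }
}
```

Under these assumptions, getText above would call
Parser.createParser(ScriptStripper.stripScriptAndStyle(html), "UTF-8")
instead of passing the raw html string.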

