largest text block from parse tree?

Brian Whitman Thu, 17 Jan 2008 10:47:34 -0800

We do a lot of post-processing of text output by nutch to get "aboutness," do machine learning & NLP on, etc.

One problem we always have is that the nutch full text output is from all parts of the page. For example a long essay or a blog post: you'll get the text of the post but also all the ads, navigation text, sidebar material, etc.

Has anyone dealt with this problem? Is there some heuristic I can apply somewhere in nutch's parser to either denote or filter by the largest html block of text before it outputs the "content" lucene field?
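
To make the "largest html block of text" idea concrete, here is a rough sketch of one way such a heuristic could look. It is not nutch code and does not use nutch's parser API: it is plain Python on the standard library's html.parser, and the names LargestBlockFinder, largest_text_block, BLOCK_TAGS and SKIP_TAGS are made up for illustration. It assumes "largest" means the most characters of text credited to the innermost enclosing container element, and it assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Hypothetical choice of container tags for this sketch; tune for real pages.
BLOCK_TAGS = {"div", "td", "article", "section", "body", "li", "blockquote"}
# Text inside these tags is never page content.
SKIP_TAGS = {"script", "style"}


class LargestBlockFinder(HTMLParser):
    """Credits each text run to the innermost open container and keeps the longest."""

    def __init__(self):
        super().__init__()
        self.stack = []       # one list of text pieces per open container
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.best = ""        # longest container text seen so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and self.stack:
            # Collapse whitespace so the length reflects visible text only.
            text = " ".join(" ".join(self.stack.pop()).split())
            if len(text) > len(self.best):
                self.best = text

    def handle_data(self, data):
        if self.skip_depth == 0 and self.stack:
            self.stack[-1].append(data)


def largest_text_block(html):
    """Return the text of the container holding the most direct text."""
    finder = LargestBlockFinder()
    finder.feed(html)
    finder.close()
    return finder.best
```

Calling largest_text_block(page_html) on a page with a navigation div, a post div and an ad div would return only the post's text, provided the post really is the longest block. Inline elements such as p or a are deliberately not containers here, so a post split into many paragraphs still counts as one block. A heuristic this simple will pick the wrong block on pages where a long comment thread or link list outweighs the article.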

