We do a lot of post-processing on the text output by Nutch: extracting
"aboutness," running machine learning and NLP on it, etc.
One problem we always run into is that Nutch's full-text output comes
from all parts of the page. For example, with a long essay or a blog
post, you get the text of the post but also all the ads, navigation
text, sidebar material, etc.
Has anyone dealt with this problem? Is there some heuristic I can
apply somewhere in Nutch's parser to either mark or keep only the
largest HTML block of text before it outputs the "content" Lucene field?
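
To make the idea concrete, here is a rough sketch of the kind of heuristic
I have in mind. It is written as standalone Python using only the standard
library, not as Nutch parser code. The names (CONTAINER_TAGS,
LargestBlockParser, largest_text_block) and the choice of which tags count
as a "block" are my own assumptions for illustration: text is credited to
the innermost open container, script/style content is skipped, and the
container with the most characters wins.

```python
from html.parser import HTMLParser

# Assumption: these tags count as "blocks". The list is illustrative, not tuned.
CONTAINER_TAGS = {"body", "div", "td", "article", "section", "main"}
# Content of these tags is never counted as page text.
SKIP_TAGS = {"script", "style"}


class LargestBlockParser(HTMLParser):
    """Collects the text of each container; text goes to the innermost one."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []      # open containers as (tag, list of text pieces)
        self.blocks = []     # normalized text of every closed container
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in CONTAINER_TAGS:
            self.stack.append((tag, []))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in CONTAINER_TAGS and any(t == tag for t, _ in self.stack):
            # Pop up to the matching tag, closing any unclosed inner containers.
            while self.stack:
                open_tag, pieces = self.stack.pop()
                self._finish(pieces)
                if open_tag == tag:
                    break

    def handle_data(self, data):
        # Text outside any container is ignored in this sketch.
        if self.skip_depth == 0 and self.stack:
            self.stack[-1][1].append(data)

    def _finish(self, pieces):
        # Collapse runs of whitespace; drop containers with no text of their own.
        text = " ".join(" ".join(pieces).split())
        if text:
            self.blocks.append(text)

    def close(self):
        super().close()
        # Flush containers the page never closed.
        while self.stack:
            self._finish(self.stack.pop()[1])


def largest_text_block(html):
    """Return the text of the container with the most characters, or ''."""
    parser = LargestBlockParser()
    parser.feed(html)
    parser.close()
    return max(parser.blocks, key=len, default="")
```

Calling largest_text_block(page_html) on a page with a navigation div, a
post div holding several paragraphs, and a short sidebar div should return
the text of the post div only. Raw character count is the crudest possible
score; I would guess something like text-to-link ratio does better on
link-heavy pages, but I have not tested that.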
- largest text block from parse tree? Brian Whitman