We do a lot of post-processing on the text output by Nutch: extracting "aboutness," running machine learning and NLP on it, etc.

One problem we always have is that the Nutch full-text output comes from all parts of the page. For example, with a long essay or a blog post, you get the text of the post but also all the ads, navigation text, sidebar material, etc.

Has anyone dealt with this problem? Is there some heuristic I can apply somewhere in Nutch's parser to either mark or filter by the largest HTML block of text before it outputs the "content" Lucene field?
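
To make the idea concrete, here is a rough standalone sketch of the kind of "largest block" heuristic I mean. It is plain Python using only the standard library, not Nutch code, and everything in it is an assumption for illustration: the names BlockTextCollector and largest_text_block are made up, and so is the choice of which tags count as block containers. Text is credited to the innermost open container, script and style content is ignored, and the container with the most text wins.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as candidate "blocks"; tune for real pages.
BLOCK_TAGS = {"div", "td", "article", "section", "main", "body"}
SKIP_TAGS = {"script", "style"}


class BlockTextCollector(HTMLParser):
    """Hypothetical helper: collects the text directly inside each block container."""

    def __init__(self):
        super().__init__()
        self.stack = []       # one list of text chunks per currently open block
        self.blocks = []      # text of every block that has been closed
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.stack:
            self.blocks.append(" ".join(self.stack.pop()))

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.skip_depth and self.stack:
            # Credit the text to the innermost open block only.
            self.stack[-1].append(text)


def largest_text_block(html):
    """Return the text of the block container holding the most characters."""
    parser = BlockTextCollector()
    parser.feed(html)
    parser.close()
    # Also count blocks that were never closed (sloppy markup).
    blocks = parser.blocks + [" ".join(chunks) for chunks in parser.stack]
    return max(blocks, key=len, default="")


if __name__ == "__main__":
    page = (
        '<html><body><div id="nav"><a href="/">Home</a></div>'
        '<div id="post"><p>The actual essay text goes here, at some length.</p></div>'
        '<div id="ads">Buy now!</div></body></html>'
    )
    print(largest_text_block(page))
```

Picking by raw character count is the crudest possible rule; a link-density or text-to-tag ratio would probably do better on pages with long navigation menus, but the overall shape would be the same.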
