Create your own parser.

Create a new plugin, parse-chunwei :) for example (just copy the whole
parse-html plugin and take it from there)

Subclass HtmlParser and there you have it.
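
As a rough sketch of the filtering step such a subclass could apply to the
raw page before handing it on to the normal HTML parsing: the FormStripper
class and its stripForms method below are hypothetical names, not part of
Nutch, and the regex approach is only one assumption about how to drop the
form block. How you hook it into your HtmlParser subclass depends on the
parser method signature in your Nutch version, so that part is left out.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes <form>...</form> blocks
// from raw HTML so their option values never reach the extracted text.
final class FormStripper {

    // (?is): case-insensitive, and '.' also matches line breaks.
    // The reluctant ".*?" stops at the first closing tag, so two separate
    // forms are removed individually rather than everything between them.
    private static final Pattern FORM_BLOCK =
        Pattern.compile("(?is)<form\\b[^>]*>.*?</form\\s*>");

    private FormStripper() {}

    // Returns the HTML with every form block removed; null becomes "".
    static String stripForms(String html) {
        if (html == null) {
            return "";
        }
        return FORM_BLOCK.matcher(html).replaceAll("");
    }
}
```

Calling FormStripper.stripForms(rawHtml) on the fetched page text and then
parsing the result as usual should keep the option values out of
parse.getText(). A regex is not a real HTML parser, so badly broken markup
(for example a form that is never closed) would pass through untouched.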

G.

P.S. Do not forget to replace parse-html with your new plugin in the
plugin.includes entry in nutch-site.xml.


On Tue, 2006-01-24 at 10:16 +0800, Chun Wei Ho wrote:
> Is there an easy way to filter content after fetching but before parsing?
> 
> I'm crawling a site where the information pages include a form on the
> side, and the option values of the form (which also get sucked into
> the parse.getText() value that I index as "content") are interfering
> with searches on the index. I plan to filter the content and remove
> the form html block before parsing (as per above question). Does
> anyone have another method around this?
> 
> Thanks
> CW
> 

