Create your own parser. For example, create a new plugin, parse-chunwei :) (just copy the whole parse-html plugin and take it from there),
subclass HtmlParser, and there you have it.

G.

P.S. Do not forget to replace parse-html with your new plugin in the plugin.includes entry in nutch-site.xml.

On Tue, 2006-01-24 at 10:16 +0800, Chun Wei Ho wrote:
> Is there an easy way to filter content after fetching but before parsing?
>
> I'm crawling a site where the information pages include a form on the
> side, and the option values of the form (which also get sucked into
> the parse.getText() value that I index as "content") are interfering
> with searches on the index. I plan to filter the content and remove
> the form html block before parsing (as per above question). Does
> anyone have another method around this?
>
> Thanks
> CW
>
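
For the form-removal step itself, here is a rough sketch of what the subclassed parser could run on the raw HTML before handing it to the normal parsing code. The class name FormStripper and the method stripForms are made up for illustration and are not part of Nutch; the sketch also assumes the forms are not nested and that a regex is good enough for the pages in question.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes <form>...</form> blocks
// from raw HTML so their option values never reach the parsed text.
public class FormStripper {

    // Matches one whole form block, case-insensitively and across line breaks.
    // The reluctant ".*?" stops at the first closing tag, so this assumes
    // forms are not nested.
    private static final Pattern FORM_BLOCK = Pattern.compile(
        "<form\\b[^>]*>.*?</form\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the HTML with every form block removed.
    public static String stripForms(String html) {
        return FORM_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Info text</p>"
            + "<form action=\"/search\"><select><option>Alpha</option></select></form>"
            + "</body></html>";
        // Prints: <html><body><p>Info text</p></body></html>
        System.out.println(stripForms(page));
    }
}
```

In the copied plugin you would decode the fetched bytes to a string, pass it through something like stripForms, and then let the inherited parsing logic work on the cleaned markup. How exactly that hooks into HtmlParser depends on the Nutch version you are running, so check the source of the plugin you copied.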
