Is there an easy way to filter content after fetching but before parsing?

I'm crawling a site where the information pages include a form on the
side, and the option values of the form (which also get pulled into
the parse.getText() value that I index as "content") are interfering
with searches on the index. I plan to filter the fetched content and
remove the form HTML block before parsing (as per the question above).
Does anyone have another way around this?
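
For illustration, the kind of pre-parse filtering I have in mind would
look roughly like the sketch below. It is plain Java, not a Nutch API:
the class and method names (FormStripper, stripForms) are made up for
this example, and where to hook it in between fetch and parse is still
the open question. It also assumes the form blocks are well-formed and
not nested, since a regex is not a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes <form>...</form> blocks
// from raw HTML so that option values never reach the parsed text.
public class FormStripper {

    // CASE_INSENSITIVE matches <FORM> as well as <form>; DOTALL lets '.'
    // span line breaks; the reluctant '.*?' stops at the first </form>.
    private static final Pattern FORM_BLOCK = Pattern.compile(
        "<form\\b[^>]*>.*?</form\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the HTML with every form block removed.
    public static String stripForms(String html) {
        return FORM_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Info text</p>"
            + "<form action=\"/search\"><select>"
            + "<option>Alabama</option></select></form>"
            + "</body></html>";
        // The form and its option values are gone; the paragraph remains.
        System.out.println(stripForms(page));
    }
}
```

The filtered string would then be handed to the parser in place of the
original fetched content.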

Thanks
CW


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
