Is there an easy way to filter content after fetching but before parsing? I'm crawling a site where the information pages include a form on the side, and the form's option values (which also get pulled into the parse.getText() value that I index as "content") are interfering with searches on the index. I plan to filter the content and remove the form HTML block before parsing, as per the question above. Does anyone know of another way around this?
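For reference, here is a minimal sketch of the kind of filtering I have in mind. It is plain Java using a regular expression, not a Nutch API: the class name FormStripper and the method stripForms are made up for illustration, and where this would hook in between fetching and parsing is still the open question. A regex is also not a real HTML parser, so this assumes the form blocks are reasonably well-formed and not nested.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): removes <form>...</form> blocks
// from raw HTML so their option values never reach the text extractor.
final class FormStripper {

    // (?i) case-insensitive, (?s) lets '.' match newlines;
    // '.*?' is lazy, so each match ends at the nearest closing </form> tag.
    private static final Pattern FORM_BLOCK =
            Pattern.compile("(?is)<form\\b[^>]*>.*?</form\\s*>");

    private FormStripper() {}

    static String stripForms(String html) {
        if (html == null) {
            return null;
        }
        return FORM_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Info</p>"
                + "<form action=\"/s\"><select><option>Red</option></select></form>"
                + "<p>More</p>";
        // prints: <p>Info</p><p>More</p>
        System.out.println(stripForms(html));
    }
}
```

The idea would be to run the fetched page content through stripForms and hand the result to the parser, so that parse.getText() no longer contains the option values.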
Thanks,
CW

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
