Hi, I am also a newbie with Nutch. Our requirement is opinion crawling, i.e. we want to crawl only certain HTML pages that contain user opinions about products, items, or movies; mainly, the page should contain opinions. Our seed list contains review web sites such as Amazon, MouthShut, and some other sites. So, while fetching HTML pages, how should I verify whether a page is an opinion page or not? Nutch experts (Nutch researchers), please share your comments and suggestions. I hope the requirement is clear.
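To make the question more concrete, here is a rough sketch of the kind of check I have in mind. It is only a keyword heuristic in plain Java and does not use any Nutch API; the class name `OpinionPageCheck`, the cue word list, and the threshold are all hypothetical and would need tuning per site. Where such a check would hook into the Nutch fetch/parse cycle is exactly what I am unsure about.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: guesses whether a page holds user opinions.
class OpinionPageCheck {
    // Assumed cue phrases; a real list would have to be tuned for each review site.
    private static final List<String> CUES = List.of(
        "review", "rating", "recommend", "i bought", "stars", "pros", "cons");

    // Crudely drops script/style blocks and tags, then lowercases the remaining text.
    static String visibleText(String html) {
        return html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                   .replaceAll("(?s)<[^>]+>", " ")
                   .replaceAll("\\s+", " ")
                   .trim()
                   .toLowerCase(Locale.ROOT);
    }

    // Counts how many distinct cue phrases occur in the visible text.
    static int cueCount(String html) {
        String text = visibleText(html);
        int hits = 0;
        for (String cue : CUES) {
            if (text.contains(cue)) {
                hits++;
            }
        }
        return hits;
    }

    // Treats the page as an opinion page when at least minCues cue phrases are present.
    static boolean looksLikeOpinion(String html, int minCues) {
        return cueCount(html) >= minCues;
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Camera review</h1>"
            + "<p>I bought this last month and recommend it. Rating: 4 stars.</p>"
            + "</body></html>";
        System.out.println(looksLikeOpinion(page, 3));
    }
}
```

Substring matching like this will produce false positives (for example, a help page that merely mentions "review"), so it is meant only as a starting point for discussion, not as a working classifier.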
Thanks and regards,
Naresh
--
View this message in context: http://n3.nabble.com/how-to-parse-html-files-while-crawling-tp706816p712846.html
Sent from the Nutch - User mailing list archive at Nabble.com.