Hi all,

After reading several materials, I found that Nutch builds its index from the parsed text. So if I don't want something to be indexed, I most likely need to remove it before the page is parsed to plain text...
Also, where is the cached page's HTML file located? Is it the original HTML from before parsing, or a separate HTML file stored somewhere else?

Thank you for any answer or discussion.

--
View this message in context: http://www.nabble.com/Where-is-the-crawled-cached-page-html--tp16048280p16048280.html
Sent from the Nutch - User mailing list archive at Nabble.com.
