Hi all,

After reading through several materials, I found that Nutch builds its index
from the parsed text, so if I don't want something to be indexed, I most likely
need to remove it before the page is parsed to plain text...

Also, where is the cached HTML of a page stored? Is it the original (pre-parse)
HTML, or a separate HTML file stored somewhere else?

Thank you for any answers or discussion.
-- 
View this message in context: 
http://www.nabble.com/Where-is-the-crawled-cached-page-html--tp16048280p16048280.html
Sent from the Nutch - User mailing list archive at Nabble.com.
