[Nutch-general] not indexing path names

jay jiang Thu, 16 Feb 2006 11:09:00 -0800

I am crawling an intranet. Apparently Nutch also indexes the URL path names (as documents) as it crawls. So if a query word appears in a path name, the entire URL path name shows up as a result. Since this kind of information is typically of no value to users, I want to filter these results out.

I think we have to crawl the paths, since we need them to reach the actual document URLs underneath. But we do not want to index them. Is there any way to configure Nutch not to index path names during the crawling step? If not, can we configure it in the search step? I know we can always filter the hits using getDetails(), but that does not seem like a very clean way.
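
To make the search-step fallback concrete, here is a rough sketch of the kind of filtering I mean. It is plain Java and does not call any Nutch API: it assumes the hit URLs have already been pulled out as strings (for example from the hit details), and it assumes that a "path name" result is one whose URL path is empty or ends with "/". The class and method names are made up for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch.
public class PathHitFilter {

    // Assumption: a URL whose path is empty or ends with "/" is a
    // directory-style path entry rather than an actual document.
    static boolean looksLikeDirectory(String url) {
        try {
            String path = URI.create(url).getPath();
            return path == null || path.isEmpty() || path.endsWith("/");
        } catch (IllegalArgumentException e) {
            return false; // keep URLs that cannot be parsed
        }
    }

    // Returns only the hits that do not look like directory paths,
    // preserving their original order.
    static List<String> dropDirectoryHits(List<String> hitUrls) {
        List<String> kept = new ArrayList<>();
        for (String url : hitUrls) {
            if (!looksLikeDirectory(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

The trailing-slash test is only a heuristic; a server that serves directory listings without a trailing slash would need a different check.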


Thanks,
--Jay


