Thanks, Jake. This does not work. I guess I did not describe my problem clearly. I'll try again.

My startup url is:   http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/

And here are some of the entries in the crawl log:

060217 105511 fetching http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/
060217 105511 fetching http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/10/
060217 105511 fetching http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/12/
...
060217 105519 fetching http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/TWN-2005-08-06.html
060217 105519 fetching http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/10/TWN-2005-03-27.html
060217 105519 fetching http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/12/TWN-2005-12-03.html

If my search query is "nerds", it will bring up those three path names as individual results as well. For example:

Index of /meta-data/0/4/The Word Nerds/2005/08 <http://pod-master-001.bbn.com/meta-data/pods/pod9/0/4/The%20Word%20Nerds/2005/08/>
... meta-data/0/4/The Word *Nerds*/2005/08 Index of /meta-data/0/4/ ...

So my question is how I can filter those path names out of the result list. I think there should be an option somewhere in the configuration to keep certain URLs from being indexed, based on a URL pattern. I know crawl-urlfilter.txt offers similar options for crawling, but in my case these directories do need to be crawled; it is only the directory listing itself that should not be indexed as a document. What I have in mind is something like an index-urlfilter.txt file.
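To make the idea concrete, here is a minimal sketch in Python of what such an index-time filter could look like. It is not Nutch code, and index-urlfilter.txt is only my hypothetical name for a file that does not exist in Nutch as far as I know. The rules borrow the "+regex / -regex, first match wins" style of crawl-urlfilter.txt, and I am assuming that a URL ending in a slash is a directory listing.

```python
import re

# Hypothetical contents of an "index-urlfilter.txt" file, written in the
# same "+regex / -regex" style as crawl-urlfilter.txt.
INDEX_RULES = """
# do not index directory listings (assumed: URLs ending in a slash)
-/$
# index everything else
+.
"""

def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def should_index(url, rules):
    """Return the verdict of the first matching rule; reject if none match."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False
```

With these rules the directory URLs from the crawl log would still be fetched, so their links are followed, but should_index() would return False for them and True for the TWN-*.html documents underneath.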

Thanks,
--Jay

Vanderdray, Jacob wrote:

Jay,

        The url field is handled by the query-basic filter.  There is a
setting inside conf/nutch-default.xml that controls the weighting
(boost) for that field.  You can reduce the influence of this field by
putting a new value in your conf/nutch-site.xml file.  You may even be
able to completely nullify it by setting the value to 0.0.  I've pasted
what I think you'd need to put in nutch-site.xml below.  I haven't
tested this.  Let me know how it goes if you give it a try.

Thanks,
Jake.

<property>
 <name>query.url.boost</name>
 <value>0.0</value>
 <description> Used as a boost for url field in Lucene query.
 </description>
</property>

-----Original Message-----
From: jay jiang [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 16, 2006 2:08 PM
To: [email protected]
Subject: not indexing path names

I am crawling an intranet. Apparently Nutch also indexes each URL path (directory) as a document as it crawls. So if a query word appears in the path name, the directory URL itself comes back as a result. Since this kind of result is typically of no value to users, I want to filter it out. I think we have to crawl the directories, since we need them to reach the actual document URLs underneath, but we do not want to index them. Is there any way to configure Nutch not to index path names during the crawling step? If not, can we configure it in the search step? I know we can always filter the results using getDetails(), but that does not seem like a very clean approach.
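For reference, the search-time fallback I mean would look roughly like the Python sketch below. It does not use the Nutch API: the hits are plain dictionaries with a "url" key, which is a made-up structure for illustration, and it again assumes that a directory listing is any URL whose path ends in a slash.

```python
from urllib.parse import urlsplit

def is_directory_listing(url):
    """Assumption: a URL whose path ends in '/' is a directory listing."""
    return urlsplit(url).path.endswith("/")

def drop_directory_hits(hits):
    """Keep only hits that point at documents, preserving result order."""
    return [hit for hit in hits if not is_directory_listing(hit["url"])]
```

The drawback is that the hit count and paging are computed before the filter runs, which is part of why it feels unclean to me.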

Thanks,
--Jay


