[Nutch-general] Re: not indexing path names

jay jiang Wed, 01 Mar 2006 13:01:11 -0800

Jake,

That's exactly the case. I have a workaround by using the "file://" protocol since all data is in our intranet.

Ideally, and it should not be hard to do, the index-basic plugin would be allowed not just to add more fields to a document, but also to nullify the document (i.e. not index it).


--Jay Jiang

Vanderdray, Jacob wrote:

Jay,

        Sorry, I didn't understand what you were trying to do.  I think
I get it now.  You've got directory listing turned on and you're using
that to list out the content of the site, but you don't want the
directory listings returned as search results.  Does that sound right?

        I don't know of any search filters that would do quite what
you're looking to do.  If you control the site, you might be able to
switch from using directory listings for your content to using actual
html pages.  At that point you could add robot meta tags on those pages
to follow, but not index them.

Jake.

-----Original Message-----
From: jay jiang [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 17, 2006 11:31 AM
To: [email protected]
Subject: Re: not indexing path names

Thanks, Jake. This does not work. I guess I did not describe my problem clearly. I'll try again.

My startup url is:   http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/

And here are some of the entries in the crawl log:

060217 105511 fetching
http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/
060217 105511 fetching
http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/10/
060217 105511 fetching
http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/12/
...
060217 105519 fetching
http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/TWN-2005-08-06.html
060217 105519 fetching
http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/10/TWN-2005-03-27.html
060217 105519 fetching
http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/12/TWN-2005-12-03.html

If my search query is "nerds", it will bring up those three path names as individual results as well. For example:

Index of /meta-data/0/4/The Word Nerds/2005/08 <http://pod-master-001.bbn.com/meta-data/pods/pod9/0/4/The%20Word%20Nerds/2005/08/>
... meta-data/0/4/The Word *Nerds*/2005/08 Index of /meta-data/0/4/ ...

So my question is how I can filter out those path names in the result list. I think there should be an option somewhere in the configuration file to allow NOT indexing certain files based on the URL pattern. I know we have similar options in crawl-urlfilter.txt. But in my case these directories do need to be crawled. However, the directory name should not be indexed as a single document. It's more like we'd have a file called index-urlfilter.txt.

Thanks,
--Jay
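
One way to picture the index-urlfilter.txt idea is a rule file in the same +/- regex style as crawl-urlfilter.txt, consulted at indexing time instead of fetch time. The sketch below is not a Nutch feature or API: the rule text, `parse_rules`, and `should_index` are hypothetical names, and it assumes first-match-wins semantics with unmatched URLs rejected. It only illustrates the decision, written in Python for brevity.

```python
import re

# Hypothetical rule text in the style of crawl-urlfilter.txt:
# '+' means index the URL, '-' means skip it, first matching rule wins.
INDEX_RULES = """
# skip directory listings (URLs ending in a slash)
-/$
# index everything else
+.
"""

def parse_rules(text):
    """Turn rule text into a list of (keep, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def should_index(url, rules):
    """Return True if the first matching rule says to index this URL."""
    for keep, pattern in rules:
        if pattern.search(url):
            return keep
    return False  # assumption: a URL that matches no rule is not indexed

rules = parse_rules(INDEX_RULES)
```

With these rules, the directory URLs from the crawl log could still be fetched for their links, while only the .html pages would be handed to the indexer.
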
Vanderdray, Jacob wrote:
Jay,

        The url field is handled by the query-basic filter.  There is a
setting inside conf/nutch-default.xml that controls the weighting
(boost) for that field.  You can reduce the influence of this field by
putting a new value in your conf/nutch-site.xml file.  You may even be
able to completely nullify it by setting the value to 0.0.  I've pasted
what I think you'd need to put in nutch-site.xml below.  I haven't
tested this.  Let me know how it goes if you give it a try.

Thanks,
Jake.

<property>
<name>query.url.boost</name>
<value>0.0</value>
<description> Used as a boost for url field in Lucene query.
</description>
</property>

-----Original Message-----
From: jay jiang [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 16, 2006 2:08 PM
To: [email protected]
Subject: not indexing path names

I am crawling an intranet. Apparently Nutch also indexes the URL path names (as documents) as it crawls. So if a query word appears in the path name, the entire URL path name would be one result. Since this kind of info would typically be of no value to users, I want to filter them out.

I think we have to crawl them since we need to get the actual document URLs underneath the path. But we do not want to index them. Is there any way to configure Nutch not to index path names during the crawling step? If not, can we configure it in the search step? I know we can always filter it using getDetails(). But this does not seem a very clean way.

Thanks,
--Jay
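
For comparison, the search-step fallback mentioned in the message above amounts to dropping hits after the query has run. A minimal sketch, assuming each hit is available as a dict with a "url" key (a hypothetical shape, not Nutch's actual hit or getDetails() API) and that directory listings can be recognized by a path ending in "/":

```python
from urllib.parse import urlsplit

def is_directory_url(url):
    """True when the URL path ends in '/', as the directory listings in this thread do."""
    return urlsplit(url).path.endswith("/")

def drop_directory_hits(hits):
    """Return only the hits whose URL does not look like a directory listing."""
    return [hit for hit in hits if not is_directory_url(hit["url"])]

# Hypothetical hits, using URLs from the crawl log in this thread.
hits = [
    {"url": "http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/"},
    {"url": "http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/TWN-2005-08-06.html"},
]
filtered = drop_directory_hits(hits)
```

Filtering this late means hit counts and paging are computed before the unwanted entries are removed, which is one reason keeping them out of the index is the tidier option.
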



