Hi,

You'd need to filter the URLs from the segments as well before you
index. Removing the entries from the linkDB will just prevent them
from getting anchor fields - they'll still be added to the index.
Look at the class IndexerMapReduce for more details.

An option would be to add support for URLFilters in the map method so
it can determine which URLs to drop from indexing altogether. This is
pretty trivial to implement and would be a nice contribution. Feel free
to submit it to JIRA if you implement it.

HTH

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 8 February 2010 12:26, Stefano Cherchi <stefanocher...@yahoo.it> wrote:
> Is there nobody out there who can provide some kind of hint?
>
> I'm really stuck with this problem and I cannot figure out what else I can do.
>
> Thanks
>
> S
>
>
> ----- Original Message -----
>> From: Stefano Cherchi <stefanocher...@yahoo.it>
>> To: nutch-u...@lucene.apache.org
>> Sent: Thursday, 4 February 2010, 17:00:35
>> Subject: Nutch + Solr: filtering URL while indexing
>>
>> Hi everybody. I've been struggling for three days now with a quite trivial
>> problem, without finding a solution.
>>
>> I need to index a few web sites with the following structure:
>>
>> Page type 1: list of posts (http://www.website.com/list.html?page=XXx), where
>> XXx is a progressive number from 00 to 999. Each page links to the next and
>> the previous list page.
>> Page type 2: the actual post page (http://www.website.com/post--x_y_z.html),
>> where x_y_z is an arbitrary string of letters and numbers representing the
>> post title.
>> Page type 3: other content such as static pages, external links, and other
>> unwanted, useless stuff.
>>
>> I need to crawl pages of both type 1 and type 2, but I want to index only
>> type 2. Crawling type 1 pages is the only way to reach type 2, because type 2
>> pages have unpredictable URLs. So I'm performing a step-by-step indexing this
>> way:
>>
>> I set the following regular expressions in regex-urlfilter.txt
>> +^http://www.website.com/list.html[?]page[=][0-9]{2,3}$
>> +^http://www.website.com/post--
>> -.
>>
>> inject (http://www.website.com/list.html?page=00)
>>
>> then I cycle N times
>> generate
>> fetch
>> parse
>> updatedb
>>
>> and I can see that only type 1 and type 2 pages are actually crawled and
>> fetched. Great.
>>
>> Then I edit the regex-urlfilter.txt leaving only
>> +^http://www.website.com/post--
>> -.
>>
>> and perform
>> invertlinks (with filtering on)
>> solrindex
>>
>> Now I would expect all type 1 pages to be stripped out of the linkdb and only
>> type 2 pages to be added to the Solr index, but when I browse the indexed
>> documents I still find both type 1 and type 2 pages.
>>
>> Can someone please explain why?
>>
>> Thank you.
>>
>> S
>>
>> ----------------------------------
>> "Anyone proposing to run Windows on servers should be prepared to explain
>> what they know about servers that Google, Yahoo, and Amazon don't."
>> Paul Graham
>>
>>
>> "A mathematician is a device for turning coffee into theorems."
>> Paul Erdos (who obviously never met a sysadmin)
>
