[ http://issues.apache.org/jira/browse/NUTCH-249?page=all ]

Stefan Groschupf updated NUTCH-249:
-----------------------------------

    Attachment: blackWhiteListV2.patch

A proof of concept of black/white list filtering. I'm looking for beta testers 
and suggestions for improvement (especially suggestions for better terminology).
Such a filter mechanism can be very useful for vertical search deployments of 
Nutch with very large filter sets.

A black/white URL pattern database can be created and used to filter URLs while 
updating a crawlDb, so the crawlDb contains only URLs that pass the black/white 
list. If a URL matches a black URL prefix, it is not written to the crawlDb. 
If a URL matches a white prefix, it is written to the crawlDb. If a URL matches 
neither a white nor a black prefix, it is also not written to the crawlDb.
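
The three cases above can be sketched as a small function. This is only an 
illustration in Python, not the code from the patch: the name passes_filter and 
the example prefixes are made up, and the rule that a black match wins when a 
URL matches both lists is an assumption, since that case is not described here.

```python
def passes_filter(url, black_prefixes, white_prefixes):
    """Return True if the URL may be written to the crawlDb."""
    # Assumption: a black prefix match takes precedence over a white one.
    if any(url.startswith(prefix) for prefix in black_prefixes):
        return False
    # A URL that matches no white prefix is rejected as well.
    return any(url.startswith(prefix) for prefix in white_prefixes)


# Hypothetical prefix lists for illustration only.
black = ["http://example.com/private/"]
white = ["http://example.com/"]

for candidate in (
    "http://example.com/index.html",
    "http://example.com/private/notes.html",
    "http://other.example.org/",
):
    print(candidate, passes_filter(candidate, black, white))
```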

URL filtering happens at the host level, so a URL only needs to be checked 
against the patterns for its own host. 
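
To show why host-level grouping helps with large filter sets, here is a minimal 
Python sketch. It is not the Hadoop-based implementation from the patch: the 
names build_host_index and passes_host_filter, the in-memory dictionary, and the 
"black wins" rule are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_host_index(patterns):
    """Group (prefix, is_white) pairs by the host of the prefix."""
    index = defaultdict(list)
    for prefix, is_white in patterns:
        index[urlsplit(prefix).hostname].append((prefix, is_white))
    return index


def passes_host_filter(url, index):
    """Check a URL only against the patterns stored for its own host."""
    candidates = index.get(urlsplit(url).hostname, ())
    matched_white = False
    for prefix, is_white in candidates:
        if url.startswith(prefix):
            if not is_white:
                # Assumption: a black match always rejects the URL.
                return False
            matched_white = True
    # No white match (or no patterns for this host at all) means rejection.
    return matched_white
```

With this layout the cost of filtering one URL depends on the number of 
patterns for that URL's host, not on the size of the whole pattern set.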

Usage: 
// inject URL prefix patterns (a text file in a folder) that a URL must not match
bin/nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/projects/negativeUrls/ -black 
// inject URL prefix patterns that a URL is allowed to match
bin/nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/projects/positiveUrls/ -white 
// update a fetched segment into a database (only URLs that pass the 
// black/white filter are added to the db)
bin/nutch org.apache.nutch.crawl.bw.BWUpdateDb testCrawlDb bwdb segments/20060416181635/ 

Known Issues:
Hadoop does not allow different formats for one job, so some format 
conversion is required, and this overhead currently slows down processing. 

Any comments are welcome!

> black- white list url filtering
> -------------------------------
>
>          Key: NUTCH-249
>          URL: http://issues.apache.org/jira/browse/NUTCH-249
>      Project: Nutch
>         Type: Improvement

>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Trivial
>      Fix For: 0.8-dev
>  Attachments: blackWhiteListV2.patch
>
> Existing URL filter mechanisms need to process each URL against each filter 
> pattern. For very large filter sets this may not scale very well.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
