> >>> First is to allow for cleaning up.  This consists of a new option to
> >>> "updatedb" which can scrub the database of all URLs which no longer
> >>> match URLFilter settings (regex-urlfilter.txt). This allows a change in
> >>> the urlfilter to be reflected against Nutch's current dataset, something
> >>> I think others have asked for in the past.
> >>>       
> >> Yes, this would be a welcome addition.  Note that Andrzej recently 
> >> committed a change that causes Generate to filter urls, which achieves 
> >> the same effect, but w/o removing them from the database, so they're 
> >> still consuming space & time.
> >>     
> >
> > Excellent. I'll put someone on this.

> Stefan submitted some code that from my cursory glance looks good - 
> could you please check NUTCH-226 and see if it works for you? I plan to 
> run some tests and add this (plus a companion LinkDBFilter).

I have about 100M urls to expunge and I want them completely gone.
Changing the status doesn't count from a performance perspective. Nutch
spends a significant amount of time in sorts and other logic on this
garbage during both generate and updatedb with every cycle.

Doing the actual expunging during updatedb performs better than doing it
as a separate command. As a periodic option (scrubbing content
generation or abuse sites in my case), combining it with updatedb will
reduce the IO and CPU requirements. Updatedb already reads in the DB,
cycles through every entry, sorts it, and writes it out.
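
To make the single-pass idea concrete, here is a rough sketch in Python rather than Nutch's own code. The names parse_rules, accepts, update_db and the scrub flag are all made up for illustration, and a dict stands in for the real DB. The rule format ('+' or '-' followed by a regex, first match decides, no match means drop) is only my reading of how regex-urlfilter.txt behaves. The point is that the filter runs inside the pass updatedb already makes, so rejected URLs never get written back out.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for keep, pattern in rules:
        if pattern.search(url):
            return keep
    return False


def update_db(old_db, fetched, rules, scrub=False):
    """Merge fetched entries into old_db in one pass (hypothetical stand-in).

    old_db and fetched map url -> status. With scrub=True, entries the
    rules reject are left out of the result instead of carried forward.
    """
    merged = {**old_db, **fetched}  # newer status wins
    return {
        url: status
        for url, status in sorted(merged.items())
        if not scrub or accepts(rules, url)
    }
```

With scrub left off, every entry is carried forward as today. With it on, the only extra cost over a normal update is the regex matching per entry, which is the CPU-only overhead I am arguing for.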


Doing this in a separate command would kill 4 to 8 hours of otherwise
usable time. Doing it as a part of updatedb probably costs about 1 hour
of work (CPU time to apply filters only).

-- 
Rod Taylor <[EMAIL PROTECTED]>



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
