> >>> First is to allow for cleaning up. This consists of a new option to
> >>> "updatedb" which can scrub the database of all URLs which no longer
> >>> match URLFilter settings (regex-urlfilter.txt). This allows a change in
> >>> the urlfilter to be reflected against Nutch's current dataset, something
> >>> I think others have asked for in the past.
> >>>
> >> Yes, this would be a welcome addition. Note that Andrzej recently
> >> committed a change that causes Generate to filter urls, which achieves
> >> the same effect, but w/o removing them from the database, so they're
> >> still consuming space & time.
> >>
> >
> > Excellent. I'll put someone on this.
> Stefan submitted some code that from my cursory glance looks good -
> could you please check NUTCH-226 and see if it works for you? I plan to
> run some tests and add this (plus a companion LinkDBFilter).

I have about 100M urls to expunge and I want them completely gone.
Changing the status doesn't count from a performance perspective: Nutch
spends a significant amount of time in sorts and other logic on this
garbage during both generate and updatedb, on every cycle.

Doing the actual expunging during updatedb performs better than doing it
as a separate command. As a periodic option (scrubbing content-generation
or abuse sites, in my case), combining it with updatedb reduces the IO
and CPU requirements.

Updatedb already reads in the DB, cycles through every entry, sorts it,
and writes it out. Doing the scrub in a separate command would kill 4 to
8 hours of otherwise usable time. Doing it as part of updatedb probably
costs about 1 hour of work (CPU time to apply the filters only).

-- 
Rod Taylor <[EMAIL PROTECTED]>

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
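
For illustration only, here is a rough Python sketch of the idea above: apply the URL filter inside the pass that updatedb already makes over every entry, so that scrubbing costs only the filter CPU time and no extra read/sort/write of the DB. It is not Nutch code and not the NUTCH-226 patch. The names `parse_rules`, `accepts` and `update_pass` are made up, the rule handling (a leading `+` or `-`, first matching pattern decides, no match means reject) is my understanding of regex-urlfilter.txt, and Python's `re` stands in for Java regexes.

```python
import re
from typing import Iterable, Iterator, List, Pattern, Tuple

Rule = Tuple[bool, Pattern[str]]  # (keep?, compiled pattern)


def parse_rules(text: str) -> List[Rule]:
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules: List[Rule] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules: List[Rule], url: str) -> bool:
    """First matching rule decides; a URL matching no rule is rejected (assumed)."""
    for keep, pattern in rules:
        if pattern.search(url):
            return keep
    return False


def update_pass(entries: Iterable[Tuple[str, str]],
                rules: List[Rule],
                scrub: bool) -> Iterator[Tuple[str, str]]:
    """One pass over (url, status) entries, standing in for the updatedb loop.

    With scrub=True, rejected URLs are dropped from the output entirely
    instead of being carried along with a changed status.
    """
    for url, status in entries:
        if scrub and not accepts(rules, url):
            continue
        yield url, status


# Hypothetical rules: drop one abuse domain, keep every other http(s) URL.
RULES = parse_rules(r"""
# hypothetical abuse site
-^https?://([a-z0-9-]+\.)*spam\.example/
+^https?://
""")

ENTRIES = [
    ("http://good.example/a", "fetched"),
    ("http://www.spam.example/x", "unfetched"),
    ("ftp://files.example/y", "unfetched"),
]

kept = list(update_pass(ENTRIES, RULES, scrub=True))
```

With `scrub=False` the pass returns every entry unchanged, which is why I would make this an option on updatedb and only turn it on periodically.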
