I'm going to make a request in Jira now.
-AJ

--- Matt Kangas <[EMAIL PROTECTED]> wrote:
> Great! Is there a ticket in JIRA requesting this feature? If not, we
> should file one and get a few votes in favor of it. AFAIK, that's the
> process for getting new features into Nutch.
>
> On Sep 2, 2005, at 1:30 PM, AJ Chen wrote:
>
> > Matt,
> > This is great! It would be very useful to Nutch developers if your
> > code could be shared. I'm sure quite a few applications will benefit
> > from it, because it fills a gap between whole-web crawling and
> > single-site (or a handful of sites) crawling. I'll be interested
> > in adapting your plugin to Nutch conventions.
> > Thanks,
> > -AJ
> >
> > Matt Kangas wrote:
> >
> >> AJ and Earl,
> >>
> >> I've implemented URLFilters before. In fact, I have a
> >> WhitelistURLFilter that implements just what you describe: a
> >> hashtable of regex-lists. We implemented it specifically because
> >> we want to be able to crawl a large number of known-good paths
> >> through sites, including paths through CGIs. The hash is a Nutch
> >> ArrayFile, which provides low runtime overhead. We've tested it
> >> on 200+ sites so far and haven't seen any indication that it
> >> will have problems scaling further.
> >>
> >> The filter and its supporting WhitelistWriter currently rely on a
> >> few custom classes, but it should be straightforward to adapt them
> >> to Nutch naming conventions, etc. If you're interested in doing
> >> this work, I can see if it's OK to publish our code.
> >>
> >> BTW, we're currently alpha-testing the site that uses this
> >> plugin and preparing for a public beta. I'll be sure to post
> >> here when we're finally open for business. :)
> >>
> >> --Matt
> >>
> >> On Sep 2, 2005, at 11:43 AM, AJ Chen wrote:
> >>
> >>> From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler,
> >>> it seems that a new urlfilter is a good place to extend the
> >>> inclusion-regex capability. The new urlfilter is defined by the
> >>> urlfilter.class property, which gets loaded by the
> >>> URLFilterFactory. Regex is necessary because you want to include
> >>> urls matching certain patterns.
> >>>
> >>> Can anybody who has implemented a URLFilter plugin before share
> >>> some thoughts about this approach? I expect the new filter must
> >>> have all the capabilities that the current RegexURLFilter.java
> >>> has, so that it won't require changes in any other classes. The
> >>> difference is that the new filter uses a hash table to look up
> >>> the regexes for included domains (a large number!) efficiently.
> >>>
> >>> BTW, I can't find the urlfilter.class property in any of the
> >>> configuration files in Nutch-0.7. Does the 0.7 version still
> >>> support the urlfilter extension? Any differences relative to
> >>> what's described in the DissectingTheNutchCrawler doc cited above?
> >>>
> >>> Thanks,
> >>> AJ
> >>>
> >>> Earl Cahill wrote:
> >>>
> >>>>> The goal is to avoid entering 100,000 regexes in
> >>>>> crawl-urlfilter.txt and checking ALL these regexes for each
> >>>>> URL. Any comment?
> >>>>
> >>>> Sure seems like just some hash lookup table could
> >>>> handle it. I am having a hard time seeing when you
> >>>> really need a regex and a fixed list wouldn't do.
> >>>> Especially if you have a forward and maybe a backwards
> >>>> lookup as well in a multi-level hash, to perhaps
> >>>> include/exclude at a certain subdomain level, like
> >>>>
> >>>> include: com->site->good (for good.site.com stuff)
> >>>> exclude: com->site->bad (for bad.site.com)
> >>>>
> >>>> and kind of walk backwards, kind of like DNS. Then
> >>>> you could just do a few hash lookups instead of
> >>>> 100,000 regexes.
> >>>>
> >>>> I realize I am talking about host-level and not page-level
> >>>> filtering, but if you want to include everything from
> >>>> your 100,000 sites, I think such a strategy could work.
> >>>>
> >>>> Hope this makes sense. Maybe I could write some code
> >>>> and see if it works in practice. If nothing else,
> >>>> maybe the hash stuff could just be another filter
> >>>> option in conf/crawl-urlfilter.txt.
> >>>>
> >>>> Earl
> >>>
> >>> --
> >>> AJ (Anjun) Chen, Ph.D.
> >>> Canova Bioconsulting
> >>> Marketing * BD * Software Development
> >>> 748 Matadero Ave., Palo Alto, CA 94306, USA
> >>> Cell 650-283-4091, [EMAIL PROTECTED]
> >>> ---------------------------------------------------
> >>
> >> --
> >> Matt Kangas / [EMAIL PROTECTED]
> >
> > --
> > AJ (Anjun) Chen, Ph.D.
> > Canova Bioconsulting
> > Marketing * BD * Software Development
> > 748 Matadero Ave., Palo Alto, CA 94306, USA
> > Cell 650-283-4091, [EMAIL PROTECTED]
> > ---------------------------------------------------
>
> --
> Matt Kangas / [EMAIL PROTECTED]
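
For concreteness, here is a minimal sketch of the DNS-style lookup Earl
describes: rules are keyed by reversed host labels and probed from the most
specific prefix down, so a few hash lookups replace 100,000 regex tests.
This is not code from the thread; the HostTable name and the
include/exclude API are illustrative assumptions only.

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of Earl's backwards, DNS-like walk: store include/exclude rules
 * keyed by reversed host labels ("good.site.com" -> "com.site.good") and
 * probe from the full reversed host toward shorter prefixes, so the most
 * specific rule wins. All names here are illustrative.
 */
public class HostTable {

  private final Map<String, Boolean> rules = new HashMap<>(); // true = include

  /** Register rules by domain, e.g. include("good.site.com"). */
  public void include(String domain) { rules.put(reverse(domain), true); }
  public void exclude(String domain) { rules.put(reverse(domain), false); }

  /** Walk backwards: check the most specific prefix first. */
  public boolean accept(String host) {
    String key = reverse(host);
    while (!key.isEmpty()) {
      Boolean rule = rules.get(key);
      if (rule != null) return rule;
      int dot = key.lastIndexOf('.');
      key = (dot < 0) ? "" : key.substring(0, dot); // drop one label
    }
    return false; // unknown hosts are excluded by default
  }

  /** "good.site.com" -> "com.site.good" */
  private static String reverse(String host) {
    String[] labels = host.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = labels.length - 1; i >= 0; i--) {
      if (sb.length() > 0) sb.append('.');
      sb.append(labels[i]);
    }
    return sb.toString();
  }
}

With include("good.site.com") and exclude("bad.site.com") registered,
accept("good.site.com") succeeds on its first probe and
accept("bad.site.com") hits the exclude rule, each in a constant number of
hash lookups regardless of how many sites are in the table.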

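A rough sketch of the hashtable-of-regex-lists idea behind Matt's
WhitelistURLFilter may also help. Since the actual code was never posted,
everything below is assumed: a plain HashMap stands in for the Nutch
ArrayFile he mentions, and the filter(String) contract (return the URL to
accept it, null to reject) only mirrors the spirit of Nutch's URLFilter
plugin interface.

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Sketch of a whitelist filter built on a hashtable of regex-lists: one
 * hash probe on the host selects the few patterns that can possibly apply,
 * so only those regexes (not all 100,000) run against the URL.
 */
public class WhitelistFilterSketch {

  private final Map<String, List<Pattern>> patternsByHost = new HashMap<>();

  /** Whitelist a path pattern for one host, e.g. a known-good CGI path. */
  public void addRule(String host, String pathRegex) {
    patternsByHost.computeIfAbsent(host, h -> new ArrayList<>())
                  .add(Pattern.compile(pathRegex));
  }

  /** Accept the URL if its host has a matching pattern; otherwise reject. */
  public String filter(String urlString) {
    try {
      URL url = new URL(urlString);
      List<Pattern> patterns = patternsByHost.get(url.getHost());
      if (patterns == null) return null; // host not whitelisted at all
      String pathAndQuery = url.getPath()
          + (url.getQuery() != null ? "?" + url.getQuery() : "");
      for (Pattern p : patterns) {
        if (p.matcher(pathAndQuery).find()) return urlString;
      }
      return null; // host known, but no path pattern matched
    } catch (MalformedURLException e) {
      return null; // unparseable URLs are rejected
    }
  }
}

The design point both sketches share is the one AJ is after: per-URL cost
stays roughly constant no matter how many of the 100,000 sites are
whitelisted, because the regex set consulted for any URL is bounded by the
rules registered for that one host.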