Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Michael Ji Sun, 09 Oct 2005 05:53:57 -0700

Hi,

How does the plugin perform once the domain list
reaches 10,000 entries?


Michael Ji

--- "Gal Nitzan (JIRA)" <[EMAIL PROTECTED]> wrote:

> [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
> 
> Gal Nitzan updated NUTCH-100:
> -----------------------------
> 
>            type: Improvement  (was: New Feature)
>     Description: 
> Hi,
> 
> I have written a new plugin, based on the URLFilter
> interface: urlfilter-db .
> 
> The purpose of this plugin is to filter domains,
> i.e. I would like to crawl the world but to fetch
> only certain domains.
> 
> The plugin uses a caching system (SwarmCache, easier
> to deploy than JCS) and on the back-end a database.
> 
> For each url
>    filter is called
> end for
> 
> filter
>  get the domain name from url
>   call cache.get domain
>   if not in cache try the database
>   if in database cache it and return it
>   return null
> end filter
> 
> 
> The plugin reads the cache size, jdbc driver,
> connection string, table to use and domain field
> from nutch-site.xml
> 
> 
>   was:
> Hi,
> 
> I have written (not much) a new plugin, based on the
> URLFilter interface: urlfilter-db .
> 
> The purpose of this plugin is to filter domains,
> i.e. I would like to crawl the world but to fetch
> only certain domains.
> 
> The plugin uses a caching system (SwarmCache, easier
> to deploy than JCS) and on the back-end a database.
> 
> For each url
>    filter is called
> end for
> 
> filter
>  get the domain name from url
>   call cache.get domain
>   if not in cache try the database
>   if in database cache it and return it
>   return null
> end filter
> 
> 
> The plugin reads the cache size, jdbc driver,
> connection string, table to use and domain field
> from nutch-site.xml
> 
> 
>     Environment: All Nutch versions  (was: MapRed)
> 
> Fixed some issues
> clean up
> Added a patch for Subversion
> 
> > New plugin urlfilter-db
> > -----------------------
> >
> >          Key: NUTCH-100
> >          URL: http://issues.apache.org/jira/browse/NUTCH-100
> >      Project: Nutch
> >         Type: Improvement
> >   Components: fetcher
> >     Versions: 0.8-dev
> >  Environment: All Nutch versions
> >     Reporter: Gal Nitzan
> >     Priority: Trivial
> >  Attachments: AddedDbURLFilter.patch,
> >               urlfilter-db.tar.gz, urlfilter-db.tar.gz
> >
> > Hi,
> >
> > I have written a new plugin, based on the
> > URLFilter interface: urlfilter-db.
> >
> > The purpose of this plugin is to filter domains,
> > i.e. I would like to crawl the world but to fetch
> > only certain domains.
> >
> > The plugin uses a caching system (SwarmCache,
> > easier to deploy than JCS) and on the back-end a
> > database.
> >
> > For each url
> >    filter is called
> > end for
> >
> > filter
> >  get the domain name from url
> >   call cache.get domain
> >   if not in cache try the database
> >   if in database cache it and return it
> >   return null
> > end filter
> >
> > The plugin reads the cache size, jdbc driver,
> > connection string, table to use and domain field
> > from nutch-site.xml
> 
> 
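For reference, the filter pseudocode quoted above could be sketched
in Java roughly as follows. This is only an illustration, not the
code from the attached patch: the class name DomainFilterSketch is
made up, a size-bounded LinkedHashMap stands in for SwarmCache, and a
Predicate<String> stands in for the JDBC lookup. It also assumes the
"domain" is simply the lower-cased host part of the URL.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of a cache-then-database domain filter. */
class DomainFilterSketch {
    // Stand-in for SwarmCache: a small LRU map of domains known to be allowed.
    private final Map<String, Boolean> cache;
    // Stand-in for the database lookup (assumption: returns true if the domain is listed).
    private final Predicate<String> domainStore;

    DomainFilterSketch(final int cacheSize, Predicate<String> domainStore) {
        this.domainStore = domainStore;
        // accessOrder = true makes the map evict the least recently used entry.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL if its domain is allowed, otherwise null. */
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // not a parseable URL
        }
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);

        // 1. Try the cache first.
        if (cache.get(host) != null) {
            return url;
        }
        // 2. On a miss, try the back-end store; cache the domain if it is listed.
        if (domainStore.test(host)) {
            cache.put(host, Boolean.TRUE);
            return url;
        }
        // 3. Unknown domain: reject the URL.
        return null;
    }
}
```

As in the pseudocode, only domains found in the store are cached, so
every URL from an unlisted domain still reaches the back end. Whether
that matters for a large list would need to be measured.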



                