Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

ogjunk-nutch Sun, 09 Oct 2005 23:26:28 -0700

Hi Gal,

I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.


Thanks,
Otis


--- Gal Nitzan <[EMAIL PROTECTED]> wrote:

> Hi Michael,
> 
> At the moment I have about 3000 domains in my db. I didn't time the
> performance; however, even 100k domains shouldn't have an impact,
> since each entry is fetched only once from the database into the
> cache. A small performance hit should appear only above 100k
> (depending on the number of elements defined in the xml file).
> 
> After a few teething problems, the plugin works nicely and I do not
> notice any impact.
> 
> Regards,
> 
> Gal
> 
> 
> Michael Ji wrote:
> > hi,
> >
> > How does it perform if the size of the domain list
> > reaches 10,000?
> >
> > Michael Ji
> >
> > --- "Gal Nitzan (JIRA)" <[EMAIL PROTECTED]> wrote:
> >
> >>      [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
> >>
> >> Gal Nitzan updated NUTCH-100:
> >> -----------------------------
> >>
> >>            type: Improvement  (was: New Feature)
> >>     Description: 
> >> Hi,
> >>
> >> I have written a new plugin, based on the URLFilter
> >> interface: urlfilter-db .
> >>
> >> The purpose of this plugin is to filter domains,
> >> i.e. I would like to crawl the world but to fetch
> >> only certain domains.
> >>
> >> The plugin uses a caching system (SwarmCache, easier
> >> to deploy than JCS) and on the back-end a database.
> >>
> >> For each url
> >>    filter is called
> >> end for
> >>
> >> filter
> >>  get the domain name from url
> >>   call cache.get domain
> >>   if not in cache try the database
> >>   if in database cache it and return it
> >>   return null
> >> end filter
> >>
> >>
> >> The plugin reads the cache size, jdbc driver,
> >> connection string, table to use and domain field
> >> from nutch-site.xml
> >>
> >>
> >>   was:
> >> Hi,
> >>
> >> I have written (not much) a new plugin, based on the
> >> URLFilter interface: urlfilter-db .
> >>
> >> The purpose of this plugin is to filter domains,
> >> i.e. I would like to crawl the world but to fetch
> >> only certain domains.
> >>
> >> The plugin uses a caching system (SwarmCache, easier
> >> to deploy than JCS) and on the back-end a database.
> >>
> >> For each url
> >>    filter is called
> >> end for
> >>
> >> filter
> >>  get the domain name from url
> >>   call cache.get domain
> >>   if not in cache try the database
> >>   if in database cache it and return it
> >>   return null
> >> end filter
> >>
> >>
> >> The plugin reads the cache size, jdbc driver,
> >> connection string, table to use and domain field
> >> from nutch-site.xml
> >>
> >>
> >>     Environment: All Nutch versions  (was: MapRed)
> >>
> >> Fixed some issues
> >> clean up
> >> Added a patch for Subversion
> >>
> >>
> >>> New plugin urlfilter-db
> >>> -----------------------
> >>>
> >>>          Key: NUTCH-100
> >>>          URL: http://issues.apache.org/jira/browse/NUTCH-100
> >>>      Project: Nutch
> >>>         Type: Improvement
> >>>   Components: fetcher
> >>>     Versions: 0.8-dev
> >>>  Environment: All Nutch versions
> >>>     Reporter: Gal Nitzan
> >>>     Priority: Trivial
> >>>  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz
> >>>
> >>> Hi,
> >>> I have written a new plugin, based on the URLFilter
> >>> interface: urlfilter-db .
> >>>
> >>> The purpose of this plugin is to filter domains,
> >>> i.e. I would like to crawl the world but to fetch
> >>> only certain domains.
> >>>
> >>> The plugin uses a caching system (SwarmCache, easier
> >>> to deploy than JCS) and on the back-end a database.
> >>>
> >>> For each url
> >>>    filter is called
> >>> end for
> >>>
> >>> filter
> >>>  get the domain name from url
> >>>   call cache.get domain
> >>>   if not in cache try the database
> >>>   if in database cache it and return it
> >>>   return null
> >>> end filter
> >>>
> >>> The plugin reads the cache size, jdbc driver,
> >>> connection string, table to use and domain field
> >>> from nutch-site.xml
> >>
> >> -- 
> >> This message is automatically generated by JIRA.
> >> -
> >> If you think it was sent incorrectly contact one of
> >> the administrators:
> >>    http://issues.apache.org/jira/secure/Administrators.jspa
> >> -
> >> For more information on JIRA, see:
> >>    http://www.atlassian.com/software/jira
> >>
> >>
> >>     
> >
> >
> >
> >
> >   
> 
> 
>
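
For readers following the performance question, the filter pseudocode quoted above can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the attached patch: the class name DomainFilterSketch, the dbLookups counter, the in-memory Set standing in for the database table, and the LinkedHashMap-based LRU map standing in for SwarmCache are all hypothetical, and the URL's host is used as a simple stand-in for "domain". It follows the URLFilter convention of returning the URL when it is accepted and null when it is rejected.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the quoted pseudocode; not the plugin's actual code.
class DomainFilterSketch {
    private final Set<String> allowedDomains;  // stand-in for the database table
    private final Map<String, Boolean> cache;  // LRU stand-in for SwarmCache
    int dbLookups = 0;                         // counts simulated database queries

    DomainFilterSketch(Set<String> allowedDomains, final int cacheSize) {
        this.allowedDomains = allowedDomains;
        // An access-ordered LinkedHashMap evicts the least recently used entry
        // once the configured cache size is exceeded.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    // Returns the URL if its domain is allowed, null otherwise.
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();  // "get the domain name from url"
        } catch (IllegalArgumentException e) {
            return null;                       // unparsable URL: reject
        }
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);
        if (cache.get(host) != null) {         // "call cache.get domain"
            return url;
        }
        dbLookups++;                           // "if not in cache try the database"
        if (allowedDomains.contains(host)) {
            cache.put(host, Boolean.TRUE);     // "if in database cache it and return it"
            return url;
        }
        return null;                           // "return null"
    }
}
```

One thing the sketch makes visible: as the pseudocode reads, only domains found in the database are cached, so every URL from a domain that is not listed goes to the database again. Whether the real plugin behaves this way is not stated in the thread.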
