Hi Gal,

I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.

Thanks,
Otis
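
One rough way to get numbers for this question is to measure a stand-in. The sketch below fills a plain java.util.HashMap (not SwarmCache, whose per-entry overhead will differ) with 100k made-up domain names, then reports approximate heap growth and average lookup time. The class name CacheProbe, its helper methods and the synthetic "domainN.example.com" keys are all assumptions made for illustration; they are not part of the plugin.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical probe: a HashMap stands in for the real cache. */
class CacheProbe {
    /** Builds a map holding n synthetic domain names (made-up data). */
    static Map<String, Boolean> fill(int n) {
        Map<String, Boolean> cache = new HashMap<>(n * 2);
        for (int i = 0; i < n; i++) {
            cache.put("domain" + i + ".example.com", Boolean.TRUE);
        }
        return cache;
    }

    /** Looks up each of the n keys once and returns the number of hits. */
    static int lookupAll(Map<String, Boolean> cache, int n) {
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (cache.get("domain" + i + ".example.com") != null) {
                hits++;
            }
        }
        return hits;
    }

    /** Heap currently in use; only a coarse estimate. */
    static long usedMemory() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.gc(); // a hint only, so the memory figure is approximate
        long before = usedMemory();
        Map<String, Boolean> cache = fill(n);
        System.gc();
        long after = usedMemory();

        long start = System.nanoTime();
        int hits = lookupAll(cache, n);
        // The timing includes building each key string, so it overstates
        // the cost of the lookup itself.
        long elapsed = System.nanoTime() - start;

        System.out.printf("approx. heap used: %d KB%n", (after - before) / 1024);
        System.out.printf("%d hits, avg lookup: %d ns%n", hits, elapsed / n);
    }
}
```

The figures vary by JVM and heap settings, so they are only an order-of-magnitude check; measuring the real cache would need the same loop run against SwarmCache itself.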


--- Gal Nitzan <[EMAIL PROTECTED]> wrote:

> Hi Michael,
> 
> At the moment I have about 3000 domains in my db. I haven't timed the
> performance, but even 100k domains shouldn't have an impact, since each
> domain is fetched only once from the database into the cache. A small
> performance hit should only show up above 100k domains (depending on
> the number of elements defined in the xml file).
> 
> After a few teething problems, the plugin works nicely and I do not
> notice any impact.
> 
> Regards,
> 
> Gal
> 
> 
> Michael Ji wrote:
> > Hi,
> >
> > How does performance hold up if the domain list
> > reaches 10,000 entries?
> >
> > Michael Ji
> >
> > --- "Gal Nitzan (JIRA)" <[EMAIL PROTECTED]> wrote:
> >
> >> [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
> >>
> >> Gal Nitzan updated NUTCH-100:
> >> -----------------------------
> >>
> >>            type: Improvement  (was: New Feature)
> >>     Description: 
> >> Hi,
> >>
> >> I have written a new plugin, based on the URLFilter
> >> interface: urlfilter-db .
> >>
> >> The purpose of this plugin is to filter domains,
> >> i.e. I would like to crawl the world but to fetch
> >> only certain domains.
> >>
> >> The plugin uses a caching system (SwarmCache, easier
> >> to deploy than JCS) and on the back-end a database.
> >>
> >> For each url
> >>    filter is called
> >> end for
> >>
> >> filter
> >>  get the domain name from url
> >>   call cache.get domain
> >>   if not in cache try the database
> >>   if in database cache it and return it
> >>   return null
> >> end filter
> >>
> >>
> >> The plugin reads the cache size, jdbc driver,
> >> connection string, table to use and domain field
> >> from nutch-site.xml
> >>
> >>
> >>   was:
> >> Hi,
> >>
> >> I have written (not much) a new plugin, based on the
> >> URLFilter interface: urlfilter-db .
> >>
> >> The purpose of this plugin is to filter domains,
> >> i.e. I would like to crawl the world but to fetch
> >> only certain domains.
> >>
> >> The plugin uses a caching system (SwarmCache, easier
> >> to deploy than JCS) and on the back-end a database.
> >>
> >> For each url
> >>    filter is called
> >> end for
> >>
> >> filter
> >>  get the domain name from url
> >>   call cache.get domain
> >>   if not in cache try the database
> >>   if in database cache it and return it
> >>   return null
> >> end filter
> >>
> >>
> >> The plugin reads the cache size, jdbc driver,
> >> connection string, table to use and domain field
> >> from nutch-site.xml
> >>
> >>
> >>     Environment: All Nutch versions  (was: MapRed)
> >>
> >> Fixed some issues
> >> clean up
> >> Added a patch for Subversion
> >>
> >>> New plugin urlfilter-db
> >>> -----------------------
> >>>
> >>>          Key: NUTCH-100
> >>>          URL: http://issues.apache.org/jira/browse/NUTCH-100
> >>>      Project: Nutch
> >>>         Type: Improvement
> >>>   Components: fetcher
> >>>     Versions: 0.8-dev
> >>>  Environment: All Nutch versions
> >>>     Reporter: Gal Nitzan
> >>>     Priority: Trivial
> >>>  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz,
> >>>               urlfilter-db.tar.gz
> >>>
> >>> Hi,
> >>> I have written a new plugin, based on the URLFilter
> >>> interface: urlfilter-db .
> >>>
> >>> The purpose of this plugin is to filter domains,
> >>> i.e. I would like to crawl the world but to fetch
> >>> only certain domains.
> >>>
> >>> The plugin uses a caching system (SwarmCache, easier
> >>> to deploy than JCS) and on the back-end a database.
> >>>
> >>> For each url
> >>>    filter is called
> >>> end for
> >>>
> >>> filter
> >>>  get the domain name from url
> >>>   call cache.get domain
> >>>   if not in cache try the database
> >>>   if in database cache it and return it
> >>>   return null
> >>> end filter
> >>>
> >>> The plugin reads the cache size, jdbc driver,
> >>> connection string, table to use and domain field
> >>> from nutch-site.xml
> >>
> >
> >
> >
> >
> >   
> 
> 
> 
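
The filter pseudocode quoted above could look roughly like this in Java. This is only a sketch under stated assumptions, not the plugin's actual code: an access-ordered LinkedHashMap stands in for SwarmCache, a Predicate stands in for the JDBC lookup, and the names DomainFilterSketch and hostOf are made up for illustration. It follows the convention of returning the URL to accept it and null to reject it.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of a cache-then-database domain filter. */
class DomainFilterSketch {
    private final Map<String, Boolean> cache;   // stand-in for SwarmCache
    private final Predicate<String> database;   // stand-in for the JDBC lookup

    DomainFilterSketch(final int cacheSize, Predicate<String> database) {
        this.database = database;
        // Access-ordered map that evicts the least recently used domain
        // once more than cacheSize entries are held.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL if its domain is allowed, otherwise null. */
    String filter(String url) {
        String domain = hostOf(url);
        if (domain == null) {
            return null;                      // no usable host in the URL
        }
        if (cache.get(domain) != null) {
            return url;                       // cache hit
        }
        if (database.test(domain)) {          // cache miss: try the database
            cache.put(domain, Boolean.TRUE);
            return url;
        }
        return null;                          // unknown domain: reject
    }

    /** Extracts the lower-cased host name, or null if the URL is malformed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

As in the pseudocode, only accepted domains are cached, so every URL from a rejected domain goes to the database again. Caching negative results as well would be one way to reduce that load, at the cost of cache space.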
