Hi Gal,

I'm curious about the memory consumption of the cache, and about how fast a single item can be retrieved from it, when the cache holds 100k domains.
Thanks,
Otis

--- Gal Nitzan <[EMAIL PROTECTED]> wrote:

> Hi Michael,
>
> At the moment I have about 3000 domains in my db. I didn't time the
> performance however having even 100k domains shouldn't have an impact
> since it is fetched only once from the database to the cache. A little
> performance hit should be over 100k (depends on number elements defined
> in xml file).
>
> After a few birth problems, the plugin works nicely and I do not feel
> any impact.
>
> Regards,
>
> Gal
>
>
> Michael Ji wrote:
> > hi,
> >
> > How is performance concern if the size of domain list
> > reaches 10,000?
> >
> > Micheal Ji,
> >
> > --- "Gal Nitzan (JIRA)" <[EMAIL PROTECTED]> wrote:
> >
> >> [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
> >>
> >> Gal Nitzan updated NUTCH-100:
> >> -----------------------------
> >>
> >>        type: Improvement  (was: New Feature)
> >> Description:
> >> Hi,
> >>
> >> I have written a new plugin, based on the URLFilter
> >> interface: urlfilter-db .
> >>
> >> The purpose of this plugin is to filter domains,
> >> i.e. I would like to crawl the world but to fetch
> >> only certain domains.
> >>
> >> The plugin uses a caching system (SwarmCache, easier
> >> to deploy than JCS) and on the back-end a database.
> >>
> >> For each url
> >>    filter is called
> >> end for
> >>
> >> filter
> >>    get the domain name from url
> >>    call cache.get domain
> >>    if not in cache try the database
> >>    if in database cache it and return it
> >>    return null
> >> end filter
> >>
> >> The plugin reads the cache size, jdbc driver,
> >> connection string, table to use and domain field
> >> from nutch-site.xml
> >>
> >> was:
> >> Hi,
> >>
> >> I have written (not much) a new plugin, based on the
> >> URLFilter interface: urlfilter-db .
> >>
> >> The purpose of this plugin is to filter domains,
> >> i.e. I would like to crawl the world but to fetch
> >> only certain domains.
> >>
> >> The plugin uses a caching system (SwarmCache, easier
> >> to deploy than JCS) and on the back-end a database.
> >>
> >> For each url
> >>    filter is called
> >> end for
> >>
> >> filter
> >>    get the domain name from url
> >>    call cache.get domain
> >>    if not in cache try the database
> >>    if in database cache it and return it
> >>    return null
> >> end filter
> >>
> >> The plugin reads the cache size, jdbc driver,
> >> connection string, table to use and domain field
> >> from nutch-site.xml
> >>
> >> Environment: All Nutch versions  (was: MapRed)
> >>
> >> Fixed some issues
> >> clean up
> >> Added a patch for Subversion
> >>
> >>> New plugin urlfilter-db
> >>> -----------------------
> >>>
> >>>          Key: NUTCH-100
> >>>          URL: http://issues.apache.org/jira/browse/NUTCH-100
> >>>      Project: Nutch
> >>>         Type: Improvement
> >>>   Components: fetcher
> >>>     Versions: 0.8-dev
> >>>  Environment: All Nutch versions
> >>>     Reporter: Gal Nitzan
> >>>     Priority: Trivial
> >>>  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz
> >>>
> >>> Hi,
> >>> I have written a new plugin, based on the URLFilter
> >>> interface: urlfilter-db .
> >>> The purpose of this plugin is to filter domains,
> >>> i.e. I would like to crawl the world but to fetch
> >>> only certain domains.
> >>> The plugin uses a caching system (SwarmCache, easier
> >>> to deploy than JCS) and on the back-end a database.
> >>> For each url
> >>>    filter is called
> >>> end for
> >>> filter
> >>>    get the domain name from url
> >>>    call cache.get domain
> >>>    if not in cache try the database
> >>>    if in database cache it and return it
> >>>    return null
> >>> end filter
> >>> The plugin reads the cache size, jdbc driver,
> >>> connection string, table to use and domain field
> >>> from nutch-site.xml
> >>
> >> --
> >> This message is automatically generated by JIRA.
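
For reference, the filter pseudocode quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the plugin's actual code: the class name DbUrlFilterSketch is made up, an access-ordered LinkedHashMap stands in for SwarmCache, a Predicate stands in for the JDBC query, and "domain" is taken to be the lower-cased host of the URL. Following the pseudocode, only domains found in the database are cached, so every rejected domain goes back to the database.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of the cache-then-database lookup; not the plugin's code. */
public class DbUrlFilterSketch {
    private final Map<String, Boolean> cache;          // assumption: stand-in for SwarmCache
    private final Predicate<String> domainInDatabase;  // assumption: stand-in for the JDBC query
    private int databaseLookups = 0;                   // counts trips to the "database"

    public DbUrlFilterSketch(final int cacheSize, Predicate<String> domainInDatabase) {
        this.domainInDatabase = domainInDatabase;
        // Access-ordered map that evicts the least recently used entry
        // once it holds more than cacheSize domains.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL if its domain is allowed, otherwise null. */
    public String filter(String url) {
        String domain = domainOf(url);
        if (domain == null) {
            return null;                      // no usable host in the URL
        }
        if (cache.get(domain) != null) {
            return url;                       // cache hit, no database access
        }
        databaseLookups++;                    // not in cache: try the database
        if (domainInDatabase.test(domain)) {
            cache.put(domain, Boolean.TRUE);  // in database: cache it and accept
            return url;
        }
        return null;                          // unknown domain: reject
    }

    /** Assumption: the "domain" is the lower-cased host part of the URL. */
    private static String domainOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;                      // malformed URL
        }
    }

    public int getDatabaseLookups() {
        return databaseLookups;
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("example.com");
        DbUrlFilterSketch filter = new DbUrlFilterSketch(100_000, allowed::contains);
        System.out.println(filter.filter("http://example.com/a")); // database lookup
        System.out.println(filter.filter("http://example.com/b")); // served from cache
        System.out.println(filter.filter("http://other.org/"));    // rejected
        System.out.println("database lookups: " + filter.getDatabaseLookups());
    }
}
```

In this sketch the two example.com URLs cost one database lookup between them, and the rejected other.org URL costs one more. With a hash-based cache like this, retrieval time should not depend much on the number of cached domains, but memory use would have to be measured against the real cache and the real domain list.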