Hi Otis,

I have only a few thousand URLs in my db at the moment. However, for 100K it should be about 600-800KB, since I do not cache the URL itself, only a hash string. The next time a URL is looked up in the cache, if its hash exists then it is allowed.
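
Roughly, the lookup works like this (a minimal sketch in Java, not the plugin's actual code; the class and method names are only illustrative):

import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class DomainHashCache {

  // The cache only ever holds compact hash strings, never the URLs themselves.
  private final Set<String> hashes = new HashSet<String>();

  // Hex-encoded MD5 of the (lower-cased) domain name.
  static String hash(String domain)
      throws NoSuchAlgorithmException, UnsupportedEncodingException {
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest(domain.toLowerCase().getBytes("UTF-8"));
    StringBuilder sb = new StringBuilder();
    for (byte b : digest) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public void add(String domain) throws Exception {
    hashes.add(hash(domain));
  }

  // A URL is allowed if the hash of its domain is already in the cache.
  public boolean isAllowed(String domain) throws Exception {
    return hashes.contains(hash(domain));
  }
}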

Regards,

Gal

[EMAIL PROTECTED] wrote:
Hi Gal,

I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.

Thanks,
Otis


--- Gal Nitzan <[EMAIL PROTECTED]> wrote:

Hi Michael,

At the moment I have about 3000 domains in my db. I haven't timed the performance, but even 100k domains shouldn't have an impact, since each domain is fetched only once from the database into the cache. There may be a small performance hit above 100k (depending on the number of cache elements defined in the xml file).

After a few teething problems, the plugin works nicely and I don't notice any impact.

Regards,

Gal


Michael Ji wrote:
hi,

How is performance affected if the size of the domain list
reaches 10,000?

Michael Ji

--- "Gal Nitzan (JIRA)" <[EMAIL PROTECTED]> wrote:

     [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-----------------------------

           type: Improvement  (was: New Feature)
Description: Hi,

I have written a new plugin, based on the URLFilter
interface: urlfilter-db .

The purpose of this plugin is to filter domains,
i.e. I would like to crawl the world but to fetch
only certain domains.

The plugin uses a caching system (SwarmCache, easier
to deploy than JCS) and on the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter


The plugin reads the cache size, jdbc driver,
connection string, table to use and domain field
from nutch-site.xml
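
In Java the flow above looks roughly like this (a simplified sketch, not the plugin source: a plain Set stands in for SwarmCache, the JDBC handling is cut down to the minimum, and all names are only illustrative):

import java.net.URL;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashSet;
import java.util.Set;

public class DbDomainFilter {

  private final Set<String> cache = new HashSet<String>(); // stand-in for SwarmCache
  private final Connection connection; // opened from the JDBC settings in nutch-site.xml
  private final String table;          // table name from nutch-site.xml
  private final String domainField;    // domain column from nutch-site.xml

  public DbDomainFilter(Connection connection, String table, String domainField) {
    this.connection = connection;
    this.table = table;
    this.domainField = domainField;
  }

  // Returns the URL unchanged if its domain is allowed, null to drop it.
  public String filter(String urlString) {
    try {
      String domain = new URL(urlString).getHost().toLowerCase();

      // Cache hit: accept the URL without touching the database.
      if (cache.contains(domain)) {
        return urlString;
      }

      // Cache miss: look the domain up in the back-end table.
      PreparedStatement stmt = connection.prepareStatement(
          "SELECT 1 FROM " + table + " WHERE " + domainField + " = ?");
      stmt.setString(1, domain);
      ResultSet rs = stmt.executeQuery();
      boolean allowed = rs.next();
      rs.close();
      stmt.close();

      // Only domains found in the database are cached, as in the pseudo-code
      // above, so each allowed domain costs at most one database query.
      if (allowed) {
        cache.add(domain);
        return urlString;
      }
      return null;
    } catch (Exception e) {
      // Malformed URLs and database errors simply reject the URL.
      return null;
    }
  }
}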


  was:
Hi,

I have written (not much) a new plugin, based on the
URLFilter interface: urlfilter-db .

The purpose of this plugin is to filter domains,
i.e. I would like to crawl the world but to fetch
only certain domains.

The plugin uses a caching system (SwarmCache, easier
to deploy than JCS) and on the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter


The plugin reads the cache size, jdbc driver,
connection string, table to use and domain field
from nutch-site.xml


    Environment: All Nutch versions  (was: MapRed)

Fixed some issues
clean up
Added a patch for Subversion

New plugin urlfilter-db
-----------------------

         Key: NUTCH-100
         URL: http://issues.apache.org/jira/browse/NUTCH-100
     Project: Nutch
        Type: Improvement
  Components: fetcher
    Versions: 0.8-dev
 Environment: All Nutch versions
    Reporter: Gal Nitzan
    Priority: Trivial
 Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of
the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



                