Hi,

I have written a small new plugin, urlfilter-db, based on the URLFilter interface.

The purpose of this plugin is to filter by domain: I would like to crawl the world but fetch only from certain domains.

The plugin uses a caching system (SwarmCache, easier to deploy than JCS) with a database on the back end.

for each url
  filter is called
end for

filter
  get the domain name from the url
  call cache.get(domain)
  if in the cache, return the url
  if not in the cache, try the database
  if in the database, cache the domain and return the url
  otherwise return null (the url is rejected)
end filter
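
For anyone who wants to see the flow in Java before I mail the real thing, here is a minimal stand-alone sketch of the steps above. It is not the plugin code: the class name DomainFilterSketch is made up, a small LinkedHashMap-based LRU map stands in for SwarmCache, a Predicate stands in for the JDBC lookup, and "domain" is assumed to mean the host part of the URL. It follows the usual URLFilter convention of returning the url when it is accepted and null when it is rejected.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-alone sketch; not the actual urlfilter-db source.
class DomainFilterSketch {
    // Stand-in for SwarmCache: a bounded map that evicts the least recently used entry.
    private final Map<String, Boolean> cache;
    // Stand-in for the database: answers "is this domain in the table?"
    private final Predicate<String> database;
    // Counts back-end lookups so the effect of the cache is visible.
    int databaseLookups = 0;

    DomainFilterSketch(final int cacheSize, Predicate<String> database) {
        this.database = database;
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    // Returns the url if its domain is allowed, null otherwise.
    String filter(String url) {
        String domain = domainOf(url);
        if (domain == null) {
            return null;                      // no usable host: reject
        }
        if (cache.get(domain) != null) {
            return url;                       // cache hit: accept without touching the database
        }
        databaseLookups++;
        if (database.test(domain)) {
            cache.put(domain, Boolean.TRUE);  // remember the allowed domain
            return url;
        }
        return null;                          // not in cache or database: reject
    }

    // Assumption: the "domain" is the lower-cased host part of the URL.
    static String domainOf(String url) {
        if (url == null) {
            return null;
        }
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;                      // malformed URL
        }
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("example.org");
        DomainFilterSketch f = new DomainFilterSketch(100, allowed::contains);
        System.out.println(f.filter("http://example.org/a"));
        System.out.println(f.filter("http://other.test/"));
    }
}
```

Note that, as in the steps above, only accepted domains are cached, so every url from a rejected domain goes to the database again. Caching rejections as well would be a possible variation, not something the plugin is described as doing.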


The plugin reads the cache size, JDBC driver, connection string, table to use and domain field from nutch-site.xml.

Since I do not have the tools to add it to SVN, if anyone is interested, let me know and I can mail it.

Regards,

Gal


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
