[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-----------------------------

           type: Improvement  (was: New Feature)
    Description: 
Hi,

I have written a new plugin, based on the URLFilter interface: urlfilter-db.

The purpose of this plugin is to filter by domain, i.e. I would like to crawl 
the world but fetch only certain domains.

The plugin uses a caching system (SwarmCache, easier to deploy than JCS) with 
a database on the back end.

For each url
   filter is called
end for

filter
   get the domain name from the url
   call cache.get(domain)
   if in cache, return the url
   if not in cache, try the database
   if in database, cache it and return the url
   otherwise return null (the url is rejected)
end filter
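
The lookup above could be sketched in Java roughly as follows. This is only an illustration of the cache-then-database flow, not the code in the attached plugin: the class name DomainFilterSketch is made up, an LRU LinkedHashMap stands in for SwarmCache, an in-memory Set stands in for the JDBC-backed table, and the "domain" is assumed to be simply the URL's host name.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative stand-in only: not the urlfilter-db source.
class DomainFilterSketch {
    private final Map<String, Boolean> cache;   // stand-in for SwarmCache
    private final Set<String> database;         // stand-in for the database table
    private int databaseLookups = 0;            // counts cache misses, for demonstration

    public DomainFilterSketch(final int cacheSize, Set<String> allowedDomains) {
        // Access-ordered map that evicts the least recently used entry when full.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
        this.database = new HashSet<>(allowedDomains);
    }

    /** Returns the url if its host is allowed, or null to reject it. */
    public String filter(String url) {
        String domain;
        try {
            domain = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;                         // unparsable url: reject
        }
        if (domain == null) {
            return null;                         // no host part: reject
        }
        domain = domain.toLowerCase(Locale.ROOT);
        if (cache.get(domain) != null) {
            return url;                          // cache hit
        }
        databaseLookups++;                       // cache miss: try the database
        if (database.contains(domain)) {
            cache.put(domain, Boolean.TRUE);     // remember the positive answer
            return url;
        }
        return null;                             // unknown domain: reject
    }

    public int getDatabaseLookups() {
        return databaseLookups;
    }
}
```

As in the pseudocode, only domains found in the database are cached, so a rejected domain costs a database lookup every time it is seen.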


The plugin reads the cache size, JDBC driver, connection string, table name 
and domain field from nutch-site.xml.
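
For illustration, reading such settings might look something like the sketch below. The property names (urlfilter.db.cache.size and so on), their values, and the root element name are all hypothetical, since the actual names are defined in the attached plugin; the sketch only assumes the usual name/value property layout and parses it with the JDK's DOM parser rather than Nutch's own configuration class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative only: property names below are hypothetical, not the plugin's.
class FilterConfigSketch {

    /** Collects every <property><name>/<value> pair found in the given XML. */
    public static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new HashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical fragment; the root element name is not significant here.
        String xml =
              "<configuration>"
            + "<property><name>urlfilter.db.cache.size</name><value>1000</value></property>"
            + "<property><name>urlfilter.db.jdbc.driver</name><value>org.example.Driver</value></property>"
            + "<property><name>urlfilter.db.table</name><value>domains</value></property>"
            + "<property><name>urlfilter.db.domain.field</name><value>domain</value></property>"
            + "</configuration>";
        Map<String, String> props = readProperties(xml);
        int cacheSize = Integer.parseInt(props.get("urlfilter.db.cache.size"));
        System.out.println("cache size = " + cacheSize);
        System.out.println("table = " + props.get("urlfilter.db.table"));
    }
}
```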


  was:
Hi,

I have written (not much) a new plugin, based on the URLFilter interface: 
urlfilter-db .

The purpose of this plugin is to filter domains, i.e. I would like to crawl the 
world but to fetch only certain domains.

The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and on 
the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter


The plugin reads the cache size, jdbc driver, connection string, table to use 
and domain field from nutch-site.xml


    Environment: All Nutch versions  (was: MapRed)

Fixed some issues, cleaned up the code, and added a patch for Subversion.

> New plugin urlfilter-db
> -----------------------
>
>          Key: NUTCH-100
>          URL: http://issues.apache.org/jira/browse/NUTCH-100
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: All Nutch versions
>     Reporter: Gal Nitzan
>     Priority: Trivial
>  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz
>
> Hi,
> I have written a new plugin, based on the URLFilter interface: urlfilter-db.
> The purpose of this plugin is to filter by domain, i.e. I would like to crawl 
> the world but fetch only certain domains.
> The plugin uses a caching system (SwarmCache, easier to deploy than JCS) with 
> a database on the back end.
> For each url
>    filter is called
> end for
> filter
>    get the domain name from the url
>    call cache.get(domain)
>    if in cache, return the url
>    if not in cache, try the database
>    if in database, cache it and return the url
>    otherwise return null (the url is rejected)
> end filter
> The plugin reads the cache size, JDBC driver, connection string, table name 
> and domain field from nutch-site.xml.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
