[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
Gal Nitzan updated NUTCH-100:
-----------------------------
type: Improvement (was: New Feature)
Description:
Hi,
I have written a new plugin, urlfilter-db, based on the URLFilter interface.
The purpose of this plugin is to filter by domain, i.e. I would like to crawl the
world but fetch only certain domains.
The plugin uses a caching system (SwarmCache, easier to deploy than JCS) in front
of a back-end database.
For each url
    filter is called
end for
filter
    get the domain name from url
    call cache.get domain
    if not in cache try the database
        if in database cache it and return it
    return null
end filter
The plugin reads the cache size, JDBC driver, connection string, table name,
and domain field from nutch-site.xml.
was:
Hi,
I have written (not much) a new plugin, based on the URLFilter interface:
urlfilter-db .
The purpose of this plugin is to filter domains, i.e. I would like to crawl the
world but to fetch only certain domains.
The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and on
the back-end a database.
For each url
    filter is called
end for
filter
    get the domain name from url
    call cache.get domain
    if not in cache try the database
        if in database cache it and return it
    return null
end filter
The plugin reads the cache size, jdbc driver, connection string, table to use
and domain field from nutch-site.xml
Environment: All Nutch versions (was: MapRed)
Fixed some issues
clean up
Added a patch for Subversion
> New plugin urlfilter-db
> -----------------------
>
> Key: NUTCH-100
> URL: http://issues.apache.org/jira/browse/NUTCH-100
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Versions: 0.8-dev
> Environment: All Nutch versions
> Reporter: Gal Nitzan
> Priority: Trivial
> Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz
>
> Hi,
> I have written a new plugin, urlfilter-db, based on the URLFilter interface.
> The purpose of this plugin is to filter by domain, i.e. I would like to crawl
> the world but fetch only certain domains.
> The plugin uses a caching system (SwarmCache, easier to deploy than JCS) in
> front of a back-end database.
> For each url
>     filter is called
> end for
> filter
>     get the domain name from url
>     call cache.get domain
>     if not in cache try the database
>         if in database cache it and return it
>     return null
> end filter
> The plugin reads the cache size, JDBC driver, connection string, table name,
> and domain field from nutch-site.xml.
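
For readers who want something concrete, the filter pseudocode quoted above
could be sketched in Java roughly as follows. This is not the code from the
attached plugin: the class name DomainFilterSketch is hypothetical, an
access-ordered LinkedHashMap stands in for SwarmCache, a Predicate stands in
for the JDBC lookup, and the URL's host is assumed to be "the domain". As in
the pseudocode, only domains found in the database are cached, and the method
returns the URL when it is accepted and null when it is rejected.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Sketch of a cache-then-database domain filter; not the plugin's actual code. */
public class DomainFilterSketch {
    private final Map<String, Boolean> cache;       // stand-in for SwarmCache
    private final Predicate<String> databaseLookup; // stand-in for the JDBC query
    private int databaseCalls = 0;                  // counted only to show cache hits

    public DomainFilterSketch(final int cacheSize, Predicate<String> databaseLookup) {
        this.databaseLookup = databaseLookup;
        // Access-ordered map that evicts the least recently used domain
        // once more than cacheSize entries are stored.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL if its host is an allowed domain, otherwise null. */
    public String filter(String url) {
        String domain;
        try {
            domain = new URI(url).getHost();        // assumption: host == domain
        } catch (URISyntaxException e) {
            return null;                            // unparsable URL: reject
        }
        if (domain == null) {
            return null;
        }
        if (cache.containsKey(domain)) {
            return url;                             // cache hit
        }
        databaseCalls++;
        if (databaseLookup.test(domain)) {          // cache miss: try the database
            cache.put(domain, Boolean.TRUE);
            return url;
        }
        return null;                                // not in database: reject
    }

    public int getDatabaseCalls() {
        return databaseCalls;
    }
}
```

With a lookup such as d -> d.equals("apache.org"), two URLs on apache.org
would cost a single database call, while a URL on any other host would be
rejected and looked up again on the next call, since rejections are not cached
in this sketch.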