[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227859#comment-13227859 ]
Mathijs Homminga commented on NUTCH-882:
----------------------------------------
Hi guys,
I have second thoughts on implementing the NutchContext concept at this stage.
All Nutch processes are centered around the concept of a WebPage. And I agree,
many of these processes and their plugins might benefit from additional input
which is related to, but not directly part of, a WebPage, such as host
statistics, metadata or domain information.
The proposed NutchContext solution is elegant in that it makes this
additional information available to plugins in an extensible way.
However, it indeed requires a big API break for plugins (since we don't use
abstract base classes for all the plugins, we can't fix it there to keep them
compatible).
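
To illustrate the compatibility point, here is a minimal Java sketch of how an
abstract base class could absorb a signature change. All names in it
(PageFilter, LegacyPageFilterBase, Context) are hypothetical and are not
Nutch's actual plugin API; it only shows the general technique, assuming an
extension point that gains an extra context argument.

```java
import java.util.HashMap;
import java.util.Map;

public class BaseClassCompatSketch {

    /** Hypothetical stand-in for the extra, non-page information. */
    static class Context {
        private final Map<String, String> values = new HashMap<>();
        void put(String key, String value) { values.put(key, value); }
        String get(String key) { return values.get(key); }
    }

    /** Hypothetical new-style extension point: it now takes a context. */
    interface PageFilter {
        boolean accept(String url, Context context);
    }

    /**
     * Bridges the new signature to the old one. Plugins that extend this
     * class keep compiling, because only the base class has to change.
     */
    static abstract class LegacyPageFilterBase implements PageFilter {
        /** The old-style method that existing plugins already implement. */
        protected abstract boolean accept(String url);

        @Override
        public boolean accept(String url, Context context) {
            // The context is ignored; old plugins never knew about it.
            return accept(url);
        }
    }

    /** An "old" plugin, written before the context argument existed. */
    static class HttpOnlyFilter extends LegacyPageFilterBase {
        @Override
        protected boolean accept(String url) {
            return url.startsWith("http://");
        }
    }

    public static void main(String[] args) {
        PageFilter filter = new HttpOnlyFilter();
        System.out.println(filter.accept("http://example.org/", new Context()));
        System.out.println(filter.accept("ftp://example.org/", new Context()));
    }
}
```

A plugin that implements the interface directly has no such bridge, so a
changed method signature breaks it at compile time.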
I'm afraid that a patch that tries to implement the Host table and the
NutchContext at the same time will have a hard time making it into the
repository ;)
I propose to move the NutchContext approach to a new issue.
Plugins and other components can still use Host information by using the HostDB
class directly to perform efficient host lookups when needed. We can then
decide later to make this part of the NutchContext.
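
As a rough sketch of what such direct use could look like: the Java below is
not the HostDB class from the attached patch, and every name in it (HostInfo,
HostLookup, CachingHostLookup, threadsFor) is hypothetical. It assumes that
"efficient" means caching results per host so that repeated lookups do not go
back to the underlying table, which is an assumption on my part.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class DirectHostLookupSketch {

    /** Hypothetical host record; the single field is illustrative only. */
    static class HostInfo {
        final String host;
        final int maxThreads;
        HostInfo(String host, int maxThreads) {
            this.host = host;
            this.maxThreads = maxThreads;
        }
    }

    /** Hypothetical lookup abstraction over the host table. */
    interface HostLookup {
        Optional<HostInfo> lookup(String host);
    }

    /** In-memory stand-in for the backing store; counts how often it is read. */
    static class InMemoryHostStore implements HostLookup {
        private final Map<String, HostInfo> rows = new HashMap<>();
        int reads = 0;

        void put(HostInfo info) { rows.put(info.host, info); }

        @Override
        public Optional<HostInfo> lookup(String host) {
            reads++;
            return Optional.ofNullable(rows.get(host));
        }
    }

    /** Caches hits and misses so each host is read from the store only once. */
    static class CachingHostLookup implements HostLookup {
        private final HostLookup delegate;
        private final Map<String, Optional<HostInfo>> cache = new HashMap<>();

        CachingHostLookup(HostLookup delegate) { this.delegate = delegate; }

        @Override
        public Optional<HostInfo> lookup(String host) {
            return cache.computeIfAbsent(host, delegate::lookup);
        }
    }

    /** A component asks for host data only at the point where it needs it. */
    static int threadsFor(String host, HostLookup hosts, int defaultThreads) {
        return hosts.lookup(host).map(h -> h.maxThreads).orElse(defaultThreads);
    }

    public static void main(String[] args) {
        InMemoryHostStore store = new InMemoryHostStore();
        store.put(new HostInfo("example.org", 2));
        HostLookup hosts = new CachingHostLookup(store);

        System.out.println(threadsFor("example.org", hosts, 1));
        System.out.println(threadsFor("example.org", hosts, 1)); // served from cache
        System.out.println(threadsFor("unknown.test", hosts, 1)); // falls back to default
        System.out.println("store reads: " + store.reads);
    }
}
```

The component receives the lookup object explicitly rather than through a
shared context, so no plugin interface has to change.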
Agreed?
> Design a Host table in GORA
> ---------------------------
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: nutchgora
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-882-v1.patch, hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and
> domains?) would be very useful for:
> * customising the fetching behaviour on a per-host basis, e.g. number of
> threads, min time between threads etc.
> * storing stats
> * keeping metadata and possibly propagating it to the webpages
> * keeping a copy of the robots.txt and possibly using it later to filter the
> webtable
> * storing sitemap files and updating the webtable accordingly
> I'll try to come up with a GORA schema for such a host table, but any comments
> are of course already welcome.