[
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187950#comment-13187950
]
Edward Drapkin commented on NUTCH-1201:
---------------------------------------
You bring up a good point, and I was making a pretty blatant assumption that
the code is in fact reusable for these other cases.
I think at the highest level, fetching will always basically be a
producer-consumer task, which implies that there will always be these
components: some queue, something to feed the queue, something to consume from
the queue, and something to pull it all together into the Hadoop job. If
there's a better way of architecting the code necessary to run a fetching
process, it's not something I've seen. The interfaces that I suggest reflect
this (and use the same names currently being used) and the default
implementations would be the existing code, so as not to break backward
compatibility.
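
To make the four components concrete, here is a minimal sketch of how the
split could look. Every name in it (FetchQueue, QueueFeeder, QueueConsumer,
GlueFetcher and the small implementations) is hypothetical and chosen only for
illustration; it is not the existing Nutch code, it fetches nothing, and it
assumes a single consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical interfaces: illustrative only, not Nutch's actual API.
interface FetchQueue {
    void put(String url) throws InterruptedException;
    void finish() throws InterruptedException; // signals that no more items will arrive
    String take() throws InterruptedException; // returns null once finish() was reached
}

interface QueueFeeder {
    void feed(FetchQueue queue) throws InterruptedException;
}

interface QueueConsumer {
    void consume(FetchQueue queue) throws InterruptedException;
}

// Bounded FIFO queue; a sentinel object marks the end of input (single consumer assumed).
final class BoundedFetchQueue implements FetchQueue {
    private static final String END = new String("END"); // compared by identity only
    private final BlockingQueue<String> items;

    BoundedFetchQueue(int capacity) { this.items = new LinkedBlockingQueue<>(capacity); }

    @Override public void put(String url) throws InterruptedException { items.put(url); }
    @Override public void finish() throws InterruptedException { items.put(END); }
    @Override public String take() throws InterruptedException {
        String item = items.take();
        return item == END ? null : item;
    }
}

// Feeds a fixed list of URLs; a real feeder would read its input from the job.
final class ListFeeder implements QueueFeeder {
    private final List<String> urls;
    ListFeeder(List<String> urls) { this.urls = urls; }
    @Override public void feed(FetchQueue queue) throws InterruptedException {
        for (String url : urls) queue.put(url);
    }
}

// Stand-in for a fetcher thread: records each URL instead of fetching it.
final class RecordingConsumer implements QueueConsumer {
    final List<String> fetched = new ArrayList<>();
    @Override public void consume(FetchQueue queue) throws InterruptedException {
        for (String url = queue.take(); url != null; url = queue.take()) fetched.add(url);
    }
}

// The "glue": owns no fetching logic, it only wires the three parts together.
final class GlueFetcher {
    private final FetchQueue queue;
    private final QueueFeeder feeder;
    private final QueueConsumer consumer;

    GlueFetcher(FetchQueue queue, QueueFeeder feeder, QueueConsumer consumer) {
        this.queue = queue;
        this.feeder = feeder;
        this.consumer = consumer;
    }

    void run() {
        Thread feederThread = new Thread(() -> {
            try {
                try {
                    feeder.feed(queue);
                } finally {
                    queue.finish(); // always release the consumer, even if feeding failed
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        feederThread.start();
        try {
            consumer.consume(queue); // consume on the calling thread
            feederThread.join();
        } catch (InterruptedException e) {
            feederThread.interrupt();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("fetch interrupted", e);
        }
    }
}
```

With this shape, changing the queueing policy or the fetch behavior means
swapping one constructor argument of GlueFetcher rather than subclassing it.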
I do think, though, that Fetcher itself ought to be able to be overridden and
customized (hence providing an interface to it), although we should focus on
making that something that no one wants to do, so it doesn't even need to be
discouraged. I envision a situation in which Fetcher basically serves as
"glue" that holds the other three components together, so any logic that
needs to change would be changed in one of the other components.
We may wind up in a situation where the only benefit to providing custom queue
behavior is in conjunction with providing custom queue feeder + queue consumer
behavior... as a matter of fact, I'd fully expect this to frequently be the
case. Perhaps a better overall approach here might be to break fetching into a
high-level Nutch abstraction, then provide several fetching plugins that can be
dropped into place depending on the situation, similar to the way that the
protocol plugins behave. The fetcher already runs threads outside of the
Hadoop framework, so a generic fetcher job that just invoked a fetching plugin
wouldn't have to be a regression of any sort.
The more I think about it, the more I think that this may be the right solution
to a modular fetching system: Nutch (eventually) shipping with
"fetch-depthfirst" and "fetch-unthreaded" and "fetch-default" and any other
scenario that may arise would allow for support for several cases right out of
the box. This approach would probably be the most difficult in terms of man hours
and testing (but hey, I'm volunteering, right?), but I think it's probably the
best way to provide modular fetcher functionality.
If we decide to break the fetcher into a plugin, then the fetcher only has to
conform to a relatively simple interface. I'd think that we would provide an
abstract class that implements that interface and holds together the other
sub-components mentioned above, as a starting point for the various fetcher
plugins, but I don't think we would have to require that it be used. We could,
similarly, offer abstract class default implementations of the various
sub-components as well, but we wouldn't require them to be used in
any capacity.
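
As a rough sketch of that idea: a small plugin interface, an optional abstract
base class, one plugin that uses the base class and one that ignores it. All of
the names below are hypothetical, "fetch-reversed" is a made-up plugin id, and
the registry is only a stand-in for however Nutch would actually discover and
load fetcher plugins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical plugin contract: the only thing a fetcher plugin must implement.
interface FetcherPlugin {
    List<String> fetch(List<String> urls);
}

// Optional starting point: holds the loop together, subclasses fill in one step.
abstract class AbstractFetcherPlugin implements FetcherPlugin {
    protected abstract String fetchOne(String url);

    @Override
    public final List<String> fetch(List<String> urls) {
        List<String> results = new ArrayList<>();
        for (String url : urls) results.add(fetchOne(url));
        return results;
    }
}

// A plugin built on the base class; it only says what fetching one URL means.
final class DefaultFetcherPlugin extends AbstractFetcherPlugin {
    @Override
    protected String fetchOne(String url) { return "fetched:" + url; }
}

// A plugin that implements the interface directly, showing the base class is not required.
final class ReverseOrderFetcherPlugin implements FetcherPlugin {
    @Override
    public List<String> fetch(List<String> urls) {
        List<String> results = new ArrayList<>();
        for (int i = urls.size() - 1; i >= 0; i--) results.add("fetched:" + urls.get(i));
        return results;
    }
}

// Stand-in for plugin lookup: maps a plugin id to a factory.
final class FetcherPluginRegistry {
    private final Map<String, Supplier<FetcherPlugin>> factories = new HashMap<>();

    void register(String id, Supplier<FetcherPlugin> factory) { factories.put(id, factory); }

    FetcherPlugin create(String id) {
        Supplier<FetcherPlugin> factory = factories.get(id);
        if (factory == null) throw new IllegalArgumentException("No fetcher plugin: " + id);
        return factory.get();
    }
}
```

A generic fetcher job would then only need to look up the configured id and
call fetch(), without knowing which implementation it received.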
> Allow for different FetcherThread impls
> ---------------------------------------
>
> Key: NUTCH-1201
> URL: https://issues.apache.org/jira/browse/NUTCH-1201
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.5
>
>
> For certain cases we need to modify parts in FetcherThread and make it
> pluggable. This introduces a new config directive fetcher.impl that takes a
> FQCN and uses that setting in Fetcher.fetch to load a class to use for
> job.setMapRunnerClass(). This new class has to extend Fetcher and inner
> class FetcherThread. This allows for overriding methods in FetcherThread but
> also methods in Fetcher itself if required.
> A follow up on this issue would be to refactor parts of FetcherThread to make
> it easier to override small sections instead of copying the entire method
> body for a small change, which is now the case.
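
For reference, the class-loading step in the issue description could look
roughly like the sketch below. It is not the attached patch: it uses
java.util.Properties in place of the Hadoop configuration so that it runs on
its own, and BaseFetcher, DepthFirstFetcher and FetcherLoader are hypothetical
stand-ins. Only the fetcher.impl key comes from the description.

```java
import java.util.Properties;

// Hypothetical stand-in for the default Fetcher class.
class BaseFetcher {
    String describe() { return "default fetcher"; }
}

// Hypothetical custom implementation that extends the default one.
class DepthFirstFetcher extends BaseFetcher {
    @Override
    String describe() { return "depth-first fetcher"; }
}

final class FetcherLoader {
    static final String IMPL_KEY = "fetcher.impl";

    // Resolves the configured class name, falling back to the default when unset.
    static Class<? extends BaseFetcher> resolve(Properties conf) {
        String fqcn = conf.getProperty(IMPL_KEY);
        if (fqcn == null || fqcn.isEmpty()) return BaseFetcher.class;
        try {
            // asSubclass rejects classes that do not extend BaseFetcher.
            return Class.forName(fqcn).asSubclass(BaseFetcher.class);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown " + IMPL_KEY + ": " + fqcn, e);
        }
    }
}
```

The resolved class is what would then be handed to job.setMapRunnerClass().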