[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

Edward Drapkin (Commented) (JIRA) Tue, 17 Jan 2012 12:02:05 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187950#comment-13187950
 ]


Edward Drapkin commented on NUTCH-1201:
---------------------------------------

You bring up a good point, and I was making a pretty blatant assumption that 
the code is in fact reusable for these other cases.

I think at the highest level, fetching will always basically be a 
producer-consumer task, which implies that there will always be these 
components: some queue, something to feed the queue, something to consume from 
the queue, and something to pull it all together into the Hadoop job.  If 
there's a better way of architecting the code necessary to run a fetching 
process, it's not something I've seen.  The interfaces that I suggest reflect 
this (and use the same names currently being used) and the default 
implementations would be the existing code, so as not to break backward 
compatibility.

I do think, though, that Fetcher itself ought to be able to be overridden and 
customized (hence providing an interface to it), although we should focus on 
making that something that no one wants to do, so it doesn't even need to be 
discouraged.  I envision a situation in which Fetcher basically just serves as 
"glue" that holds the other three components together, so that any logic that 
needs to change would be changed in one of the other components. 
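
To make the four pieces above concrete, here is a minimal sketch in plain Java.
Every name in it (FetchQueue, QueueFeeder, QueueConsumer, GlueFetcher and the
small implementations) is hypothetical and chosen for illustration; none of it
is Nutch's existing code, and the "fetch" is only a string stand-in.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical names throughout; these are not Nutch's actual classes.

/** The queue that sits between the feeder and the consumer. */
interface FetchQueue {
    void put(String url) throws InterruptedException;
    String take() throws InterruptedException; // blocks until an item arrives
}

/** Produces work: pushes URLs onto the queue, then signals the end. */
interface QueueFeeder {
    void feed(FetchQueue queue) throws InterruptedException;
}

/** Consumes work: pulls URLs off the queue and records a result for each. */
interface QueueConsumer {
    void consume(FetchQueue queue, List<String> results) throws InterruptedException;
}

final class SimpleFetchQueue implements FetchQueue {
    static final String END = "<end-of-input>"; // sentinel, assumed never to be a real URL
    private final BlockingQueue<String> items = new LinkedBlockingQueue<>();
    @Override public void put(String url) throws InterruptedException { items.put(url); }
    @Override public String take() throws InterruptedException { return items.take(); }
}

final class ListFeeder implements QueueFeeder {
    private final List<String> urls;
    ListFeeder(List<String> urls) { this.urls = urls; }
    @Override public void feed(FetchQueue queue) throws InterruptedException {
        for (String url : urls) queue.put(url);
        queue.put(SimpleFetchQueue.END); // tell the single consumer to stop
    }
}

final class RecordingConsumer implements QueueConsumer {
    @Override public void consume(FetchQueue queue, List<String> results)
            throws InterruptedException {
        for (String url = queue.take(); !SimpleFetchQueue.END.equals(url); url = queue.take()) {
            results.add("fetched:" + url); // stand-in for a real protocol fetch
        }
    }
}

/** The "glue": holds no fetching logic, it only wires the three parts together. */
final class GlueFetcher {
    private final FetchQueue queue;
    private final QueueFeeder feeder;
    private final QueueConsumer consumer;

    GlueFetcher(FetchQueue queue, QueueFeeder feeder, QueueConsumer consumer) {
        this.queue = queue;
        this.feeder = feeder;
        this.consumer = consumer;
    }

    List<String> run() {
        List<String> results = Collections.synchronizedList(new ArrayList<>());
        Thread producer = new Thread(() -> {
            try { feeder.feed(queue); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread worker = new Thread(() -> {
            try { consumer.consume(queue, results); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        producer.start();
        worker.start();
        try {
            producer.join();
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        }
        return new ArrayList<>(results);
    }
}
```

With this shape, swapping in a different queue, feeder, or consumer means
passing a different object to the GlueFetcher constructor; the glue class
itself stays untouched.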
 

We may wind up in a situation where the only benefit to providing custom queue 
behavior is in conjunction with providing custom queue feeder + queue consumer 
behavior... as a matter of fact, I'd fully expect this to frequently be the 
case.  Perhaps a better overall approach here might be to break Fetching into a 
high-level Nutch abstraction, then provide several fetching plugins that can be 
dropped into place depending on the situation, similar to the way that the 
protocol plugins behave.  The fetcher already runs threads outside of the 
Hadoop framework, so a generic fetcher job that just invoked a fetching plugin 
wouldn't have to be a regression of any sort.  

The more I think about it, the more I think that this may be the right solution 
to a modular fetching system: if Nutch (eventually) shipped with 
"fetch-depthfirst", "fetch-unthreaded", "fetch-default", and plugins for any 
other scenario that may arise, it would support several cases right out of the 
box.  This approach would probably be the most expensive in terms of man-hours 
and testing (but hey, I'm volunteering, right?), but I think it's probably the 
best way to provide modular fetcher functionality.
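
As a rough illustration of a generic fetcher job that just invokes whichever
fetching plugin is configured, here is a small sketch.  It deliberately does
not use Nutch's real plugin machinery: FetcherPlugin, FetcherPlugins and
GenericFetchJob are made-up names, the registry is a plain map, and the two
plugin bodies are placeholders that only tag each URL.  Only the plugin ids
"fetch-default" and "fetch-unthreaded" come from the suggestion above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical extension point; this is not Nutch's actual plugin API.
interface FetcherPlugin {
    List<String> fetch(List<String> urls);
}

/** Placeholder for the existing threaded behaviour; here it only tags each URL. */
final class DefaultFetcherPlugin implements FetcherPlugin {
    @Override public List<String> fetch(List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String url : urls) out.add("default:" + url);
        return out;
    }
}

/** Placeholder for a single-threaded variant; here it only tags each URL. */
final class UnthreadedFetcherPlugin implements FetcherPlugin {
    @Override public List<String> fetch(List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String url : urls) out.add("unthreaded:" + url);
        return out;
    }
}

/** A plain map standing in for whatever plugin lookup Nutch would really use. */
final class FetcherPlugins {
    private static final Map<String, Supplier<FetcherPlugin>> REGISTRY = new LinkedHashMap<>();
    static {
        REGISTRY.put("fetch-default", DefaultFetcherPlugin::new);
        REGISTRY.put("fetch-unthreaded", UnthreadedFetcherPlugin::new);
    }

    private FetcherPlugins() { }

    static FetcherPlugin forName(String id) {
        Supplier<FetcherPlugin> factory = REGISTRY.get(id);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown fetcher plugin: " + id);
        }
        return factory.get();
    }
}

/** The generic job: it knows nothing about how fetching is done. */
final class GenericFetchJob {
    static List<String> run(String pluginId, List<String> urls) {
        return FetcherPlugins.forName(pluginId).fetch(urls);
    }

    public static void main(String[] args) {
        System.out.println(run("fetch-unthreaded", List.of("http://a.example/")));
    }
}
```

The point of the sketch is only that the job picks a plugin by id and
delegates; how the id is configured and how plugins are discovered would be
decided by the real plugin system.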

If we decide to break the fetcher into a plugin, then the fetcher only has to 
conform to a relatively simple interface.  I'd think that we would provide an 
abstract class that implements that interface and holds together the other 
sub-components mentioned above, as a starting point for the various fetcher 
plugins, but I don't think we would have to require that it be used.  We could, 
similarly, offer abstract default implementations of the various 
sub-components as well, but nowhere would we force or require them to be used.
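
A minimal sketch of that arrangement, with hypothetical names (ModularFetcher,
AbstractModularFetcher, EchoFetcher, StandaloneFetcher) and a string stand-in
for the actual fetch: the interface is small, the abstract class is an
optional starting point that ties a queue, a feed step and a consume step
together, and a plugin is free to ignore it and implement the interface
directly.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical names; a sketch of the idea, not a proposed Nutch API.

/** The small interface every fetcher plugin would have to satisfy. */
interface ModularFetcher {
    List<String> fetchAll(List<String> seeds);
}

/** Optional starting point: wires queue, feed and consume steps together. */
abstract class AbstractModularFetcher implements ModularFetcher {
    /** Queue sub-component; subclasses may supply a different one. */
    protected Queue<String> newQueue() {
        return new ArrayDeque<>();
    }

    /** Feeder sub-component; the default simply enqueues every seed. */
    protected void feed(List<String> seeds, Queue<String> queue) {
        queue.addAll(seeds);
    }

    /** Consumer sub-component; the one piece a subclass must provide. */
    protected abstract String fetchOne(String url);

    @Override
    public final List<String> fetchAll(List<String> seeds) {
        Queue<String> queue = newQueue();
        feed(seeds, queue);
        List<String> results = new ArrayList<>();
        String url;
        while ((url = queue.poll()) != null) {
            results.add(fetchOne(url));
        }
        return results;
    }
}

/** A plugin built on the abstract class; it only fills in the fetch step. */
final class EchoFetcher extends AbstractModularFetcher {
    @Override protected String fetchOne(String url) {
        return "fetched:" + url; // stand-in for a real fetch
    }
}

/** A plugin that skips the abstract class and implements the interface itself. */
final class StandaloneFetcher implements ModularFetcher {
    @Override public List<String> fetchAll(List<String> seeds) {
        List<String> results = new ArrayList<>();
        for (String url : seeds) {
            results.add("standalone:" + url);
        }
        return results;
    }
}
```

Both EchoFetcher and StandaloneFetcher can be used anywhere a ModularFetcher
is expected, which is the sense in which the abstract class is offered but
not required.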
                
> Allow for different FetcherThread impls
> ---------------------------------------
>
>                 Key: NUTCH-1201
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1201
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>
> For certain cases we need to modify parts in FetcherThread and make it 
> pluggable. This introduces a new config directive fetcher.impl that takes a 
> FQCN and uses that setting in Fetcher.fetch to load a class to use for 
> job.setMapRunnerClass(). This new class has to extend Fetcher and its inner 
> class FetcherThread. This allows for overriding methods in FetcherThread but 
> also methods in Fetcher itself if required.
> A follow up on this issue would be to refactor parts of FetcherThread to make 
> it easier to override small sections instead of copying the entire method 
> body for a small change, which is now the case.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
