On 1/21/2012 8:27 AM, Lewis John Mcgibbney wrote:
Hi Julien,
There are 8 issues in trunk about the fetcher - some of them
unrelated to the Fetcher (NUTCH-827
<https://issues.apache.org/jira/browse/NUTCH-827> / NUTCH-1193)
with most of the others being improvements (NUTCH-828
<https://issues.apache.org/jira/browse/NUTCH-828> / NUTCH-1079
<https://issues.apache.org/jira/browse/NUTCH-1079>) with possibly
just a very few being real issues.
This puts the whole discussion into much better context, thanks for
pointing this out. Maybe I should have made it clearer that I only
filtered the fetcher issues on our Jira and was simply modelling my
discussion around that. You are completely correct though, it would be
different if the fetcher was in a similar state to
protocol-httpclient... which it is obviously not.
I am also concerned about making overly radical changes to such a
core part of the framework, especially when more pressing issues
could be looked after instead.
+1
Having said that if someone can come up with an interesting
proposal for improving the Fetcher that would be very good, I
would simply suggest that we then have a separate implementation
for that.
+1
OK, with this in mind, is there some guidance we can communicate
to Eddie? He has specifically mentioned that he shares similar
opinions wrt the fetcher being a core part of Nutch, radical changes
etc, and I also share this point of view. He has also added that he
doesn't want to spend time changing material which we may or may
not merge with trunk, which also makes perfect sense. Additionally,
Ken's comments emphasise that this has been somewhat attempted in the
past, that lessons have been learned, and that the implementation we
have cuts the mustard as is.
Maybe we could nudge Eddie in the right direction, which would benefit
both him and the project over the next while. I think this was the
most important point I was trying to emphasise; however, looking over
my original comment, this was maybe not how it was written.
Thanks
Lewis
If there are more important and/or interesting things for me to work on,
I'll be glad to. I'm completely unfamiliar with the current state of
the project as a whole - and looking through JIRA is a bit daunting.
The only reason I'm attracted to working on the fetcher is I think it's
a really interesting and compelling problem to solve, and making it
more flexible is something that would directly benefit our use of it,
so it will be easier to devote time to it while I'm at the office. I do
have a glut of free time at the moment though, so I'm perfectly okay
working on another area that's more pressing - I just don't know what it
is. I saw that protocol-httpclient needs to be rewritten; is there
someone working on that?
I can work on more important and less controversial / radical things,
but I do think that having a more flexible, pluggable fetcher will be an
enormous improvement to Nutch and can greatly expand the potential uses
for it as a piece of software. There are a ton of cases where pluggable
fetching could be a huge improvement: local filesystem search,
single-threaded / small site indexing, email indexing (SMTP, POP, etc.),
etc. I suggested an extremely (perhaps too much so) abstract
architecture for fetching in ticket #1201, and for the sake of brevity I
won't repeat myself here, but I think that would give Nutch a good base
for flexible fetching, which I believe is a huge improvement to the
project. I'm obviously new to the development here and I'm willing to do
whatever needs doing; I just believe the fetching is something that
needs doing. I just want to contribute!
Thanks,
Eddie