On 1/21/2012 8:27 AM, Lewis John Mcgibbney wrote:
Hi Julien,


    There are 8 issues in trunk about the fetcher - some of them
    unrelated to the Fetcher (NUTCH-827
    <https://issues.apache.org/jira/browse/NUTCH-827> / NUTCH-1193)
    with most of the others being improvements (NUTCH-828
    <https://issues.apache.org/jira/browse/NUTCH-828> / NUTCH-1079
    <https://issues.apache.org/jira/browse/NUTCH-1079>) with possibly
    just a very few being real issues.

This puts the whole discussion into much better context; thanks for pointing this out. Maybe I should have made it clearer that I only filtered the fetcher issues in our Jira and was simply modelling my discussion around that. You are completely correct, though: it would be different if the fetcher were in a similar state to protocol-httpclient, which it obviously is not.

    I am also concerned about getting too radical changes to such a
    core part of the framework, especially when more pressing issues
    could be looked after instead.

+1

    Having said that if someone can come up with an interesting
    proposal for improving the Fetcher that would be very good, I
    would simply suggest that we then have a separate implementation
    for that.

+1



OK, with this in mind, is there some guidance we can communicate to Eddie? He has specifically mentioned that he shares similar opinions wrt the fetcher being a core part of Nutch, radical changes, etc., and I also share this point of view. He has also added that he doesn't want to spend time changing material which we may or may not merge into trunk, which also makes perfect sense. Additionally, Ken's comments emphasise that this has been attempted to some extent in the past, that lessons have been learned, and that the implementation we have cuts the mustard as is. Maybe we could nudge Eddie in the right direction, which would benefit both him and the project over the next while. I think this was the most important point I was trying to make; looking over my original comment, though, that is maybe not how it came across.

Thanks
Lewis

If there are more important and/or interesting things for me to work on, I'll be glad to. I'm completely unfamiliar with the current state of the project as a whole - and looking through JIRA is a bit daunting. The only reason I'm attracted to working on the fetcher is that I think it's a really interesting and compelling problem to solve, and making it more flexible is something that would directly benefit our use of it, so it will be easier to devote time to it while I'm at the office. I do have a glut of free time at the moment though, so I'm perfectly okay working on another area that's more pressing - I just don't know what it is. I saw that protocol-httpclient needs to be rewritten; is there someone working on that?

I can work on more important and less controversial / radical things, but I do think that having a more flexible, pluggable fetcher will be an enormous improvement to Nutch and can greatly expand the potential uses for it as a piece of software. There are a ton of cases where pluggable fetching could be a huge improvement: local filesystem search, single-threaded / small site indexing, email indexing (SMTP, POP, etc.), and so on. I suggested an extremely (perhaps excessively) abstract architecture for fetching in ticket #1201, and for the sake of brevity I won't repeat myself here, but I think that would give Nutch a good base for flexible fetching, which I believe would be a huge improvement to the project. I'm obviously new to the development here and I'm willing to do whatever needs doing; I just believe the fetching is something that needs doing. I just want to contribute!

Thanks,
Eddie
