Hi Eddie,

My own personal favorite area would be to integrate with crawler-commons. There's been some occasional work done to move things into this shared project, e.g. a robots.txt parser and a base HTTP fetcher from Bixo. I believe there's a Jira issue open to switch Nutch to using that robots.txt parser, which would be an improvement over what Nutch currently has.

There are other pieces of Nutch that could (and eventually should) be moved there, e.g. URL normalization, but that doesn't directly benefit Nutch, just other Java-based crawlers.

Or, if you have experience with JSPs/GUI work, then I think there's this big open issue around improving the Nutch GUI, which would likely provide the most benefit to the most users. I haven't been following the current status, but I know that there have been periodic discussions, and I think 101tec did some work on this a while back (for a client), but I don't know if that's been contributed (or could be, for that matter).

-- Ken

On Jan 21, 2012, at 8:17am, Edward Drapkin wrote:

> On 1/21/2012 8:27 AM, Lewis John Mcgibbney wrote:
>>
>> Hi Julien,
>>
>> There are 8 issues in trunk about the fetcher - some of them unrelated to
>> the Fetcher (NUTCH-827 / NUTCH-1193) with most of the others being
>> improvements (NUTCH-828 / NUTCH-1079) with possibly just a very few being
>> real issues.
>>
>> This puts the whole discussion into much better context, thanks for pointing
>> this out. Maybe I should have made it more clear that I only filtered the
>> fetcher issues on our Jira and I was simply modelling my discussion around
>> that. You are completely correct though, it would be different if the
>> fetcher was in a similar state to protocol-httpclient... which it is
>> obviously not.
>>
>> I am also concerned about getting too radical changes to such a core part of
>> the framework, especially when more pressing issues could be looked after
>> instead.
>> +1
>>
>> Having said that if someone can come up with an interesting proposal for
>> improving the Fetcher that would be very good, I would simply suggest that
>> we then have a separate implementation for that.
>> +1
>>
>> Ok with this in mind then, is there some guidance we can communicate to
>> Eddie? He has specifically mentioned that he shares similar opinions wrt the
>> fetcher being a core part of Nutch, radical changes etc, and I also share
>> this point of view. He has also added that he doesn't want to spend the time
>> changing material which we may or may not merge with trunk, this also makes
>> perfect sense. Additionally Ken's comments emphasise that this has been
>> somewhat attempted in the past and that lessons have been learned and the
>> implementation we have cuts the mustard as is.
>> Maybe we could nudge Eddie in the right direction, which would benefit both
>> himself and the project over the next while, I think this was the most
>> important point I was trying to emphasise, however looking over my original
>> comment this was maybe not how it was written.
>>
>> Thanks
>> Lewis
>
> If there are more important and/or interesting things for me to work on, I'll
> be glad to. I'm completely unfamiliar with the current state of the project
> as a whole - and looking through JIRA is a bit daunting. The only reason I'm
> attracted to working on the fetcher is I think it's a really interesting and
> compelling problem to solve, and making it more flexible is something
> that would directly benefit our use for it, so it will be easier to devote
> time to it while I'm at the office. I do have a glut of free time at the
> moment though, so I'm perfectly okay working on another area that's more
> pressing - I just don't know what it is. I saw that protocol-httpclient
> needs to be rewritten, is there someone working on that?
>
> I can work on more important and less controversial / radical things, but I
> do think that having a more flexible, pluggable fetcher will be an enormous
> improvement to Nutch and can greatly expand the potential uses for it as a
> piece of software. There's a ton of cases where pluggable fetching could
> be a huge improvement: local filesystem search, single-threaded / small
> site indexing, email indexing (SMTP, POP, etc.), etc. I suggested an
> extremely (perhaps too much so) abstract architecture for fetching in ticket
> #1201, and for the sake of brevity I won't repeat myself here, but I think
> that would give Nutch a good base for flexible fetching, which I believe is a
> huge improvement to the project. I'm obviously new to the development here
> and I'm willing to do whatever needs doing, I just believe the fetching is
> something that needs doing. I just want to contribute!
>
> Thanks,
> Eddie

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

