Re: [DISCUSS] Issues with Fetcher

Ken Krugler Sat, 21 Jan 2012 10:33:17 -0800

Hi Eddie,

My own personal favorite area would be to integrate with crawler-commons.


There's been some occasional work done to move things into this shared project 
- e.g. robots parser & a base HTTP fetcher from Bixo.

I believe there's a Jira issue open to switch Nutch to using that robots.txt 
parser, which would be an improvement over what Nutch currently has.

There are other pieces of Nutch that could/eventually should be moved there, 
e.g. URL normalization, but that doesn't directly benefit Nutch, just other 
Java-based crawlers.

Or, if you have experience with JSPs/GUI work, then I think there's this big 
open issue around improving the Nutch GUI, which would likely provide the most 
benefit to the most users. I haven't been following the current status, but I 
know that there have been periodic discussions, and I think 101tec did some 
work on this a while back (for a client), but I don't know if that's been 
contributed (or could be, for that matter).

-- Ken

On Jan 21, 2012, at 8:17am, Edward Drapkin wrote:

> On 1/21/2012 8:27 AM, Lewis John Mcgibbney wrote:
>> 
>> Hi Julien,
>> 
>> 
>> There are 8 issues in trunk about the fetcher - some of them unrelated to 
>> the Fetcher (NUTCH-827 / NUTCH-1193) with most of the others being 
>> improvements (NUTCH-828 / NUTCH-1079) with possibly just a very few being 
>> real issues.
>>  
>> This puts the whole discussion into much better context, thanks for pointing 
>> this out. Maybe I should have made it more clear, that I only filtered the 
>> fetcher issues on our Jira and I was simply modelling my discussion around 
>> that. You are completely correct though, it would be different if the 
>> fetcher was in a similar state to protocol-httpclient... which it is 
>> obviously not.
>>  
>> I am also concerned about getting too radical changes to such a core part of 
>> the framework, especially when more pressing issues could be looked after 
>> instead.
>> +1
>>  
>> Having said that if someone can come up with an interesting proposal for 
>> improving the Fetcher that would be very good, I would simply suggest that 
>> we then have a separate implementation for that.
>> +1
>>  
>> 
>> 
>> Ok with this in mind then, is there some guidance we can communicate to 
>> Eddie? He has specifically mentioned that he shares similar opinions wrt the 
>> fetcher being a core part of Nutch, radical changes etc, and I also share 
>> this point of view. He has also added that he doesn't want to spend the time 
>> changing material which we may or may not merge with trunk, this also makes 
>> perfect sense. Additionally Ken's comments emphasise that this has been 
>> somewhat attempted in the past and that lessons have been learned and the 
>> implementation we have cuts the mustard as is. 
>> Maybe we could nudge Eddie in the right direction, which would benefit both 
>> himself and the project over the next while, I think this was the most 
>> important point I was trying to emphasise, however looking over my original 
>> comment this was maybe not how it was written.
>> 
>> Thanks
>> Lewis
> 
> If there are more important and/or interesting things for me to work on, I'll 
> be glad to.  I'm completely unfamiliar with the current state of the project 
> as a whole - and looking through JIRA is a bit daunting.  The only reason I'm 
> attracted to working on the fetcher is I think it's a really interesting and 
> compelling problem to solve, and making it more flexible is something 
> that would directly benefit our use for it, so it will be easier to devote 
> time to it while I'm at the office.  I do have a glut of free time at the 
> moment though, so I'm perfectly okay working on another area that's more 
> pressing - I just don't know what it is.  I saw that protocol-httpclient 
> needs to be rewritten, is there someone working on that?
> 
> I can work on more important and less controversial / radical things, but I 
> do think that having a more flexible, pluggable fetcher will be an enormous 
> improvement to Nutch and can greatly expand the potential uses for it as a 
> piece of software.  There's a ton of cases where pluggable fetching could 
> be a huge improvement: local filesystem search, single-threaded / small 
> site indexing, email indexing (SMTP, POP, etc.), etc.  I suggested an 
> extremely (perhaps too much so) abstract architecture for fetching in ticket 
> #1201, and for the sake of brevity I won't repeat myself here, but I think 
> that would give Nutch a good base for flexible fetching, which I believe is a 
> huge improvement to the project.  I'm obviously new to the development here 
> and I'm willing to do whatever needs doing, I just believe the fetching is 
> something that needs doing.  I just want to contribute!
> 
> Thanks,
> Eddie

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
