Sami Siren wrote:
> Lots of good thoughts and ideas, easy to agree with.
> Something for the "ease of use" category:
> -allow running on top of plain vanilla hadoop
What do you mean by "plain vanilla" here? Do you mean the current DB
implementation? That's the idea - we should aim for an abstraction layer
that can accommodate both HBase and plain MapFile-s.
> -split into reusable components with a nice and clean public API
> -publish mvn artifacts so developers can directly use mvn, ivy etc. to
> pull required dependencies for their specific crawler
+1, with slight preference towards ivy.
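E.g. a downstream crawler could then declare only what it needs in its
ivy.xml. The module names below are invented for illustration - nothing
like this is published yet:

```xml
<!-- Hypothetical ivy.xml fragment; artifact names are made up -->
<dependencies>
  <dependency org="org.apache.nutch" name="nutch-core" rev="2.0"/>
  <dependency org="org.apache.nutch" name="nutch-protocol-http" rev="2.0"/>
</dependencies>
```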
> My biggest concern is the execution of this (or any other) plan.
> Some of the changes or improvements that have been proposed are quite
> "heavy" in nature and would require large changes. I am just wondering
> whether it would be better to take a fresh start instead of trying to
> do this incrementally on top of the existing code base.
Well ... that's (almost) what Dogacan did with the HBase port. I agree
that we should not feel too constrained by the existing code base, but
it would be silly to throw everything away and start from scratch - we
need to find a middle ground. The crawler-commons and Tika projects
should help us to get rid of the ballast and significantly reduce the
size of our code.
> In the history of Nutch this approach is not something new (remember map
> reduce?) and in my opinion it worked nicely then. Perhaps it is
> different this time since the changes we are discussing now have many
> abstract things hanging in the air, even fundamental ones.
Nutch 0.7 to 0.8 reused a lot of the existing code.
> Of course the rewrite approach means that it will take some time before
> we actually get to the point where we can start adding real substance
> (meaning new features etc).
> So to summarize, I would go ahead and put together a branch "nutch N.0"
> that would consist of (a.k.a. my wish list, hope I am not being too
> aggressive here):
> -runs on top of plain hadoop
See above - what do you mean by that?
> -use osgi (or some other more optimal extension mechanism that fits and
> is easy to use)
> -basic http/https crawling functionality (with "db abstraction" or hbase
> directly, and smart data structures that allow flexible and efficient
> usage of the data)
> -basic solr integration for indexing/search
> -basic parsing with tika
> After the basics are ok we would start adding and promoting any of the
> hidden gems we might have, or some solutions for the interesting
> challenges.
I believe that's more or less where Dogacan's port is right now, except
it's not merged with the OSGI port.
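To make the "db abstraction" point a bit more concrete, here's the kind of
contract I have in mind. All names below are made up for illustration -
this is not the API of Dogacan's port:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical backend-neutral storage contract; an HBase table or a
// set of MapFiles would each implement the same interface.
interface WebPageStore {
    void put(String url, byte[] content);
    byte[] get(String url); // null if the page is not stored
}

// Trivial in-memory backend, just to show the contract in action.
class InMemoryStore implements WebPageStore {
    private final Map<String, byte[]> pages = new HashMap<String, byte[]>();
    public void put(String url, byte[] content) { pages.put(url, content); }
    public byte[] get(String url) { return pages.get(url); }
}

public class StoreDemo {
    public static void main(String[] args) {
        WebPageStore store = new InMemoryStore();
        store.put("http://example.com/", "hello".getBytes());
        System.out.println(new String(store.get("http://example.com/")));
    }
}
```

The crawler code would only ever see WebPageStore, so swapping backends
becomes a configuration choice rather than a code change.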
> ps. many of the interesting challenges in your proposal seem to fall into
> the category of "data analysis and manipulation", which is mostly done
> after the data has been crawled, or between fetch cycles, so many of
> those could be implemented on the current code base too. Somehow I just
> feel that things could be made more efficient and understandable if the
> foundation (e.g. data structures, extensibility) was in
> better shape. Also, if written nicely, other projects could use them too!
Definitely agree with this. Example: the PageRank package - it works
quite well with the current code, but its design is obscured by the
ScoringFilter API and the need to maintain its own extended DB-s.
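Just to illustrate what "analysis between fetch cycles" looks like when it
doesn't have to squeeze through the ScoringFilter API - a toy PageRank
iteration over a hard-coded link graph. This is not the actual package,
just a self-contained sketch:

```java
// Toy PageRank -- purely illustrative, not the Nutch PageRank package.
public class PageRankSketch {
    // graph[i] lists the pages that page i links to; every page must
    // have at least one outlink (no dangling-node handling here).
    static double[] ranks(int[][] graph, int iterations) {
        int n = graph.length;
        double d = 0.85; // damping factor
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1.0 - d) / n);
            for (int i = 0; i < n; i++)
                for (int j : graph[i])
                    next[j] += d * rank[i] / graph[i].length;
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // A -> B, C;  B -> C;  C -> A
        double[] r = ranks(new int[][] { {1, 2}, {2}, {0} }, 50);
        System.out.printf("A=%.3f B=%.3f C=%.3f%n", r[0], r[1], r[2]);
    }
}
```

The point is that this kind of job only needs read access to the link
structure and a place to write scores back - a cleaner foundation would
let it be an ordinary batch job instead of a scoring plugin.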
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com