Sami Siren wrote:
Lots of good thoughts and ideas, easy to agree with.

Something for the "ease of use" category:
-allow running on top of plain vanilla hadoop

What does "plain vanilla" mean here? Do you mean the current DB implementation? If so, that's the idea - we should aim for an abstraction layer that can accommodate both HBase and plain MapFile-s.
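Just to illustrate what I have in mind (a rough sketch - the names are made up, this is not an existing Nutch API): something as thin as this, with an HBase-backed and a MapFile-backed implementation behind the same interface:

  // Hypothetical value object holding the per-URL crawl data.
  class WebPage {
    byte[] content;
    float score;
    long fetchTime;
  }

  // Sketch of a backend-neutral store: the crawl jobs program against this,
  // while the HBase or MapFile implementation is chosen by configuration.
  interface WebPageStore extends java.io.Closeable {
    WebPage get(String url) throws java.io.IOException;
    void put(String url, WebPage page) throws java.io.IOException;
    java.util.Iterator<WebPage> scan(String startUrl, String stopUrl)
        throws java.io.IOException;
  }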

-split into reusable components with nice and clean public api
-publish mvn artifacts so developers can directly use mvn, ivy etc to pull required dependencies for their specific crawler

+1, with a slight preference for Ivy.
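For example, a specific crawler project could then pull in only the modules it needs through its ivy.xml - the coordinates and version below are made up, just to illustrate the idea:

  <dependencies>
    <dependency org="org.apache.nutch" name="nutch-core"          rev="2.0"/>
    <dependency org="org.apache.nutch" name="nutch-protocol-http" rev="2.0"/>
    <dependency org="org.apache.nutch" name="nutch-indexer-solr"  rev="2.0"/>
  </dependencies>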


My biggest concern is the execution of this (or any other) plan.
Some of the changes or improvements that have been proposed are quite "heavy" in nature and would require large changes. I am just wondering whether it would be better to take a fresh start instead of trying to do this incrementally on top of the existing code base.

Well ... that's (almost) what Dogacan did with the HBase port. I agree that we should not feel too constrained by the existing code base, but it would be silly to throw everything away and start from scratch - we need to find a middle ground. The crawler-commons and Tika projects should help us to get rid of the ballast and significantly reduce the size of our code.

In the history of Nutch this approach is not something new (remember MapReduce?) and in my opinion it worked nicely then. Perhaps it is different this time, since the changes we are discussing now leave many abstract things hanging in the air, even fundamental ones.

Nutch 0.7 to 0.8 reused a lot of the existing code.


Of course the rewrite approach means that it will take some time before we actually get to the point where we can start adding real substance (meaning new features etc.).

So to summarize, I would go ahead and put together a "nutch N.0" branch that would consist of the following (a.k.a. my wish list - I hope I am not being too aggressive here):

-runs on top of plain hadoop

See above - what do you mean by that?

-use OSGi (or some other, more optimal extension mechanism that fits and is easy to use)
-basic http/https crawling functionality (with a "db abstraction" or HBase directly, and smart data structures that allow flexible and efficient usage of the data)
-basic Solr integration for indexing/search
-basic parsing with Tika

After the basics are OK, we would start adding and promoting any of the hidden gems we might have, or some solutions to the interesting challenges.

I believe that's more or less where Dogacan's port is right now, except it's not merged with the OSGI port.
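To make the "basic parsing with Tika" / "basic Solr integration" items a bit more concrete, the core of it could be roughly as small as the sketch below. This assumes Tika's AutoDetectParser and SolrJ; the class and the field names ("url", "title", "content") are just placeholders for whatever the schema ends up defining:

  import java.io.InputStream;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;

  public class ParseAndIndex {
    public static void parseAndIndex(String url, InputStream content)
        throws Exception {
      // Let Tika detect the content type and extract plain text + metadata.
      BodyContentHandler text = new BodyContentHandler();
      Metadata meta = new Metadata();
      new AutoDetectParser().parse(content, text, meta);

      // Build a minimal Solr document and send it for indexing.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("url", url);
      doc.addField("title", meta.get(Metadata.TITLE));
      doc.addField("content", text.toString());

      SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
      solr.add(doc);
      solr.commit();
    }
  }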

ps. Many of the interesting challenges in your proposal seem to fall into the category of "data analysis and manipulation", which is mostly used after the data has been crawled, or between the fetch cycles, so many of those could also be implemented on top of the current code base. Somehow I just feel that things could be made more efficient and understandable if the foundation (e.g. data structures and extensibility) were in better shape. Also, if written nicely, other projects could use them too!

Definitely agree with this. Example: the PageRank package - it works quite well with the current code, but its design is obscured by the ScoringFilter API and the need to maintain its own extended DB-s.
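To sketch what a cleaner shape might look like (purely illustrative, not the actual ScoringFilter interface): link analysis could be a standalone job over the stored web graph, reading and writing scores in one place instead of being spread across scoring-filter callbacks and side DB-s:

  // Hypothetical interface; WebPageStore is the backend-neutral store
  // sketched earlier in this thread.
  interface LinkAnalysisJob {
    // Run a number of score-propagation iterations over the stored graph,
    // updating page scores in place.
    void run(WebPageStore store, int iterations) throws java.io.IOException;
  }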

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
