Sami Siren wrote:
> Lots of good thoughts and ideas, easy to agree with.
> Something for the "ease of use" category:
> -allow running on top of plain vanilla hadoop
What do you mean by "plain vanilla" here? Do you mean the current DB
implementation? That's the idea - we should aim for an abstraction layer
that can accommodate both HBase and plain MapFile-s.
> -split into reusable components with a nice and clean public API
> -publish mvn artifacts so developers can directly use mvn, ivy etc. to
> pull required dependencies for their specific crawler
+1, with slight preference towards ivy.
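E.g. a downstream crawler could then declare only what it needs in its
ivy.xml. The module names below are invented for illustration - nothing
like this is published yet:

```xml
<!-- Hypothetical ivy.xml fragment; artifact names are made up -->
<dependencies>
  <dependency org="org.apache.nutch" name="nutch-core" rev="2.0"/>
  <dependency org="org.apache.nutch" name="nutch-protocol-http" rev="2.0"/>
</dependencies>
```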
> My biggest concern is the execution of this (or any other) plan.
> Some of the changes or improvements that have been proposed are quite
> "heavy" in nature and would require large changes. I am just wondering
> whether it would be better to take a fresh start instead of trying to
> do this incrementally on top of the existing code base.
Well ... that's (almost) what Dogacan did with the HBase port. I agree
that we should not feel too constrained by the existing code base, but
it would be silly to throw everything away and start from scratch - we
need to find a middle ground. The crawler-commons and Tika projects
should help us to get rid of the ballast and significantly reduce the
size of our code.
> In the history of Nutch this approach is not something new (remember map
> reduce?) and in my opinion it worked nicely then. Perhaps it is
> different this time since the changes we are discussing now have many
> abstract things hanging in the air, even fundamental ones.
Nutch 0.7 to 0.8 reused a lot of the existing code.
> Of course the rewrite approach means that it will take some time before
> we actually get to the point where we can start adding real substance
> (meaning new features etc).
> So to summarize, I would go ahead and put together a branch "nutch N.0"
> that would consist of (a.k.a. my wish list, hope I am not being too
> aggressive here):
> -runs on top of plain hadoop
See above - what do you mean by that?
> -use osgi (or some other more optimal extension mechanism that fits and
> is easy to use)
> -basic http/https crawling functionality (with "db abstraction" or hbase
> directly, and smart data structures that allow flexible and efficient
> usage of the data)
> -basic solr integration for indexing/search
> -basic parsing with tika
> After the basics are ok we would start adding and promoting any of the
> hidden gems we might have, or some solutions for the interesting
> challenges.
I believe that's more or less where Dogacan's port is right now, except
it's not merged with the OSGI port.
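To make the "db abstraction" point a bit more concrete, here's the kind of
contract I have in mind. All names below are made up for illustration -
this is not the API of Dogacan's port:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical backend-neutral storage contract; an HBase table or a
// set of MapFiles would each implement the same interface.
interface WebPageStore {
    void put(String url, byte[] content);
    byte[] get(String url); // null if the page is not stored
}

// Trivial in-memory backend, just to show the contract in action.
class InMemoryStore implements WebPageStore {
    private final Map<String, byte[]> pages = new HashMap<String, byte[]>();
    public void put(String url, byte[] content) { pages.put(url, content); }
    public byte[] get(String url) { return pages.get(url); }
}

public class StoreDemo {
    public static void main(String[] args) {
        WebPageStore store = new InMemoryStore();
        store.put("http://example.com/", "hello".getBytes());
        System.out.println(new String(store.get("http://example.com/")));
    }
}
```

The crawler code would only ever see WebPageStore, so swapping backends
becomes a configuration choice rather than a code change.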
> ps. many of the interesting challenges in your proposal seem to fall into
> the category of "data analysis and manipulation", which is mostly done
> after the data has been crawled, or between fetch cycles, so many of
> those could be implemented on the current code base too. Somehow I just
> feel that things could be made more efficient and understandable if the
> foundation (e.g. data structures, extensibility) was in
> better shape. Also, if written nicely, other projects could use them too!
Definitely agree with this. Example: the PageRank package - it works
quite well with the current code, but its design is obscured by the
ScoringFilter API and the need to maintain its own extended DB-s.
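Just to illustrate what "analysis between fetch cycles" looks like when it
doesn't have to squeeze through the ScoringFilter API - a toy PageRank
iteration over a hard-coded link graph. This is not the actual package,
just a self-contained sketch:

```java
// Toy PageRank -- purely illustrative, not the Nutch PageRank package.
public class PageRankSketch {
    // graph[i] lists the pages that page i links to; every page must
    // have at least one outlink (no dangling-node handling here).
    static double[] ranks(int[][] graph, int iterations) {
        int n = graph.length;
        double d = 0.85; // damping factor
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1.0 - d) / n);
            for (int i = 0; i < n; i++)
                for (int j : graph[i])
                    next[j] += d * rank[i] / graph[i].length;
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // A -> B, C;  B -> C;  C -> A
        double[] r = ranks(new int[][] { {1, 2}, {2}, {0} }, 50);
        System.out.printf("A=%.3f B=%.3f C=%.3f%n", r[0], r[1], r[2]);
    }
}
```

The point is that this kind of job only needs read access to the link
structure and a place to write scores back - a cleaner foundation would
let it be an ordinary batch job instead of a scoring plugin.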
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com