Sebastien Rainville wrote:
Has anybody tried to use Nutch and HBase together yet?
Basically I need to store structured information extracted from the web
pages. Saving that data in a database like MySQL would be a temporary
option but in the long term, the amount of information will grow fast
and I'll need a more scalable system. That's where HBase comes into
play. The next logical move would then be to modify Nutch to save the
pages in HBase. The system would then be very flexible. Is that what you
guys have in mind for the future of Nutch?
In short - yes. However, at the moment HBase still seems too unstable to
integrate into Nutch. So basically we (at least Dennis and I) are
playing with it to get the feel of what's possible.
But for now, Nutch is not integrated with HBase... I can still write
Nutch extensions that save the structured data that I need into HBase.
Is there a way to make them interact smoothly? The first obvious problem
that I have is that they are built on different versions of
Hadoop. Is there a good way of doing it?
Good news: Nutch trunk has been updated to Hadoop 0.15. The first
official release of HBase also runs on Hadoop 0.15.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com