Marcus Herou wrote:
Hi, thanks for the answer.
I will not use HBase for free-text searching; for that, Lucene is far more
mature and scalable.
What I want to use HBase for is a more familiar and cleaner way of storing
data than large sequential files spread across HDFS.
Typical use-cases:
* Search with Lucene in some way: Solr, NutchBean etc.
* Get the actual data from HBase (or some other clustered DB) based on a
primary key which is stored in Lucene.
For occasional retrieval this might be OK; for quick access to many
random records, I doubt the performance would be acceptable.
* Applications get an easier integration point than using CrawlDb.get(...)
or dump.
* This way we don't store the same data in two (or more) places,
wasting disk.
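The second use-case above boils down to a two-tier lookup: search returns only a primary key, and the full record is fetched from the store by that key. A minimal self-contained sketch of that pattern (the in-memory maps are hypothetical stand-ins for the Lucene index and the HBase table; none of these names come from Nutch):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the two-tier lookup: an "index" mapping terms to primary
// keys (standing in for Lucene), and a key-value "table" mapping keys
// to full records (standing in for HBase). All names are hypothetical.
public class TwoTierLookup {
    // Stand-in for the Lucene index: term -> row keys.
    static Map<String, List<String>> index = new HashMap<>();
    // Stand-in for the HBase table: row key -> full record.
    static Map<String, String> table = new HashMap<>();

    static {
        table.put("row1", "full content of page 1");
        table.put("row2", "full content of page 2");
        index.put("nutch", List.of("row1", "row2"));
    }

    static List<String> fetch(String term) {
        List<String> records = new ArrayList<>();
        // Step 1: the "search" yields primary keys only.
        for (String key : index.getOrDefault(term, List.of())) {
            // Step 2: a random read of the full record by primary key --
            // this per-key read is where retrieval performance matters.
            String record = table.get(key);
            if (record != null) records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        for (String r : fetch("nutch")) System.out.println(r);
    }
}
```

With a real HBase table this second step would be one random `Get` per hit, which is exactly why bulk retrieval of many random records is the performance concern raised below.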
Hmm ... Keep in mind that parsed text and parse data are needed when
searching, and for this you need maximum performance. If you plan to
keep parse text and parse data in HBase, this means that you will have
to create a second copy of this data in a format suitable for fast
retrieval.
The "yes" answers in your mail: were they referring to actual implementations?
Unfortunately, no :) I only meant that, in my opinion, it would be
desirable to move this part of Nutch to HBase.
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com