Marcus Herou wrote:
Hi, thanks for the answer.
I will not use HBase for free-text searching; for that, Lucene is far more
mature and scalable.
What I want to use HBase for is a more familiar and cleaner way of storing
data than large sequential files spread across HDFS.
Typical use-cases:
* Search with Lucene in some way: Solr, NutchBean etc.
* Get the actual data from HBase (or some other clustered DB) based on a
primary key which is stored in Lucene.
For occasional retrieval this might be OK; for quick access to many
random records, I doubt the performance would be acceptable.
* Applications get an easier integration point than using CrawlDb.get(...)
or dump.
* This way we don't store the same data in two (or more) places,
wasting disk.
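The second use-case above boils down to a two-tier lookup: search returns only a primary key, and the full record is fetched from the store by that key. A minimal self-contained sketch of that pattern (the in-memory maps are hypothetical stand-ins for the Lucene index and the HBase table; none of these names come from Nutch):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the two-tier lookup: an "index" mapping terms to primary
// keys (standing in for Lucene), and a key-value "table" mapping keys
// to full records (standing in for HBase). All names are hypothetical.
public class TwoTierLookup {
    // Stand-in for the Lucene index: term -> row keys.
    static Map<String, List<String>> index = new HashMap<>();
    // Stand-in for the HBase table: row key -> full record.
    static Map<String, String> table = new HashMap<>();

    static {
        table.put("row1", "full content of page 1");
        table.put("row2", "full content of page 2");
        index.put("nutch", List.of("row1", "row2"));
    }

    static List<String> fetch(String term) {
        List<String> records = new ArrayList<>();
        // Step 1: the "search" yields primary keys only.
        for (String key : index.getOrDefault(term, List.of())) {
            // Step 2: a random read of the full record by primary key --
            // this per-key read is where retrieval performance matters.
            String record = table.get(key);
            if (record != null) records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        for (String r : fetch("nutch")) System.out.println(r);
    }
}
```

With a real HBase table this second step would be one random `Get` per hit, which is exactly why bulk retrieval of many random records is the performance concern raised below.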
Hmm ... Keep in mind that parsed text and parse data are needed when
searching, and for this you need maximum performance. If you plan to
keep parse text and parse data in HBase, this means that you will have
to create a second copy of this data in a format suitable for fast
retrieval.
The "yes" answers in your mail: were they referring to actual implementations?
Unfortunately, no :) I only meant that, in my opinion, it would be
desirable to move this part of Nutch to HBase.
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com