Hi.

Cool! This is a much lower-level and probably better approach than ours. We
now have a functional index, which however only supports primitive types
and not free-text indexing. It can store duplicates of the data in the
index for fast retrieval. It is mostly used as a test of how to scale
indexing alongside HBase. In the end we will probably stick with Lucene.

In the end we will probably subclass HRegion, HTable etc. as well, but for
now we have a system which instead uses the existing framework.

I understand that you would like to use HDFS for storing stuff... But have
you tried GlusterFS?

It is so simple and really works as a normal POSIX file system. We will
store our Solr-based index files in GlusterFS. Actually I think we will use
GlusterFS as the storage mechanism for HDFS as well :) Stupid, but we have
some very capable storage machines which are much faster than a bunch of
local machines.

The community should really spend some time looking at what is, to my
knowledge, the first clustered file system which will lower storage costs
by making SAN a commodity. Yes, we have Lustre, yes, we have KosmosFS, but
have you ever tried to install Lustre? Phew... Enough about GlusterFS, this
is an HBase mailing list :)

Kindly

//Marcus

On Tue, Aug 5, 2008 at 4:58 PM, Ning Li <[EMAIL PROTECTED]> wrote:

> We have been working on supporting Lucene-based index in HBase.
> In a nutshell, we extend the region to support indexing on column(s).
>
> We have a working implementation of our design. An overview of our
> design and the preliminary performance evaluation is provided below.
> We welcome feedback and we would be happy to contribute the code
> to HBase once the major performance issue is resolved.
>
> DATA MODEL
> An index can be created for a column, a column family or all the
> columns. In the implementation, we extend the HRegion class so that
> it not only manages store files which store the column values of a
> region, but also Lucene instances which are used to support indexing
> on columns.
>
> The following assumes a per-column index and in the end we'll briefly
> describe how per-column family index and all-column index work.
>
> UPDATING A COLUMN
> Upon receiving a column update request, a region not only adds the
> column to the cache part of the store, but also analyzes the column
> and adds it to the cache part of the index. Same as the store files,
> the Lucene index files are also written to HDFS.
>
> Following the HBase design, to avoid resource contention, a region
> server globally schedules the cache flush and the compaction of both
> the store files and the index files of all the regions on the server.
>
> QUERYING AN INDEX
> We add to HTable the following method to enable querying an index.
>    Results search(range, column, query, max_num_hits);
> Depending on the specified key range, a client sends a search request
> to one or more region servers, who call the search method of queried
> regions. The client will merge the results from all the queried regions.
>
> In the current implementation, queries are conducted on the index files
> stored in HDFS.
>
> SPLITTING A REGION
> The region split works the same way as before - in addition to creating
> reference files for store files, reference files are also created for index
> files in the child regions. The old parent region will be deleted once
> all the reference files are deleted.
>
> PERFORMANCE ISSUES
> Our preliminary performance experiments show that the performance
> of building an index is quite reasonable. However, the performance of
> random reads in HDFS is so poor that the search performance is
> dramatically worse than that on local file systems.
>
> We are exploring different ways to solve this problem. One possibility
> is to store a copy on local file system. On the other hand, most likely
> HDFS already stores a local copy...
>
> VARIATIONS
> As we mentioned earlier, an index can also be created for a column
> family or for all the columns. If an index is created for a column family,
> whenever a column is updated, the rest of the column family needs to
> be retrieved to re-index the column family. This adds some overhead
> to the indexing process. Also, it is an open question what the best
> versioning semantics are.
>
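
As an illustration of the client-side merge described under QUERYING AN
INDEX in the quoted message, here is a minimal sketch in Java. It is not
the code from the quoted implementation: the class RegionHitMerger, the Hit
type and the idea that each hit carries a relevance score are all
assumptions made for the example. It only shows one plausible way a client
could combine per-region result lists into a single list of at most
max_num_hits entries.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper: merges search hits returned by several regions. */
public class RegionHitMerger {

    /** Hypothetical hit type: a row key plus a relevance score (assumed). */
    public static final class Hit {
        public final String rowKey;
        public final float score;

        public Hit(String rowKey, float score) {
            this.rowKey = rowKey;
            this.score = score;
        }
    }

    /**
     * Returns the maxNumHits best-scoring hits across all regions,
     * ordered from highest to lowest score.
     */
    public static List<Hit> merge(List<List<Hit>> perRegionHits, int maxNumHits) {
        List<Hit> merged = new ArrayList<>();
        if (maxNumHits <= 0) {
            return merged;
        }
        // Min-heap on score: the weakest retained hit is always at the head,
        // so it can be evicted cheaply when a better hit arrives.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (List<Hit> regionHits : perRegionHits) {
            for (Hit hit : regionHits) {
                heap.add(hit);
                if (heap.size() > maxNumHits) {
                    heap.poll(); // drop the current lowest-scoring hit
                }
            }
        }
        // Draining the heap yields ascending scores; reverse for best-first.
        while (!heap.isEmpty()) {
            merged.add(heap.poll());
        }
        Collections.reverse(merged);
        return merged;
    }
}
```

The bounded heap keeps client memory proportional to max_num_hits rather
than to the total number of hits returned by all queried regions.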



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
