Sorry for my slow response; answers below.

On May 20, 2010, at 5:23 PM, Sanjit Jhala wrote:

> Thanks John, that does look quite interesting. It looks like in addition to
> containing a bunch of cells, the row class needs to provide some mechanism
> (e.g. a map) to efficiently look up the cell corresponding to a given qualified
> column (i.e. column family + qualifier). In the case where a Hive column
> matches an entire column family, do you use this same map using the property
> that the column family is a prefix of the map key or is there an additional
> map that maps the column family to a set of qualifiers or directly to a set
> of cells?

There is a separate map (LazyHBaseCellMap).  LazyHBaseRow instantiates it for
Hive column values that correspond to entire HBase column families.
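
To make the two lookup paths concrete, here is a rough sketch in Python. It is
not the actual LazyHBaseRow/LazyHBaseCellMap code; the class and method names
(RowSketch, get_cell, get_family) are made up for illustration, and it only
shows the idea of keeping a per-family map alongside the per-column lookup.

```python
from collections import defaultdict


class RowSketch:
    """Hypothetical stand-in for a row object; NOT the real LazyHBaseRow."""

    def __init__(self, cells):
        # cells: iterable of (family, qualifier, value) triples.
        self._by_column = {}                  # (family, qualifier) -> value
        self._by_family = defaultdict(dict)   # family -> {qualifier: value}
        for family, qualifier, value in cells:
            self._by_column[(family, qualifier)] = value
            self._by_family[family][qualifier] = value

    def get_cell(self, family, qualifier):
        # Hive column mapped to a single qualified HBase column.
        return self._by_column.get((family, qualifier))

    def get_family(self, family):
        # Hive column mapped to a whole column family: return a copy of the
        # separate per-family map (empty if the family has no cells).
        return dict(self._by_family.get(family, {}))


row = RowSketch([("cf1", "a", "1"), ("cf1", "b", "2"), ("cf2", "c", "3")])
single = row.get_cell("cf1", "b")
whole_family = row.get_family("cf1")
```

With a separate per-family map there is no need to rely on the family being a
prefix of the qualified-column key.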

> The wiki also indicates that in future multiple versions of a cell could be 
> exposed to the storage handler since Hive can deal with non-unique rows. I 
> can definitely see how you should be able to store non-unique Hive rows in
> Hypertable (since Hypertable supports multi-versioned cells), however since 
> the fundamental unit of storage in the BigTable design is a cell, I don't 
> understand how you propose to map multiple cell versions back to non-unique 
> Hive rows. Maybe you're thinking of mapping them to a single Hive row, where 
> the columns are of the List type? And then maybe the query language allows 
> you to filter by the first, last or any value in the list?


Yeah, I realized this recently when I started thinking about it again :)

Exposing per-cell timestamps is possible, and there are a number of ways to do 
it, including the one you mention.  But they're all unwieldy, so we should 
probably defer them until there's a very good use case.

A simpler scheme I'm thinking about is to map a Hive partition to a particular
timestamp.  For queries, the partition value would then specify a point in time
(we would need to validate that only equality predicates are used on the
partition key, since returning multiple versions of a row isn't well-defined, as
you correctly point out).  For inserts, all cells created would get the same
timestamp.  Maybe this would cover the majority of use cases?
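
Roughly, the two halves of that idea could look like the sketch below. This is
only an illustration in Python under my own assumptions: the predicate
representation, the partition key name "ts", and the function names are all
hypothetical, and nothing here is existing Hive or HBase API.

```python
class PartitionPredicateError(ValueError):
    """Raised when the partition key is not pinned by a single equality."""


def resolve_timestamp(predicates, partition_key="ts"):
    # predicates: list of (column, operator, value) tuples (assumed format).
    on_key = [p for p in predicates if p[0] == partition_key]
    # Exactly one equality predicate is accepted; ranges, IN lists, or a
    # missing predicate would imply multiple versions of a row.
    if len(on_key) != 1 or on_key[0][1] != "=":
        raise PartitionPredicateError(
            "partition key %r needs exactly one equality predicate"
            % partition_key
        )
    return on_key[0][2]


def cells_for_insert(row_key, columns, timestamp):
    # columns: {(family, qualifier): value}. Every cell written for the
    # partition is stamped with the same timestamp.
    return [
        (row_key, family, qualifier, timestamp, value)
        for (family, qualifier), value in columns.items()
    ]


ts = resolve_timestamp([("ts", "=", 1274400000), ("name", "=", "x")])
cells = cells_for_insert("row1", {("cf1", "a"): "1", ("cf1", "b"): "2"}, ts)
```

A query with a range predicate on the partition key (say "ts" > some value)
would be rejected up front instead of returning an ill-defined result.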

JVS
