Re: SerDe and Rows

Sanjit Jhala Wed, 26 May 2010 09:45:25 -0700

Yes, I think you can do this as long as all inserts follow the scheme you
mention. In fact, you can even think of point-in-time row versioning at
some timestamp T as the collection of the latest versions of cells with
timestamps <= T. As you rightly point out, it's all about the use case :)
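
To make the point-in-time idea concrete, here is a rough sketch in Python. It is
not Hive, HBase or Hypertable code: the in-memory layout and the names `cells`
and `snapshot_at` are made up for illustration, and timestamps are assumed to be
plain integers.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (row_key, qualified_column) -> list of (timestamp, value)
Cells = Dict[Tuple[str, str], List[Tuple[int, str]]]


def snapshot_at(cells: Cells, t: int) -> Dict[str, Dict[str, str]]:
    """Build the row view at time t: for every cell, keep only the latest
    version whose timestamp is <= t. Cells with no such version are skipped."""
    rows: Dict[str, Dict[str, str]] = {}
    for (row_key, column), versions in cells.items():
        visible = [v for v in versions if v[0] <= t]
        if not visible:
            continue  # this cell did not exist yet at time t
        _, value = max(visible, key=lambda v: v[0])
        rows.setdefault(row_key, {})[column] = value
    return rows


cells: Cells = {
    ("r1", "cf:a"): [(1, "x"), (5, "y")],
    ("r1", "cf:b"): [(3, "p")],
    ("r2", "cf:a"): [(7, "z")],
}
view_at_4 = snapshot_at(cells, 4)
view_at_7 = snapshot_at(cells, 7)
```

With this data, the view at T=4 contains only row r1 with the older value of
cf:a, while the view at T=7 picks up the newer cf:a value and row r2. Each row
appears at most once per T, which is why an equality predicate on the timestamp
keeps the result well-defined.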


-Sanjit


On Tue, May 25, 2010 at 7:39 PM, John Sichi <[email protected]> wrote:

> Sorry for my slow response; answers below.
>
> On May 20, 2010, at 5:23 PM, Sanjit Jhala wrote:
>
> > Thanks John, that does look quite interesting. It looks like in addition
> > to containing a bunch of cells, the row class needs to provide some
> > mechanism (e.g. a map) to efficiently look up the cell corresponding to a
> > given qualified column (i.e. column family + qualifier). In the case where
> > a Hive column matches an entire column family, do you use this same map,
> > relying on the property that the column family is a prefix of the map key,
> > or is there an additional map that maps the column family to a set of
> > qualifiers or directly to a set of cells?
>
> There is a separate map (LazyHBaseCellMap).  LazyHBaseRow instantiates this
> for Hive column values which correspond to HBase column families.
>
> > The wiki also indicates that in future multiple versions of a cell could
> > be exposed to the storage handler since Hive can deal with non-unique
> > rows. I can definitely see how you should be able to store non-unique Hive
> > rows in Hypertable (since Hypertable supports multi-versioned cells).
> > However, since the fundamental unit of storage in the BigTable design is a
> > cell, I don't understand how you propose to map multiple cell versions
> > back to non-unique Hive rows. Maybe you're thinking of mapping them to a
> > single Hive row, where the columns are of the List type? And then maybe
> > the query language allows you to filter by the first, last or any value in
> > the list?
>
>
> Yeah, I realized this recently when I started thinking about it again :)
>
> Exposing per-cell timestamps is possible, and there are a number of ways to
> do it, including the one you mention.  But they're all unwieldy, so we
> should probably defer them until there's a very good use case.
>
> A simpler scheme I'm thinking about is to map a Hive partition to a
> particular timestamp.  Then for queries, this will specify a point-in-time
> (we would need to validate that only equality predicates are used on the
> partition key since returning multiple versions of a row isn't well-defined
> as you correctly point out).  For inserts, all cells created would get the
> same timestamp.  Maybe this would cover the majority of use-cases?
>
> JVS
>
>
