RE: SerDe and Rows

John Sichi Thu, 27 May 2010 21:45:42 -0700

It is a helper method called by the main serialize method; it shouldn't 
actually be public at all.  Note that as an example, this code is good for 
understanding the interactions, but it is not a good example for code structure 
or performance; I'm working on refactoring it.

JVS

________________________________________
From: Sanjit Jhala [[email protected]]
Sent: Thursday, May 27, 2010 9:29 PM
To: [email protected]
Subject: Re: SerDe and Rows

Btw, what's the purpose of the alternative serialize API in HBaseSerDe.java?
public static boolean serialize(ByteStream.Output out, Object obj, 
ObjectInspector objInspector, byte[] separators, int level, Text nullSequence, 
boolean escaped, byte escapeChar, boolean[] needsEscape)

It doesn't look like this API is part of the SerDe interface and I'm wondering 
where it gets called from?

-Sanjit

On Wed, May 26, 2010 at 9:38 AM, Sanjit Jhala 
<[email protected]<mailto:[email protected]>> wrote:
Yes, I think you can do this as long as all inserts follow the scheme you 
mention. In fact you can even think of having point-in-time row versioning 
defined at some timestamp T as the collection of the latest versions of cells 
with timestamps <= T. As you rightly point out, it's all about the use case :)
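
A minimal sketch of that point-in-time definition, assuming each stored cell version carries a qualified column name, a timestamp, and a value. The class and method names (PointInTime, CellVersion, snapshotAt) are hypothetical and are not part of Hive, HBase, or Hypertable:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Hive,
// HBase, or Hypertable.
final class PointInTime {
    /** One stored version of a cell. */
    static final class CellVersion {
        final String column;   // qualified column, e.g. "cf:a"
        final long timestamp;
        final String value;

        CellVersion(String column, long timestamp, String value) {
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** For each column, keeps the newest version whose timestamp is <= t. */
    static Map<String, CellVersion> snapshotAt(List<CellVersion> versions, long t) {
        Map<String, CellVersion> latest = new HashMap<>();
        for (CellVersion v : versions) {
            if (v.timestamp > t) {
                continue; // written after the requested point in time
            }
            // Keep whichever of the two candidate versions is newer.
            latest.merge(v.column, v,
                (old, cur) -> cur.timestamp > old.timestamp ? cur : old);
        }
        return latest;
    }
}
```

A column whose versions are all newer than T is simply absent from the result, which is one reasonable way to treat a cell that did not yet exist at T.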

-Sanjit

On Tue, May 25, 2010 at 7:39 PM, John Sichi 
<[email protected]<mailto:[email protected]>> wrote:
Sorry for my slow response; answers below.

On May 20, 2010, at 5:23 PM, Sanjit Jhala wrote:

> Thanks John, that does look quite interesting. It looks like in addition to 
> containing a bunch of cells, the row class needs to provide some mechanism 
> (e.g. a map) to efficiently look up the cell corresponding to a given 
> qualified column (i.e. column family + qualifier). In the case where a Hive 
> column matches an entire column family, do you use this same map using the 
> property that the column family is a prefix of the map key, or is there an 
> additional map that maps the column family to a set of qualifiers or 
> directly to a set of cells?

There is a separate map (LazyHBaseCellMap).  LazyHBaseRow instantiates this for 
Hive column values which correspond to HBase column families.
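
To illustrate the "separate map" idea only, here is a rough sketch of a row that answers single-cell lookups from one map and builds a distinct per-family map on first use. SketchRow and its methods are hypothetical names, and this is not the actual LazyHBaseRow or LazyHBaseCellMap code; it also assumes family names contain no ':' character:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; these names are illustrative and this is not the
// actual LazyHBaseRow or LazyHBaseCellMap code.
final class SketchRow {
    // All cells of the row, keyed by "family:qualifier".
    private final Map<String, String> cells = new HashMap<>();
    // Separate per-family maps (qualifier -> value), built on first use.
    private final Map<String, Map<String, String>> familyMaps = new HashMap<>();

    void put(String family, String qualifier, String value) {
        cells.put(family + ":" + qualifier, value);
        familyMaps.remove(family); // drop any stale cached family map
    }

    /** Lookup for a Hive column mapped to a single qualified column. */
    String cell(String family, String qualifier) {
        return cells.get(family + ":" + qualifier);
    }

    /** Lookup for a Hive column mapped to an entire column family. */
    Map<String, String> familyMap(String family) {
        return familyMaps.computeIfAbsent(family, f -> {
            String prefix = f + ":";
            Map<String, String> m = new HashMap<>();
            for (Map.Entry<String, String> e : cells.entrySet()) {
                if (e.getKey().startsWith(prefix)) {
                    m.put(e.getKey().substring(prefix.length()), e.getValue());
                }
            }
            return Collections.unmodifiableMap(m);
        });
    }
}
```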

> The wiki also indicates that in future multiple versions of a cell could be 
> exposed to the storage handler since Hive can deal with non-unique rows. I 
> can definitely see how you should be able to store non-unique Hive rows in 
> Hypertable (since Hypertable supports multi-versioned cells), however since 
> the fundamental unit of storage in the BigTable design is a cell, I don't 
> understand how you propose to map multiple cell versions back to non-unique 
> Hive rows. Maybe you're thinking of mapping them to a single Hive row, where 
> the columns are of the List type? And then maybe the query language allows 
> you to filter by the first, last or any value in the list?

Yeah, I realized this recently when I started thinking about it again :)

Exposing per-cell timestamps is possible, and there are a number of ways to do 
it, including the one you mention.  But they're all unwieldy, so we should 
probably defer them until there's a very good use case.

A simpler scheme I'm thinking about is to map a Hive partition to a particular 
timestamp.  Then for queries, this will specify a point-in-time (we would need 
to validate that only equality predicates are used on the partition key since 
returning multiple versions of a row isn't well-defined as you correctly point 
out).  For inserts, all cells created would get the same timestamp.  Maybe this 
would cover the majority of use-cases?
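
The validation step could look roughly like the sketch below, assuming query predicates are available as simple (column, operator, value) triples. Op, Predicate, and resolveTimestamp are hypothetical names, not Hive APIs, and rejecting a query with no predicate on the partition key is an added assumption:

```java
import java.util.List;

// Hypothetical sketch; Op, Predicate, and resolveTimestamp are illustrative
// names, not Hive APIs.
final class PartitionTimestamp {
    enum Op { EQ, NE, LT, LE, GT, GE }

    static final class Predicate {
        final String column;
        final Op op;
        final long value;

        Predicate(String column, Op op, long value) {
            this.column = column;
            this.op = op;
            this.value = value;
        }
    }

    /**
     * Returns the single point-in-time selected on the partition key.
     * Throws if the key is constrained by anything other than equality,
     * by two different values, or (an assumption here) not at all.
     */
    static long resolveTimestamp(List<Predicate> predicates, String partitionKey) {
        Long timestamp = null;
        for (Predicate p : predicates) {
            if (!p.column.equals(partitionKey)) {
                continue; // predicates on other columns are not our concern
            }
            if (p.op != Op.EQ) {
                throw new IllegalArgumentException(
                    "only equality predicates are allowed on " + partitionKey);
            }
            if (timestamp != null && timestamp != p.value) {
                throw new IllegalArgumentException(
                    "conflicting timestamps on " + partitionKey);
            }
            timestamp = p.value;
        }
        if (timestamp == null) {
            throw new IllegalArgumentException(
                "no equality predicate on " + partitionKey);
        }
        return timestamp;
    }
}
```

On the insert side, the same resolved value would be stamped on every cell written for that partition.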

JVS
