John, there's some logic in the helper serialize method to serialize lists
and structs. Is this used currently? I was under the impression that maps
and primitives are the only types currently supported by the connector.

-Sanjit

On Thu, May 27, 2010 at 9:53 PM, Sanjit Jhala <[email protected]> wrote:

> I completely missed that. Being public threw me off.
>
> -Sanjit
>
>
> On Thu, May 27, 2010 at 9:43 PM, John Sichi <[email protected]> wrote:
>
>> It is a helper method called by the main serialize method; it shouldn't
>> actually be public at all.  Note that as an example, this code is good for
>> understanding the interactions, but it is not a good example for code
>> structure or performance; I'm working on refactoring it.
>>
>> JVS
>>
>> ________________________________________
>> From: Sanjit Jhala [[email protected]]
>> Sent: Thursday, May 27, 2010 9:29 PM
>> To: [email protected]
>> Subject: Re: SerDe and Rows
>>
>> Btw, what's the purpose of the alternative serialize API in
>> HBaseSerDe.java?
>> public static boolean serialize(ByteStream.Output out, Object obj,
>> ObjectInspector objInspector, byte[] separators, int level, Text
>> nullSequence, boolean escaped, byte escapeChar, boolean[] needsEscape)
>>
>> It doesn't look like this API is part of the SerDe interface and I'm
>> wondering where it gets called from?
>>
>> -Sanjit
>>
>> On Wed, May 26, 2010 at 9:38 AM, Sanjit Jhala <[email protected]> wrote:
>> Yes, I think you can do this as long as all inserts follow the scheme you
>> mention. In fact, you can even think of having point-in-time row versioning
>> defined at some timestamp T as the collection of the latest versions of
>> cells with timestamps <= T. As you rightly point out, it's all about the use
>> case :)
>>
>> -Sanjit
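
The point-in-time definition in the message above (for each cell, the latest version with a timestamp <= T) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `PointInTimeSnapshot`, the `CellVersion` type, and the `family:qualifier` column strings are hypothetical and are not taken from Hive, HBase, or Hypertable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; none of these names come from Hive or HBase.
class PointInTimeSnapshot {

    /** One stored version of a cell: qualified column, write timestamp, value. */
    static final class CellVersion {
        final String column;
        final long timestamp;
        final String value;

        CellVersion(String column, long timestamp, String value) {
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** For each column, keeps the value of the newest version with timestamp <= t. */
    static Map<String, String> snapshotAt(List<CellVersion> versions, long t) {
        Map<String, CellVersion> newest = new TreeMap<>();
        for (CellVersion v : versions) {
            if (v.timestamp > t) {
                continue; // written after the snapshot point, so not visible
            }
            CellVersion current = newest.get(v.column);
            if (current == null || v.timestamp > current.timestamp) {
                newest.put(v.column, v);
            }
        }
        // Project the surviving versions down to plain column -> value pairs.
        Map<String, String> row = new TreeMap<>();
        for (CellVersion v : newest.values()) {
            row.put(v.column, v.value);
        }
        return row;
    }

    public static void main(String[] args) {
        List<CellVersion> versions = new ArrayList<>();
        versions.add(new CellVersion("cf:a", 10L, "a@10"));
        versions.add(new CellVersion("cf:a", 20L, "a@20"));
        versions.add(new CellVersion("cf:b", 15L, "b@15"));
        // At T = 15 the later write to cf:a (timestamp 20) is excluded.
        System.out.println(snapshotAt(versions, 15L));
    }
}
```

A column whose every version is newer than T simply does not appear in the snapshot, which matches the idea that the row "as of T" contains only cells that existed at that time.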
>>
>>
>>
>> On Tue, May 25, 2010 at 7:39 PM, John Sichi <[email protected]> wrote:
>> Sorry for my slow response; answers below.
>>
>> On May 20, 2010, at 5:23 PM, Sanjit Jhala wrote:
>>
>> > Thanks John, that does look quite interesting. It looks like, in addition
>> > to containing a bunch of cells, the row class needs to provide some
>> > mechanism (e.g. a map) to efficiently look up the cell corresponding to a
>> > given qualified column (i.e. column family + qualifier). In the case where
>> > a Hive column matches an entire column family, do you use this same map,
>> > using the property that the column family is a prefix of the map key, or
>> > is there an additional map that maps the column family to a set of
>> > qualifiers or directly to a set of cells?
>>
>> There is a separate map (LazyHBaseCellMap).  LazyHBaseRow instantiates
>> this for Hive column values which correspond to HBase column families.
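
The idea of a separate per-family map can be sketched like this. It is a minimal illustration only, not the actual LazyHBaseRow or LazyHBaseCellMap code: the class `FamilyMapSketch`, its methods, and the `family:qualifier` key format are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; this is not how Hive's LazyHBaseRow is implemented.
class FamilyMapSketch {
    // Point lookups: "family:qualifier" -> value.
    private final Map<String, String> cells = new HashMap<>();
    // Separate maps, built on demand: family -> (qualifier -> value).
    private final Map<String, Map<String, String>> familyMaps = new HashMap<>();

    void put(String family, String qualifier, String value) {
        cells.put(family + ":" + qualifier, value);
        familyMaps.remove(family); // drop any stale per-family view
    }

    /** Lookup for a Hive column mapped to a single qualified HBase column. */
    String getCell(String family, String qualifier) {
        return cells.get(family + ":" + qualifier);
    }

    /** Lookup for a Hive column mapped to an entire column family. */
    Map<String, String> getFamily(String family) {
        return familyMaps.computeIfAbsent(family, f -> {
            String prefix = f + ":";
            Map<String, String> view = new TreeMap<>();
            for (Map.Entry<String, String> e : cells.entrySet()) {
                if (e.getKey().startsWith(prefix)) {
                    view.put(e.getKey().substring(prefix.length()), e.getValue());
                }
            }
            return view;
        });
    }

    public static void main(String[] args) {
        FamilyMapSketch row = new FamilyMapSketch();
        row.put("cf", "a", "1");
        row.put("cf", "b", "2");
        row.put("other", "a", "3");
        System.out.println(row.getCell("cf", "a"));
        System.out.println(row.getFamily("cf"));
    }
}
```

Building the per-family map only when a family-mapped column is actually read keeps the cost off rows where only individually qualified columns are accessed.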
>>
>> > The wiki also indicates that in future multiple versions of a cell could
>> > be exposed to the storage handler, since Hive can deal with non-unique
>> > rows. I can definitely see how you should be able to store non-unique Hive
>> > rows in Hypertable (since Hypertable supports multi-versioned cells).
>> > However, since the fundamental unit of storage in the BigTable design is a
>> > cell, I don't understand how you propose to map multiple cell versions
>> > back to non-unique Hive rows. Maybe you're thinking of mapping them to a
>> > single Hive row, where the columns are of the List type? And then maybe
>> > the query language allows you to filter by the first, last or any value in
>> > the list?
>>
>>
>> Yeah, I realized this recently when I started thinking about it again :)
>>
>> Exposing per-cell timestamps is possible, and there are a number of ways
>> to do it, including the one you mention.  But they're all unwieldy, so we
>> should probably defer them until there's a very good use case.
>>
>> A simpler scheme I'm thinking about is to map a Hive partition to a
>> particular timestamp.  Then for queries, this will specify a point in time
>> (we would need to validate that only equality predicates are used on the
>> partition key since returning multiple versions of a row isn't well-defined
>> as you correctly point out).  For inserts, all cells created would get the
>> same timestamp.  Maybe this would cover the majority of use-cases?
>>
>> JVS
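
The partition-to-timestamp scheme described above has two rules that can be sketched in code: queries may only use an equality predicate on the partition key, and an insert gives every cell it creates the same timestamp. The sketch below is hypothetical; `TimestampPartitionSketch`, `resolveSnapshot`, and `stampCells` are invented names and do not correspond to any Hive or HBase API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the names below are not Hive or HBase APIs.
class TimestampPartitionSketch {

    /** Accepts only an equality predicate on the timestamp partition key. */
    static long resolveSnapshot(String operator, long literal) {
        if (!"=".equals(operator)) {
            // A range would select several versions of a row, which is not well-defined.
            throw new IllegalArgumentException(
                "only equality is allowed on the timestamp partition key, got: " + operator);
        }
        return literal;
    }

    /** Gives every cell created by one insert the partition's timestamp. */
    static Map<String, Long> stampCells(Iterable<String> columns, long partitionTimestamp) {
        Map<String, Long> stamped = new LinkedHashMap<>();
        for (String column : columns) {
            stamped.put(column, partitionTimestamp);
        }
        return stamped;
    }

    public static void main(String[] args) {
        long t = resolveSnapshot("=", 1274832000L);
        System.out.println(stampCells(Arrays.asList("cf:a", "cf:b"), t));
    }
}
```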
>>
>>
>>
>>
>
