John, there's some logic in the helper serialize method to serialize lists and structs. Is this used currently? I was under the impression that maps and primitives are the only types currently supported by the connector.
-Sanjit

On Thu, May 27, 2010 at 9:53 PM, Sanjit Jhala <[email protected]> wrote:

> I completely missed that. Being public threw me off.
>
> -Sanjit
>
>
> On Thu, May 27, 2010 at 9:43 PM, John Sichi <[email protected]> wrote:
>
>> It is a helper method called by the main serialize method; it shouldn't
>> actually be public at all. Note that as an example, this code is good for
>> understanding the interactions, but it is not a good example for code
>> structure or performance; I'm working on refactoring it.
>>
>> JVS
>>
>> ________________________________________
>> From: Sanjit Jhala [[email protected]]
>> Sent: Thursday, May 27, 2010 9:29 PM
>> To: [email protected]
>> Subject: Re: SerDe and Rows
>>
>> Btw what's the purpose of the alternative serialize API in
>> HBaseSerDe.java?
>>
>> public static boolean serialize(ByteStream.Output out, Object obj,
>> ObjectInspector objInspector, byte[] separators, int level, Text
>> nullSequence, boolean escaped, byte escapeChar, boolean[] needsEscape)
>>
>> It doesn't look like this API is part of the SerDe interface and I'm
>> wondering where it gets called from?
>>
>> -Sanjit
>>
>> On Wed, May 26, 2010 at 9:38 AM, Sanjit Jhala <[email protected]> wrote:
>> Yes, I think you can do this as long as all inserts follow the scheme you
>> mention. In fact you can even think of having point-in-time row versioning
>> defined at some timestamp T as the collection of the latest versions of
>> cells with timestamps <= T. As you rightly point out, it's all about the
>> use case :)
>>
>> -Sanjit
>>
>>
>> On Tue, May 25, 2010 at 7:39 PM, John Sichi <[email protected]> wrote:
>> Sorry for my slow response; answers below.
>>
>> On May 20, 2010, at 5:23 PM, Sanjit Jhala wrote:
>>
>> > Thanks John, that does look quite interesting. It looks like in
>> > addition to containing a bunch of cells, the row class needs to
>> > provide some mechanism (e.g. a map) to efficiently look up the cell
>> > corresponding to a given qualified column (i.e. column family +
>> > qualifier). In the case where a Hive column matches an entire column
>> > family, do you use this same map using the property that the column
>> > family is a prefix of the map key or is there an additional map that
>> > maps the column family to a set of qualifiers or directly to a set of
>> > cells?
>>
>> There is a separate map (LazyHBaseCellMap). LazyHBaseRow instantiates
>> this for Hive column values which correspond to HBase column families.
>>
>> > The wiki also indicates that in future multiple versions of a cell
>> > could be exposed to the storage handler since Hive can deal with
>> > non-unique rows. I can definitely see how you should be able to store
>> > non-unique Hive rows in Hypertable (since Hypertable supports
>> > multi-versioned cells), however since the fundamental unit of storage
>> > in the BigTable design is a cell, I don't understand how you propose
>> > to map multiple cell versions back to non-unique Hive rows. Maybe
>> > you're thinking of mapping them to a single Hive row, where the
>> > columns are of the List type? And then maybe the query language allows
>> > you to filter by the first, last or any value in the list?
>>
>> Yeah, I realized this recently when I started thinking about it again :)
>>
>> Exposing per-cell timestamps is possible, and there are a number of ways
>> to do it, including the one you mention. But they're all unwieldy, so we
>> should probably defer them until there's a very good use case.
>>
>> A simpler scheme I'm thinking about is to map a Hive partition to a
>> particular timestamp. Then for queries, this will specify a point-in-time
>> (we would need to validate that only equality predicates are used on the
>> partition key since returning multiple versions of a row isn't
>> well-defined as you correctly point out). For inserts, all cells created
>> would get the same timestamp. Maybe this would cover the majority of
>> use-cases?
>>
>> JVS
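
For reference, the point-in-time view described in the thread (for a timestamp T, the latest version of each cell with timestamp <= T) could be sketched roughly as below. This is only an illustration under stated assumptions: the Cell class, the snapshotAt method, and the "family:qualifier" column strings are hypothetical names made up for the example, and none of this is taken from HBaseSerDe, LazyHBaseRow, or the Hypertable connector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a point-in-time row view; not connector code.
class PointInTimeView {

    // Hypothetical cell: one versioned value for a qualified column.
    static final class Cell {
        final String column;   // assumed "family:qualifier" form
        final long timestamp;
        final String value;

        Cell(String column, long timestamp, String value) {
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // For each column, keep the newest cell whose timestamp is <= t.
    static Map<String, Cell> snapshotAt(List<Cell> cells, long t) {
        Map<String, Cell> latest = new TreeMap<>();
        for (Cell c : cells) {
            if (c.timestamp > t) {
                continue; // written after the requested point in time
            }
            Cell seen = latest.get(c.column);
            if (seen == null || c.timestamp > seen.timestamp) {
                latest.put(c.column, c);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("cf:a", 10L, "a1"));
        cells.add(new Cell("cf:a", 20L, "a2"));
        cells.add(new Cell("cf:a", 30L, "a3"));
        cells.add(new Cell("cf:b", 15L, "b1"));
        cells.add(new Cell("cf:c", 40L, "c1"));

        // View of the row as of timestamp 25.
        for (Cell c : snapshotAt(cells, 25L).values()) {
            System.out.println(c.column + " @" + c.timestamp + " = " + c.value);
        }
    }
}
```

With the partition-per-timestamp scheme, every cell written by one insert would carry the same timestamp, so a query on that partition key would correspond to a single call like snapshotAt(cells, T) rather than to several versions of the same row.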
