2014 notes

Andrew Purtell Mon, 19 May 2014 06:32:39 -0700

So if I can summarize this thread so far, we are going to try and hammer out a
types encoding spec agreeable to HBase, Phoenix, and Kite alike? As opposed to
selecting a particular implementation today as both spec and reference
implementation. Is that correct?


If so, that sounds like a promising direction. The HBase types library has the 
flexibility, if I understand Nick correctly, to accommodate whatever is agreed 
upon, and we could then provide a reference implementation as a service for
HBase users (or anyone), but there would be no strings attached: multiple
implementations of the spec would interoperate by definition.


> On May 19, 2014, at 3:20 AM, Nick Dimiduk <[email protected]> wrote:
> 
> On Thu, May 15, 2014 at 9:32 AM, James Taylor <[email protected]> wrote:
> 
>> @Nick - I like the abstraction of the DataType, but that doesn't solve the
>> problem for non Java usage.
> 
> 
> That's true. It's very much a Java construct. Likewise, Struct only codes
> for semantics; there's no encoding defined there. For correct
> multi-language support, we'll need to define these semantics the same way
> we do the encoding details so that implementations can reproduce them
> faithfully.
> 
>> I'm also a bit worried that it might become a bottleneck for implementors
>> of the serialization spec as there are many different platform specific
>> operations that will likely be done on the row key. We can try to get
>> everything necessary in the DataType interface, but I suspect that
>> implementors will need to go under-the-covers at times (rather than waiting
>> for another release of the module that defines the DataType interface) -
>> might become a bottleneck.
> 
> Time will tell. DataType is just an interface, after all. If there are
> things it's missing (as there surely are, for Phoenix...), it'll need to be
> extended locally until these features can be pushed down into HBase. HBase
> release managers have been faithful to the monthly release train, so I
> think in practice dependent projects won't have to wait long. I'm content
> to take this on a case-by-case basis and watch for a trend. Do you have an
> alternative idea?
> 
>> On Wed, May 14, 2014 at 5:17 PM, Nick Dimiduk <[email protected]> wrote:
>> 
>>> On Tue, May 13, 2014 at 3:35 PM, Ryan Blue <[email protected]> wrote:
>>> 
>>> 
>>>> I think there's a little confusion in what we are trying to accomplish.
>>>> What I want to do is to write a minimal specification for how to store a
>>>> set of types. I'm not trying to leave much flexibility, what I want is
>>>> clarity and simplicity.
>>> 
>>> This is admirable and was my initial goal as well. The trouble is, you
>>> cannot please everyone, current users and new. So, we decided it was better
>>> to provide a pluggable framework for extension + some basic implementations
>>> than to implement a closed system.
>>> 
>>>> This is similar to OrderedBytes work, but a subset of it. A good example is
>>>> that while it's possible to use different encodings (avro, protobuf,
>>>> thrift, ...) it isn't practical for an application to support all of those
>>>> encodings. So for interoperability between Kite, Phoenix, and others, I
>>>> want a set of requirements that is as small as possible.
>>> 
>>> Minimal is good. The surface area of o.a.h.h.types is as large as it is
>>> because there was always "just one more" type to support or encoding to
>>> provide.
>>> 
>>>> To make the requirements small, I used off-the-shelf protobuf [1] plus a
>>>> small set of memcmp encodings: ints, floats, and binary. That way, we don't
>>>> have to talk about how to make a memcmp Date in bytes, for example. A Date
>>>> is an int, which we know how to encode, and we can agree separately on how
>>>> a Date is represented (e.g., Julian vs unix epoch). [2] The same applies
>>>> to binary, where the encoding handles sorting and nulls, but not charsets.
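
A minimal sketch of what "memcmp encodings" for ints and floats can look like, added here for illustration. It assumes fixed-width big-endian output with the sign bit adjusted so that unsigned byte comparison matches numeric order. The class and method names are hypothetical, and this is neither the OrderedBytes wire format nor the spec under discussion in this thread.

```java
import java.nio.ByteBuffer;

// Illustrative only: not OrderedBytes and not any agreed-upon spec.
final class MemcmpSketch {

    // Big-endian two's complement with the sign bit flipped, so negative
    // values sort before non-negative ones under unsigned byte comparison.
    static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(4).putInt(v ^ Integer.MIN_VALUE).array();
    }

    static int decodeInt(byte[] b) {
        return ByteBuffer.wrap(b).getInt() ^ Integer.MIN_VALUE;
    }

    // IEEE 754 bits: flip every bit of a negative value, and only the
    // sign bit of a non-negative value.
    static byte[] encodeDouble(double d) {
        long bits = Double.doubleToLongBits(d);
        bits ^= (bits >> 63) | Long.MIN_VALUE;
        return ByteBuffer.allocate(8).putLong(bits).array();
    }

    // Unsigned lexicographic comparison, which is what memcmp does.
    static int memcmp(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xff) - (b[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // A negative result means the first argument sorts first.
        System.out.println(memcmp(encodeInt(-5), encodeInt(3)));
        System.out.println(memcmp(encodeDouble(-2.0), encodeDouble(-1.0)));
    }
}
```

Under this scheme a Date needs no encoding of its own: once the representation is agreed separately (for example days since the unix epoch, which is an assumption here), the resulting int goes through `encodeInt`.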
>>> 
>>> I think you should focus on the primitives you want to support. The
>>> compound type stuff (ie, "rowkey encodings") is a can of worms because you
>>> need to support existing users, new users, novice users, and advanced
>>> users. Hence the interop between the DataType interface and the Struct
>>> classes. These work together to support all of these use-cases with the
>>> same basic code. For example, the protobuf encoding of position|wire-type +
>>> encoded value is easily implemented using Struct.
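
For readers unfamiliar with the "position|wire-type" prefix mentioned above, here is a small sketch of the standard protobuf field key: the field number shifted left by three bits, OR-ed with the wire type, and written as a base-128 varint. The class and method names are hypothetical, and the sketch deliberately does not use the HBase Struct class; it only shows the bytes such an implementation would have to produce.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: hand-rolled protobuf field key, no protobuf library used.
final class ProtoKeySketch {

    // Field key = (field number << 3) | wire type, encoded as a varint.
    static byte[] fieldKey(int fieldNumber, int wireType) {
        return varint(((long) fieldNumber << 3) | wireType);
    }

    // Base-128 varint: 7 payload bits per byte, least significant group
    // first, high bit set on every byte except the last.
    static byte[] varint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Field 1 with wire type 0 (varint), field 2 with wire type 2
        // (length-delimited).
        System.out.printf("%02x%n", fieldKey(1, 0)[0]);
        System.out.printf("%02x%n", fieldKey(2, 2)[0]);
    }
}
```

The encoded value follows the key. For length-delimited fields (wire type 2) protobuf also writes a varint length before the payload, which this sketch leaves out.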
>>> 
>>> I firmly believe that we cannot dictate rowkey composition. Applications,
>>> however, are free to implement their own. By using the common DataType
>>> interface, they can all interoperate.
>>> 
>>>> This is the largest reason why I didn't include OrderedBytes directly in
>>>> the spec. For example, OB includes a varint that I don't think is needed. I
>>>> don't object to its inclusion in OB, but I think it isn't a necessary
>>>> requirement for implementing this spec.
>>> 
>>> Again, the surface area is as it is because of community consensus during
>>> the first phase of implementation. That consensus disagrees with you.
>>> 
>>>> I think there are 3 things to clear up:
>>>> 1. What types from OB are not included, and why?
>>>> 2. Why not use OB-style structs?
>>>> 3. Why choose protobuf for complex records?
>>>> 
>>>> Does that sound like a reasonable direction to head with this discussion?
>>> 
>>> Yes, sounds great!
>>> 
>>>> As far as the DataType API, I think that works great with what I'm trying
>>>> to do. We'd build a DataType implementation for the encoding and the API
>>>> will let applications handle the underlying encoding. And other encoding
>>>> strategies can be swapped in as well, if we want to address shortcomings in
>>>> this one, or have another for a different use case.
>>> 
>>> I'm quite pleased to hear that. Applications like Kite, Phoenix, Kiji are
>>> the target audience of the DataType API.
>>> 
>>> Thank you for picking back up this baton. It's sat for too long.
>>> 
>>> -n
>>> 
>>>> On 05/13/2014 02:33 PM, Nick Dimiduk wrote:
>>>> 
>>>>> Breaking off hackathon thread.
>>>>> 
>>>>> The conversation around HBASE-8089 concluded with two points:
>>>>>  - HBase should provide support for order-preserving encodings while
>>>>> not dropping support for the existing encoding formats.
>>>>>  - HBase is not in the business of schema management; that is a
>>>>> responsibility left to application developers.
>>>>> 
>>>>> To handle the first point, OrderedBytes is provided. For supporting
>>>>> the second, the DataType API is introduced. By introducing this layer
>>>>> above specific encoding formats, it gives us a hook for plugging in
>>>>> different implementations and for helper utilities to ship with HBase,
>>>>> such as HBASE-10091.
>>>>> 
>>>>> Things get fuzzy around complex data types: pojos, compound rowkeys (a
>>>>> special case of pojo), maps/dicts, and lists/arrays. These types are
>>>>> composed of other types and have different requirements based on where
>>>>> in the schema they're used. Again, by falling back on the DataType API,
>>>>> we give application developers an "out" for doing what makes the most
>>>>> sense for them.
>>>>> 
>>>>> For compound rowkeys, the Struct class is designed to fill in this gap,
>>>>> sitting between data encoding and schema expression. It gives the
>>>>> application implementer, the person managing the schema, enough
>>>>> flexibility to express the key encoding in terms of the component types.
>>>>> These components are not limited to the simple primitives already
>>>>> defined, but any DataType implementation. Order preservation is likely
>>>>> important here.
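
To illustrate the idea of expressing a compound rowkey in terms of its component types, here is a small self-contained sketch. `FieldCodec` is a hypothetical stand-in for a DataType-like abstraction and `CompoundKeySketch` a stand-in for a Struct-like composition; neither is the HBase API, and the component encodings (sign-flipped int, NUL-terminated UTF-8) are assumptions chosen only so that byte order matches tuple order.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a DataType-like abstraction; not the HBase interface.
interface FieldCodec<T> {
    byte[] encode(T value);
}

final class CompoundKeySketch {

    // Fixed-width, sign-flipped big-endian int: sorts numerically as bytes.
    static final FieldCodec<Integer> INT =
        v -> ByteBuffer.allocate(4).putInt(v ^ Integer.MIN_VALUE).array();

    // UTF-8 plus a 0x00 terminator so a prefix sorts before its extensions.
    // Assumes the string itself contains no U+0000.
    static final FieldCodec<String> TEXT = s -> {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(utf8, utf8.length + 1);
    };

    // A two-component key (id, name): concatenate the encodings in order.
    static byte[] key(int id, String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] a = INT.encode(id);
        byte[] b = TEXT.encode(name);
        out.write(a, 0, a.length);
        out.write(b, 0, b.length);
        return out.toByteArray();
    }

    // Unsigned lexicographic comparison of two encoded keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xff) - (b[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // A negative result means the first key sorts first.
        System.out.println(compare(key(1, "b"), key(2, "a")));
        System.out.println(compare(key(1, "a"), key(1, "ab")));
    }
}
```

Because each component is order-preserving on its own and self-delimiting, the concatenation sorts the same way the (id, name) tuples do. Swapping in a different `FieldCodec` changes the bytes without changing how the key is composed.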
>>>>> 
>>>>> For arrays/lists, there's no implementation yet, but you can see how it
>>>>> might be done if you have a look at Struct. Order preservation may or
>>>>> may not be important for arrays/lists.
>>>>> 
>>>>> The situation for maps/dicts is similar to arrays/lists. The one
>>>>> complication is the case where you want to map to a column family. How
>>>>> can these APIs support this thing?
>>>>> 
>>>>> Pojos are a little more complicated. Probably Struct is sufficient for
>>>>> basic cases, but it doesn't support nice features like versioning --
>>>>> these are sacrificed in favor of order preservation. Luckily, there are
>>>>> plenty of tools out there for this already: Avro, MessagePack, Protobuf,
>>>>> Thrift, &c. There's no need to reinvent the wheel here. Application
>>>>> developers can implement the DataType API backed by their management
>>>>> tool of choice. I created HBASE-11161 and will post a patch shortly.
>>>>> 
>>>>> Specific comments about the Hackathon notes inline.
>>>>> 
>>>>> Thanks,
>>>>> Nick
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>

Re: [common type encoding breakout] Re: HBase Hackathon @ Salesforce 05/06/2014 notes
