2014 notes

James Taylor Sun, 08 Jun 2014 17:25:25 -0700

Couple items I didn't see mentioned, but I think would be good to get
clarity on:
* variable length DECIMAL (Phoenix relies on this)
* ARRAY type (Phoenix supports this - arrays of fixed width data are
just concatenated together, while arrays of variable length data are
run-length-encoded with a double null byte terminator followed by an
index of the start position of each element)
* Optional use of a memcmp-comparable composite row key as the value of a
KeyValue (I think this makes things easier).
Thanks,
James
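
For readers following along, here is a minimal Python sketch of the two
array layouts described above. It is not Phoenix's actual byte format: the
single-null separator between elements, the 4-byte big-endian offsets, the
trailing element count, and all function names are assumptions made for
illustration, and the run-length encoding mentioned above is left out.

```python
import struct

def encode_fixed(values, width):
    """Fixed-width elements: just concatenate them (unsigned big-endian here)."""
    return b"".join(v.to_bytes(width, "big") for v in values)

def decode_fixed(data, width):
    """Split the buffer back into width-sized elements."""
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]

def encode_variable(elements):
    """Variable-length elements (hypothetical layout):
    null-separated elements, a double null terminator, then an index of
    start offsets, then the element count."""
    offsets, pos = [], 0
    for e in elements:
        if b"\x00" in e:
            raise ValueError("this sketch assumes elements contain no null bytes")
        offsets.append(pos)
        pos += len(e) + 1  # element plus its separator
    body = b"\x00".join(elements) + b"\x00\x00"
    index = b"".join(struct.pack(">I", o) for o in offsets)
    return body + index + struct.pack(">I", len(elements))

def element_at(data, i):
    """Random access to element i using the offset index."""
    count = struct.unpack(">I", data[-4:])[0]
    if not 0 <= i < count:
        raise IndexError(i)
    index_start = len(data) - 4 - 4 * count
    start = struct.unpack_from(">I", data, index_start + 4 * i)[0]
    if i + 1 < count:
        # The next element's offset, minus the one-byte separator.
        end = struct.unpack_from(">I", data, index_start + 4 * (i + 1))[0] - 1
    else:
        # Last element ends just before the double null terminator.
        end = index_start - 2
    return data[start:end]

def decode_variable(data):
    count = struct.unpack(">I", data[-4:])[0]
    return [element_at(data, i) for i in range(count)]
```

The offset index is what makes element lookup possible without scanning
the whole value, which is presumably the reason for storing it.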


On Tue, May 20, 2014 at 3:40 AM, Nick Dimiduk <[email protected]> wrote:
> That's correct Andy. We're locking down the "default" primitive type
> implementations going forward, while maintaining a flexible API such that we
> can support existing users who want to migrate to the applicable new
> features without rewriting existing data. Obviously some of those features
> will depend on the new encoding semantics, but I think we can offer a net
> improvement even for existing applications.
>
>
> On Mon, May 19, 2014 at 6:31 AM, Andrew Purtell <[email protected]>
> wrote:
>>
>> So if I can summarize this thread so far, we are going to try and hammer
>> out a types encoding spec agreeable to HBase, Phoenix, and Kite alike? As
>> opposed to select a particular implementation today as both spec and
>> reference implementation. Is that correct?
>>
>> If so, that sounds like a promising direction. The HBase types library has
>> the flexibility, if I understand Nick correctly, to accommodate whatever is
>> agreed upon and we could then provide a reference implementation as a
>> service for HBase users (or anyone) but there would be no strings attached,
>> multiple implementations of the spec would interoperate by definition.
>>
>>
>> > On May 19, 2014, at 3:20 AM, Nick Dimiduk <[email protected]> wrote:
>> >
>> > On Thu, May 15, 2014 at 9:32 AM, James Taylor
>> > <[email protected]> wrote:
>> >
>> >> @Nick - I like the abstraction of the DataType, but that doesn't solve
>> >> the problem for non Java usage.
>> >
>> >
>> > That's true. It's very much a Java construct. Likewise, Struct only codes
>> > for semantics; there's no encoding defined there. For correct
>> > multi-language support, we'll need to define these semantics the same way
>> > we do the encoding details so that implementations can reproduce them
>> > faithfully.
>> >
>> >> I'm also a bit worried that it might become a bottleneck for
>> >> implementors of the serialization spec as there are many different
>> >> platform-specific operations that will likely be done on the row key.
>> >> We can try to get everything necessary in the DataType interface, but I
>> >> suspect that implementors will need to go under-the-covers at times
>> >> (rather than waiting for another release of the module that defines the
>> >> DataType interface) - might become a bottleneck.
>> >
>> > Time will tell. DataType is just an interface, after all. If there are
>> > things it's missing (as there surely are, for Phoenix...), it'll need to
>> > be extended locally until these features can be pushed down into HBase.
>> > HBase release managers have been faithful to the monthly release train,
>> > so I think in practice dependent projects won't have to wait long. I'm
>> > content to take this on a case-by-case basis and watch for a trend. Do
>> > you have an alternative idea?
>> >
>> >> On Wed, May 14, 2014 at 5:17 PM, Nick Dimiduk <[email protected]>
>> >> wrote:
>> >>
>> >>> On Tue, May 13, 2014 at 3:35 PM, Ryan Blue <[email protected]> wrote:
>> >>>
>> >>>
>> >>>> I think there's a little confusion in what we are trying to
>> >>>> accomplish. What I want to do is to write a minimal specification for
>> >>>> how to store a set of types. I'm not trying to leave much flexibility;
>> >>>> what I want is clarity and simplicity.
>> >>>
>> >>> This is admirable and was my initial goal as well. The trouble is, you
>> >>> cannot please everyone, current users and new. So, we decided it was
>> >>> better to provide a pluggable framework for extension + some basic
>> >>> implementations than to implement a closed system.
>> >>>
>> >>>> This is similar to OrderedBytes work, but a subset of it. A good
>> >>>> example is that while it's possible to use different encodings (avro,
>> >>>> protobuf, thrift, ...) it isn't practical for an application to support
>> >>>> all of those encodings. So for interoperability between Kite, Phoenix,
>> >>>> and others, I want a set of requirements that is as small as possible.
>> >>>
>> >>> Minimal is good. The surface area of o.a.h.h.types is as large as it is
>> >>> because there was always "just one more" type to support or encoding to
>> >>> provide.
>> >>>
>> >>>> To make the requirements small, I used off-the-shelf protobuf [1] plus
>> >>>> a small set of memcmp encodings: ints, floats, and binary. That way, we
>> >>>> don't have to talk about how to make a memcmp Date in bytes, for
>> >>>> example. A Date is an int, which we know how to encode, and we can
>> >>>> agree separately on how a Date is represented (e.g., Julian vs unix
>> >>>> epoch). [2] The same applies to binary, where the encoding handles
>> >>>> sorting and nulls, but not charsets.
>> >>>
>> >>> I think you should focus on the primitives you want to support. The
>> >>> compound type stuff (ie, "rowkey encodings") is a can of worms because
>> >>> you need to support existing users, new users, novice users, and advanced
>> >>> users. Hence the interop between the DataType interface and the Struct
>> >>> classes. These work together to support all of these use-cases with the
>> >>> same basic code. For example, the protobuf encoding of
>> >>> position|wire-type + encoded value is easily implemented using Struct.
>> >>>
>> >>> I firmly believe that we cannot dictate rowkey composition.
>> >>> Applications, however, are free to implement their own. By using the
>> >>> common DataType interface, they can all interoperate.
>> >>>
>> >>>> This is the largest reason why I didn't include OrderedBytes directly
>> >>>> in the spec. For example, OB includes a varint that I don't think is
>> >>>> needed. I don't object to its inclusion in OB, but I think it isn't a
>> >>>> necessary requirement for implementing this spec.
>> >>>
>> >>> Again, the surface area is as it is because of community consensus
>> >>> during the first phase of implementation. That consensus disagrees with
>> >>> you.
>> >>>
>> >>>> I think there are 3 things to clear up:
>> >>>> 1. What types from OB are not included, and why?
>> >>>> 2. Why not use OB-style structs?
>> >>>> 3. Why choose protobuf for complex records?
>> >>>>
>> >>>> Does that sound like a reasonable direction to head with this
>> >>>> discussion?
>> >>>
>> >>> Yes, sounds great!
>> >>>
>> >>>> As far as the DataType API, I think that works great with what I'm
>> >>>> trying to do. We'd build a DataType implementation for the encoding and
>> >>>> the API will let applications handle the underlying encoding. And other
>> >>>> encoding strategies can be swapped in as well, if we want to address
>> >>>> shortcomings in this one, or have another for a different use case.
>> >>>
>> >>> I'm quite pleased to hear that. Applications like Kite, Phoenix, Kiji
>> >>> are the target audience of the DataType API.
>> >>>
>> >>> Thank you for picking back up this baton. It's sat for too long.
>> >>>
>> >>> -n
>> >>>
>> >>>> On 05/13/2014 02:33 PM, Nick Dimiduk wrote:
>> >>>>
>> >>>>> Breaking off hackathon thread.
>> >>>>>
>> >>>>> The conversation around HBASE-8089 concluded with two points:
>> >>>>>  - HBase should provide support for order-preserving encodings while
>> >>>>> not dropping support for the existing encoding formats.
>> >>>>>  - HBase is not in the business of schema management; that is a
>> >>>>> responsibility left to application developers.
>> >>>>>
>> >>>>> To handle the first point, OrderedBytes is provided. For supporting
>> >>>>> the second, the DataType API is introduced. By introducing this layer
>> >>>>> above specific encoding formats, it gives us a hook for plugging in
>> >>>>> different implementations and for helper utilities to ship with HBase,
>> >>>>> such as HBASE-10091.
>> >>>>>
>> >>>>> Things get fuzzy around complex data types: pojos, compound rowkeys (a
>> >>>>> special case of pojo), maps/dicts, and lists/arrays. These types are
>> >>>>> composed of other types and have different requirements based on where
>> >>>>> in the schema they're used. Again, by falling back on the DataType API,
>> >>>>> we give application developers an "out" for doing what makes the most
>> >>>>> sense for them.
>> >>>>>
>> >>>>> For compound rowkeys, the Struct class is designed to fill in this gap,
>> >>>>> sitting between data encoding and schema expression. It gives the
>> >>>>> application implementer, the person managing the schema, enough
>> >>>>> flexibility to express the key encoding in terms of the component
>> >>>>> types. These components are not limited to the simple primitives
>> >>>>> already defined, but any DataType implementation. Order preservation is
>> >>>>> likely important here.
>> >>>>>
>> >>>>> For arrays/lists, there's no implementation yet, but you can see how it
>> >>>>> might be done if you have a look at Struct. Order preservation may or
>> >>>>> may not be important for arrays/lists.
>> >>>>>
>> >>>>> The situation for maps/dicts is similar to arrays/lists. The one
>> >>>>> complication is the case where you want to map to a column family. How
>> >>>>> can these APIs support this thing?
>> >>>>>
>> >>>>> Pojos are a little more complicated. Probably Struct is sufficient for
>> >>>>> basic cases, but it doesn't support nice features like versioning --
>> >>>>> these are sacrificed in favor of order preservation. Luckily, there are
>> >>>>> plenty of tools out there for this already: Avro, MessagePack,
>> >>>>> Protobuf, Thrift, &c. There's no need to reinvent the wheel here.
>> >>>>> Application developers can implement the DataType API backed by their
>> >>>>> management tool of choice. I created HBASE-11161 and will post a patch
>> >>>>> shortly.
>> >>>>>
>> >>>>> Specific comments about the Hackathon notes inline.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Nick
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Ryan Blue
>> >>>> Software Engineer
>> >>>> Cloudera, Inc.
>> >>
>
>
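
The "small set of memcmp encodings" for ints and floats discussed in the
quoted thread can be illustrated with a short Python sketch. This is a
common order-preserving trick (flip the sign bit of a big-endian value),
not the OrderedBytes wire format or anything the thread agreed on; the
function names and the fixed 32-bit and 64-bit widths are assumptions.

```python
import struct

def encode_int32(v):
    """Signed 32-bit int -> 4 bytes whose unsigned byte order matches numeric order."""
    raw = struct.pack(">i", v)
    # Flipping the sign bit moves negatives below non-negatives.
    return bytes([raw[0] ^ 0x80]) + raw[1:]

def decode_int32(b):
    raw = bytes([b[0] ^ 0x80]) + b[1:]
    return struct.unpack(">i", raw)[0]

def encode_float64(x):
    """IEEE 754 double -> 8 bytes that sort bytewise in numeric order (NaN not handled)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    if bits >> 63:
        bits ^= 0xFFFFFFFFFFFFFFFF  # negative: invert all bits to reverse their order
    else:
        bits ^= 1 << 63             # non-negative: set the sign bit
    return struct.pack(">Q", bits)

def encode_key(a, x):
    """Composite key: concatenating fixed-width fields keeps tuple ordering."""
    return encode_int32(a) + encode_float64(x)
```

Because every field has a fixed width, a plain bytewise comparison of two
composite keys orders them first by the int and then by the float, which
is the property a row key needs. Variable-length fields such as binary
would additionally need a terminator or escaping scheme, which this
sketch does not attempt.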

Re: [common type encoding breakout] Re: HBase Hackathon @ Salesforce 05/06/2014 notes
