On Mon, May 19, 2014 at 6:31 AM, Andrew Purtell <[email protected]> wrote:

> So if I can summarize this thread so far, we are going to try and hammer
> out a types encoding spec agreeable to HBase, Phoenix, and Kite alike? As
> opposed to select a particular implementation today as both spec and
> reference implementation. Is that correct?

That is the goal. We chatted and posted notes from the discussion last week,
and I believe we only have a few items to iron out now (how to encode and
handle "complex primitives" like dates and decimals).

> If so, that sounds like a promising direction. The HBase types library has
> the flexibility, if I understand Nick correctly, to accommodate whatever is
> agreed upon and we could then provide a reference implementation as a
> service for HBase users (or anyone) but there would be no strings attached,
> multiple implementations of the spec would interoperate by definition.

I'll be working on a prototype in the next few weeks integrating Phoenix with
a slice of the new proposed encodings and trying to use the DataType API.

> On May 19, 2014, at 3:20 AM, Nick Dimiduk <[email protected]> wrote:
>
> > On Thu, May 15, 2014 at 9:32 AM, James Taylor <[email protected]> wrote:
> >
> >> @Nick - I like the abstraction of the DataType, but that doesn't solve the
> >> problem for non-Java usage.
> >
> > That's true. It's very much a Java construct. Likewise, Struct only codes
> > for semantics; there's no encoding defined there. For correct
> > multi-language support, we'll need to define these semantics the same way
> > we do the encoding details so that implementations can reproduce them
> > faithfully.
> >
> >> I'm also a bit worried that it might become a bottleneck for implementors
> >> of the serialization spec as there are many different platform-specific
> >> operations that will likely be done on the row key.
> >> We can try to get everything necessary in the DataType interface, but I
> >> suspect that implementors will need to go under-the-covers at times
> >> (rather than waiting for another release of the module that defines the
> >> DataType interface) - might become a bottleneck.
> >
> > Time will tell. DataType is just an interface, after all. If there are
> > things it's missing (as there surely are, for Phoenix...), it'll need to be
> > extended locally until these features can be pushed down into HBase. HBase
> > release managers have been faithful to the monthly release train, so I
> > think in practice dependent projects won't have to wait long. I'm content
> > to take this on a case-by-case basis and watch for a trend. Do you have an
> > alternative idea?
> >
> >> On Wed, May 14, 2014 at 5:17 PM, Nick Dimiduk <[email protected]> wrote:
> >>
> >>> On Tue, May 13, 2014 at 3:35 PM, Ryan Blue <[email protected]> wrote:
> >>>
> >>>> I think there's a little confusion in what we are trying to accomplish.
> >>>> What I want to do is to write a minimal specification for how to store a
> >>>> set of types. I'm not trying to leave much flexibility, what I want is
> >>>> clarity and simplicity.
> >>>
> >>> This is admirable and was my initial goal as well. The trouble is, you
> >>> cannot please everyone, current users and new. So, we decided it was better
> >>> to provide a pluggable framework for extension + some basic implementations
> >>> than to implement a closed system.
> >>>
> >>>> This is similar to OrderedBytes work, but a subset of it. A good example is
> >>>> that while it's possible to use different encodings (avro, protobuf,
> >>>> thrift, ...) it isn't practical for an application to support all of those
> >>>> encodings. So for interoperability between Kite, Phoenix, and others, I
> >>>> want a set of requirements that is as small as possible.
> >>>
> >>> Minimal is good.
> >>> The surface area of o.a.h.h.types is as large as it is
> >>> because there was always "just one more" type to support or encoding to
> >>> provide.
> >>>
> >>>> To make the requirements small, I used off-the-shelf protobuf [1] plus a
> >>>> small set of memcmp encodings: ints, floats, and binary. That way, we don't
> >>>> have to talk about how to make a memcmp Date in bytes, for example. A Date
> >>>> is an int, which we know how to encode, and we can agree separately on how
> >>>> a Date is represented (e.g., Julian vs Unix epoch). [2] The same applies
> >>>> to binary, where the encoding handles sorting and nulls, but not charsets.
> >>>
> >>> I think you should focus on the primitives you want to support. The
> >>> compound type stuff (ie, "rowkey encodings") is a can of worms because you
> >>> need to support existing users, new users, novice users, and advanced
> >>> users. Hence the interop between the DataType interface and the Struct
> >>> classes. These work together to support all of these use-cases with the
> >>> same basic code. For example, the protobuf encoding of position|wire-type +
> >>> encoded value is easily implemented using Struct.
> >>>
> >>> I firmly believe that we cannot dictate rowkey composition. Applications,
> >>> however, are free to implement their own. By using the common DataType
> >>> interface, they can all interoperate.
> >>>
> >>>> This is the largest reason why I didn't include OrderedBytes directly in
> >>>> the spec. For example, OB includes a varint that I don't think is needed. I
> >>>> don't object to its inclusion in OB, but I think it isn't a necessary
> >>>> requirement for implementing this spec.
> >>>
> >>> Again, the surface area is as it is because of community consensus during
> >>> the first phase of implementation. That consensus disagrees with you.
> >>>
> >>>> I think there are 3 things to clear up:
> >>>> 1. What types from OB are not included, and why?
> >>>> 2. Why not use OB-style structs?
> >>>> 3. Why choose protobuf for complex records?
> >>>>
> >>>> Does that sound like a reasonable direction to head with this discussion?
> >>>
> >>> Yes, sounds great!
> >>>
> >>>> As far as the DataType API, I think that works great with what I'm trying
> >>>> to do. We'd build a DataType implementation for the encoding and the API
> >>>> will let applications handle the underlying encoding. And other encoding
> >>>> strategies can be swapped in as well, if we want to address shortcomings in
> >>>> this one, or have another for a different use case.
> >>>
> >>> I'm quite pleased to hear that. Applications like Kite, Phoenix, Kiji are
> >>> the target audience of the DataType API.
> >>>
> >>> Thank you for picking back up this baton. It's sat for too long.
> >>>
> >>> -n
> >>>
> >>>> On 05/13/2014 02:33 PM, Nick Dimiduk wrote:
> >>>>
> >>>>> Breaking off hackathon thread.
> >>>>>
> >>>>> The conversation around HBASE-8089 concluded with two points:
> >>>>> - HBase should provide support for order-preserving encodings while
> >>>>>   not dropping support for the existing encoding formats.
> >>>>> - HBase is not in the business of schema management; that is a
> >>>>>   responsibility left to application developers.
> >>>>>
> >>>>> To handle the first point, OrderedBytes is provided. To support the
> >>>>> second, the DataType API is introduced. By introducing this layer
> >>>>> above specific encoding formats, it gives us a hook for plugging in
> >>>>> different implementations and for helper utilities to ship with HBase,
> >>>>> such as HBASE-10091.
> >>>>>
> >>>>> Things get fuzzy around complex data types: pojos, compound rowkeys (a
> >>>>> special case of pojo), maps/dicts, and lists/arrays. These types are
> >>>>> composed of other types and have different requirements based on where
> >>>>> in the schema they're used.
> >>>>> Again, by falling back on the DataType API,
> >>>>> we give application developers an "out" for doing what makes the most
> >>>>> sense for them.
> >>>>>
> >>>>> For compound rowkeys, the Struct class is designed to fill in this gap,
> >>>>> sitting between data encoding and schema expression. It gives the
> >>>>> application implementer, the person managing the schema, enough
> >>>>> flexibility to express the key encoding in terms of the component types.
> >>>>> These components are not limited to the simple primitives already
> >>>>> defined, but can be any DataType implementation. Order preservation is
> >>>>> likely important here.
> >>>>>
> >>>>> For arrays/lists, there's no implementation yet, but you can see how it
> >>>>> might be done if you have a look at Struct. Order preservation may or
> >>>>> may not be important for arrays/lists.
> >>>>>
> >>>>> The situation for maps/dicts is similar to arrays/lists. The one
> >>>>> complication is the case where you want to map to a column family. How
> >>>>> can these APIs support this thing?
> >>>>>
> >>>>> Pojos are a little more complicated. Probably Struct is sufficient for
> >>>>> basic cases, but it doesn't support nice features like versioning --
> >>>>> these are sacrificed in favor of order preservation. Luckily, there are
> >>>>> plenty of tools out there for this already: Avro, MessagePack, Protobuf,
> >>>>> Thrift, &c. There's no need to reinvent the wheel here. Application
> >>>>> developers can implement the DataType API backed by their management
> >>>>> tool of choice. I created HBASE-11161 and will post a patch shortly.
> >>>>>
> >>>>> Specific comments about the Hackathon notes inline.
> >>>>>
> >>>>> Thanks,
> >>>>> Nick
> >>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Cloudera, Inc.

--
// Jonathan Hsieh (shay)
// HBase Tech Lead, Software Engineer, Cloudera
// [email protected]
// @jmhsieh
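
For readers following the thread, here is a minimal Python sketch of the idea Ryan describes: a memcmp-comparable int encoding, with a Date treated as "just an int" under a separately agreed convention. This is not the OrderedBytes wire format or the proposed spec. The function names are hypothetical, and the choice of "days since the Unix epoch" for dates is an assumption made only for illustration.

```python
import struct
from datetime import date

SIGN_BIT = 0x80000000

def encode_int32(value: int) -> bytes:
    """Encode a signed 32-bit int so that bytewise order equals numeric order."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value out of int32 range")
    # Flipping the sign bit maps negatives below positives; big-endian
    # layout puts the most significant byte first for memcmp.
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ SIGN_BIT)

def decode_int32(data: bytes) -> int:
    """Invert encode_int32."""
    (raw,) = struct.unpack(">I", data)
    raw ^= SIGN_BIT
    return raw - 2**32 if raw >= 2**31 else raw

def encode_date(d: date) -> bytes:
    """Assumed convention: a date is the int 'days since 1970-01-01'."""
    return encode_int32((d - date(1970, 1, 1)).days)
```

Because the byte strings sort the same way the numbers do, a rowkey built from them scans in numeric (or chronological) order. Which epoch a Date uses stays a separate agreement layered on top of the int encoding.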
