A couple of items I didn't see mentioned, but that I think would be good to
get clarity on:

* variable length DECIMAL (Phoenix relies on this)
* ARRAY type (Phoenix supports this - arrays of fixed width data are just
  concatenated together, while arrays of variable length data are
  run-length-encoded with a double null byte terminator followed by an index
  of the start position of each element)
* Optional use of a mem-comparable composite row key as the value of a
  KeyValue (I think this makes things easier).

Thanks,
James
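To make the ARRAY item above a little more concrete, here is a minimal Python
sketch of the two layouts it describes. It is an illustration only, not
Phoenix's actual serialization: the function names, the 4-byte big-endian
element width, the single-null separator between elements, the 4-byte offsets
and the trailing element count are all assumptions, and the run-length
encoding step mentioned above is left out entirely.

```python
import struct

def encode_fixed_width(values):
    # Fixed-width elements are simply concatenated; element i starts at 4 * i.
    # (4-byte big-endian signed ints are an assumption made for this sketch.)
    return b"".join(struct.pack(">i", v) for v in values)

def decode_fixed_width(buf):
    return [struct.unpack_from(">i", buf, off)[0] for off in range(0, len(buf), 4)]

def encode_variable_width(items):
    # Hypothetical layout: each element followed by one 0x00 separator, one
    # extra 0x00 so the data section ends in a double null, then an index of
    # 4-byte start offsets, then a 4-byte element count.
    body = bytearray()
    offsets = []
    for item in items:
        if b"\x00" in item:
            raise ValueError("this sketch assumes elements contain no zero bytes")
        offsets.append(len(body))
        body += item + b"\x00"
    body += b"\x00"
    index = b"".join(struct.pack(">I", off) for off in offsets)
    return bytes(body) + index + struct.pack(">I", len(items))

def decode_variable_width(buf):
    (count,) = struct.unpack_from(">I", buf, len(buf) - 4)
    index_start = len(buf) - 4 - 4 * count
    offsets = [struct.unpack_from(">I", buf, index_start + 4 * i)[0]
               for i in range(count)]
    # An element ends one byte before the next element's start (its separator);
    # the last element ends two bytes before the index (the double null).
    ends = [off - 1 for off in offsets[1:]] + [index_start - 2]
    return [buf[start:end] for start, end in zip(offsets, ends)]
```

The offset index is what allows direct access to element i without scanning
the data section, which is presumably why an index is kept at all.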
On Tue, May 20, 2014 at 3:40 AM, Nick Dimiduk <[email protected]> wrote:
> That's correct Andy. We're locking down the "default" primitive type
> implementations going forward, while maintaining a flexible API such that we
> can support existing users who want to migrate to the applicable new
> features without rewriting existing data. Obviously some of those features
> will depend on the new encoding semantics, but I think we can offer a net
> improvement even for existing applications.
>
>
> On Mon, May 19, 2014 at 6:31 AM, Andrew Purtell <[email protected]>
> wrote:
>>
>> So if I can summarize this thread so far, we are going to try and hammer
>> out a types encoding spec agreeable to HBase, Phoenix, and Kite alike? As
>> opposed to select a particular implementation today as both spec and
>> reference implementation. Is that correct?
>>
>> If so, that sounds like a promising direction. The HBase types library has
>> the flexibility, if I understand Nick correctly, to accommodate whatever is
>> agreed upon and we could then provide a reference implementation as a
>> service for HBase users (or anyone) but there would be no strings attached,
>> multiple implementations of the spec would interoperate by definition.
>>
>>
>> > On May 19, 2014, at 3:20 AM, Nick Dimiduk <[email protected]> wrote:
>> >
>> > On Thu, May 15, 2014 at 9:32 AM, James Taylor
>> > <[email protected]> wrote:
>> >
>> >> @Nick - I like the abstraction of the DataType, but that doesn't solve
>> >> the problem for non Java usage.
>> >
>> >
>> > That's true. It's very much a Java construct. Likewise, Struct only codes
>> > for semantics; there's no encoding defined there. For correct
>> > multi-language support, we'll need to define these semantics the same way
>> > we do the encoding details so that implementations can reproduce them
>> > faithfully.
>> >
>> >> I'm also a bit worried that it might become a bottleneck for
>> >> implementors of the serialization spec as there are many different
>> >> platform specific operations that will likely be done on the row key.
>> >> We can try to get everything necessary in the DataType interface, but I
>> >> suspect that implementors will need to go under-the-covers at times
>> >> (rather than waiting for another release of the module that defines the
>> >> DataType interface) - might become a bottleneck.
>> >
>> > Time will tell. DataType is just an interface, after all. If there are
>> > things it's missing (as there surely are, for Phoenix...), it'll need to
>> > be extended locally until these features can be pushed down into HBase.
>> > HBase release managers have been faithful to the monthly release train,
>> > so I think in practice dependent projects won't have to wait long. I'm
>> > content to take this on a case-by-case basis and watch for a trend. Do
>> > you have an alternative idea?
>> >
>> >> On Wed, May 14, 2014 at 5:17 PM, Nick Dimiduk <[email protected]>
>> >> wrote:
>> >>
>> >>> On Tue, May 13, 2014 at 3:35 PM, Ryan Blue <[email protected]> wrote:
>> >>>
>> >>>
>> >>>> I think there's a little confusion in what we are trying to
>> >>>> accomplish. What I want to do is to write a minimal specification for
>> >>>> how to store a set of types. I'm not trying to leave much
>> >>>> flexibility, what I want is clarity and simplicity.
>> >>>
>> >>> This is admirable and was my initial goal as well. The trouble is, you
>> >>> cannot please everyone, current users and new. So, we decided it was
>> >>> better to provide a pluggable framework for extension + some basic
>> >>> implementations than to implement a closed system.
>> >>>
>> >>>> This is similar to OrderedBytes work, but a subset of it.
>> >>>> A good example is that while it's possible to use different encodings
>> >>>> (avro, protobuf, thrift, ...) it isn't practical for an application
>> >>>> to support all of those encodings. So for interoperability between
>> >>>> Kite, Phoenix, and others, I want a set of requirements that is as
>> >>>> small as possible.
>> >>>
>> >>> Minimal is good. The surface area of o.a.h.h.types is as large as it
>> >>> is because there was always "just one more" type to support or
>> >>> encoding to provide.
>> >>>
>> >>>> To make the requirements small, I used off-the-shelf protobuf [1]
>> >>>> plus a small set of memcmp encodings: ints, floats, and binary. That
>> >>>> way, we don't have to talk about how to make a memcmp Date in bytes,
>> >>>> for example. A Date is an int, which we know how to encode, and we
>> >>>> can agree separately on how a Date is represented (e.g., Julian vs
>> >>>> unix epoch). [2] The same applies to binary, where the encoding
>> >>>> handles sorting and nulls, but not charsets.
>> >>>
>> >>> I think you should focus on the primitives you want to support. The
>> >>> compound type stuff (ie, "rowkey encodings") is a can of worms because
>> >>> you need to support existing users, new users, novice users, and
>> >>> advanced users. Hence the interop between the DataType interface and
>> >>> the Struct classes. These work together to support all of these
>> >>> use-cases with the same basic code. For example, the protobuf encoding
>> >>> of position|wire-type + encoded value is easily implemented using
>> >>> Struct.
>> >>>
>> >>> I firmly believe that we cannot dictate rowkey composition.
>> >>> Applications, however, are free to implement their own. By using the
>> >>> common DataType interface, they can all interoperate.
>> >>>
>> >>>> This is the largest reason why I didn't include OrderedBytes directly
>> >>>> in the spec.
>> >>>> For example, OB includes a varint that I don't think is needed. I
>> >>>> don't object to its inclusion in OB, but I think it isn't a necessary
>> >>>> requirement for implementing this spec.
>> >>>
>> >>> Again, the surface area is as it is because of community consensus
>> >>> during the first phase of implementation. That consensus disagrees
>> >>> with you.
>> >>>
>> >>>> I think there are 3 things to clear up:
>> >>>> 1. What types from OB are not included, and why?
>> >>>> 2. Why not use OB-style structs?
>> >>>> 3. Why choose protobuf for complex records?
>> >>>>
>> >>>> Does that sound like a reasonable direction to head with this
>> >>>> discussion?
>> >>>
>> >>> Yes, sounds great!
>> >>>
>> >>>> As far as the DataType API, I think that works great with what I'm
>> >>>> trying to do. We'd build a DataType implementation for the encoding
>> >>>> and the API will let applications handle the underlying encoding. And
>> >>>> other encoding strategies can be swapped in as well, if we want to
>> >>>> address shortcomings in this one, or have another for a different use
>> >>>> case.
>> >>>
>> >>> I'm quite pleased to hear that. Applications like Kite, Phoenix, Kiji
>> >>> are the target audience of the DataType API.
>> >>>
>> >>> Thank you for picking back up this baton. It's sat for too long.
>> >>>
>> >>> -n
>> >>>
>> >>>> On 05/13/2014 02:33 PM, Nick Dimiduk wrote:
>> >>>>
>> >>>>> Breaking off hackathon thread.
>> >>>>>
>> >>>>> The conversation around HBASE-8089 concluded with two points:
>> >>>>> - HBase should provide support for order-preserving encodings while
>> >>>>> not dropping support for the existing encoding formats.
>> >>>>> - HBase is not in the business of schema management; that is a
>> >>>>> responsibility left to application developers.
>> >>>>>
>> >>>>> To handle the first point, OrderedBytes is provided. To support the
>> >>>>> second, the DataType API is introduced.
>> >>>>> By introducing this layer above specific encoding formats, it gives
>> >>>>> us a hook for plugging in different implementations and for helper
>> >>>>> utilities to ship with HBase, such as HBASE-10091.
>> >>>>>
>> >>>>> Things get fuzzy around complex data types: pojos, compound rowkeys
>> >>>>> (a special case of pojo), maps/dicts, and lists/arrays. These types
>> >>>>> are composed of other types and have different requirements based on
>> >>>>> where in the schema they're used. Again, by falling back on the
>> >>>>> DataType API, we give application developers an "out" for doing what
>> >>>>> makes the most sense for them.
>> >>>>>
>> >>>>> For compound rowkeys, the Struct class is designed to fill in this
>> >>>>> gap, sitting between data encoding and schema expression. It gives
>> >>>>> the application implementer, the person managing the schema, enough
>> >>>>> flexibility to express the key encoding in terms of the component
>> >>>>> types. These components are not limited to the simple primitives
>> >>>>> already defined, but any DataType implementation. Order preservation
>> >>>>> is likely important here.
>> >>>>>
>> >>>>> For arrays/lists, there's no implementation yet, but you can see how
>> >>>>> it might be done if you have a look at Struct. Order preservation
>> >>>>> may or may not be important for arrays/lists.
>> >>>>>
>> >>>>> The situation for maps/dicts is similar to arrays/lists. The one
>> >>>>> complication is the case where you want to map to a column family.
>> >>>>> How can these APIs support this thing?
>> >>>>>
>> >>>>> Pojos are a little more complicated. Probably Struct is sufficient
>> >>>>> for basic cases, but it doesn't support nice features like
>> >>>>> versioning -- these are sacrificed in favor of order preservation.
>> >>>>> Luckily, there's plenty of tools out there for this already: Avro,
>> >>>>> MessagePack, Protobuf, Thrift, &c. There's no need to reinvent the
>> >>>>> wheel here. Application developers can implement the DataType API
>> >>>>> backed by their management tool of choice. I created HBASE-11161 and
>> >>>>> will post a patch shortly.
>> >>>>>
>> >>>>> Specific comments about the Hackathon notes inline.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Nick
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Ryan Blue
>> >>>> Software Engineer
>> >>>> Cloudera, Inc.
>> >>
>
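As an illustration of the memcmp integer encoding discussed in the quoted
thread (where a Date is stored as an int), here is a rough Python sketch of
one common way to make signed integers sort correctly under bytewise
comparison. It is not the OrderedBytes format or any agreed spec: the 8-byte
width, the function names, and the choice of days since the unix epoch for
dates are assumptions made only for the example.

```python
import struct
from datetime import date

SIGN_OFFSET = 1 << 63  # shifts the signed 64-bit range onto the unsigned range

def encode_int64(value):
    # Adding 2**63 is equivalent to flipping the sign bit of the two's
    # complement form, so big-endian bytes compare in numeric order.
    # struct.pack raises struct.error if value is outside the int64 range.
    return struct.pack(">Q", value + SIGN_OFFSET)

def decode_int64(buf):
    return struct.unpack(">Q", buf)[0] - SIGN_OFFSET

EPOCH = date(1970, 1, 1)  # assumption: the thread leaves Julian vs unix epoch open

def encode_date(d):
    # A date becomes an int (days since the epoch) and reuses the int encoding.
    return encode_int64((d - EPOCH).days)

# Sorting the encoded bytes yields the same order as sorting the numbers.
values = [3, -5, 0, -1, 2**63 - 1, -(2**63)]
by_bytes = [decode_int64(b) for b in sorted(encode_int64(v) for v in values)]
assert by_bytes == sorted(values)
```

Because the date representation sits on top of the integer encoding, changing
the epoch convention would not require a new byte-level encoding, which is the
separation of concerns the thread argues for.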
