2014 notes

Ryan Blue Tue, 13 May 2014 15:44:30 -0700

Hi Nick,

Thanks for taking the time to look closely at this. It's great to see this discussion happening in depth.

I think there's a little confusion about what we are trying to accomplish. What I want to do is write a minimal specification for how to store a set of types. I'm not trying to leave much flexibility; what I want is clarity and simplicity.

This is similar to the OrderedBytes work, but a subset of it. A good example is that while it's possible to use different encodings (avro, protobuf, thrift, ...), it isn't practical for an application to support all of those encodings. So for interoperability between Kite, Phoenix, and others, I want a set of requirements that is as small as possible.

To make the requirements small, I used off-the-shelf protobuf [1] plus a small set of memcmp encodings: ints, floats, and binary. That way, we don't have to talk about how to make a memcmp Date in bytes, for example. A Date is an int, which we know how to encode, and we can agree separately on how a Date is represented (e.g., Julian vs unix epoch). [2] The same applies to binary, where the encoding handles sorting and nulls, but not charsets.
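To illustrate what a memcmp encoding for ints and floats can look like, here is a minimal Python sketch. It uses the common sign-bit-flip technique on big-endian bytes; it is not taken from the spec or from OrderedBytes, and the function names are hypothetical. The Date representation (days since the unix epoch) is only an assumption for the example, since that choice is meant to be agreed on separately.

```python
import datetime
import struct

SIGN_BIT = 1 << 63
ALL_BITS = (1 << 64) - 1


def encode_int64(value: int) -> bytes:
    """Encode a signed 64-bit int so that bytewise order matches numeric order."""
    # Adding 2**63 flips the sign bit, so negatives sort before non-negatives.
    return struct.pack(">Q", value + SIGN_BIT)


def decode_int64(data: bytes) -> int:
    return struct.unpack(">Q", data)[0] - SIGN_BIT


def encode_float64(value: float) -> bytes:
    """Encode an IEEE 754 double so that bytewise order matches numeric order."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    if bits & SIGN_BIT:
        bits ^= ALL_BITS   # negative: invert every bit to reverse the order
    else:
        bits |= SIGN_BIT   # non-negative: set the sign bit to sort after negatives
    return struct.pack(">Q", bits)


def encode_date(day: datetime.date) -> bytes:
    """Assumption for this sketch: a Date is the number of days since 1970-01-01."""
    return encode_int64((day - datetime.date(1970, 1, 1)).days)
```

Because the Date is just an int underneath, it inherits the int's ordering, and nothing date-specific is needed at the byte level.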

This is the largest reason why I didn't include OrderedBytes directly in the spec. For example, OB includes a varint that I don't think is needed. I don't object to its inclusion in OB, but I think it isn't a necessary requirement for implementing this spec.


I think there are 3 things to clear up:
1. What types from OB are not included, and why?
2. Why not use OB-style structs?
3. Why choose protobuf for complex records?

Does that sound like a reasonable direction to head with this discussion?

As far as the DataType API, I think that works great with what I'm trying to do. We'd build a DataType implementation for the encoding, and the API will let applications handle the underlying encoding. And other encoding strategies can be swapped in as well, if we want to address shortcomings in this one, or have another for a different use case.
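As a rough illustration of that swapping, here is a small Python sketch. The Codec class is a hypothetical stand-in and not the HBase DataType interface, which is a Java API with more methods. JSON stands in for a record format such as protobuf purely to keep the example self-contained.

```python
import json
import struct
from abc import ABC, abstractmethod


class Codec(ABC):
    """Hypothetical stand-in for a DataType-like encoding interface."""

    order_preserving = False

    @abstractmethod
    def encode(self, value) -> bytes: ...

    @abstractmethod
    def decode(self, data: bytes): ...


class MemcmpInt64(Codec):
    """Order-preserving signed 64-bit int (sign bit flipped, big-endian)."""

    order_preserving = True

    def encode(self, value) -> bytes:
        return struct.pack(">Q", value + (1 << 63))

    def decode(self, data: bytes):
        return struct.unpack(">Q", data)[0] - (1 << 63)


class JsonRecord(Codec):
    """Placeholder for a complex-record encoding; not order-preserving."""

    def encode(self, value) -> bytes:
        return json.dumps(value, sort_keys=True).encode("utf-8")

    def decode(self, data: bytes):
        return json.loads(data.decode("utf-8"))


def put(table: dict, key_codec: Codec, value_codec: Codec, key, value) -> None:
    """The application only sees the codecs, never the byte layout."""
    table[key_codec.encode(key)] = value_codec.encode(value)


def get(table: dict, key_codec: Codec, value_codec: Codec, key):
    return value_codec.decode(table[key_codec.encode(key)])
```

Replacing JsonRecord with another Codec changes the stored bytes without touching put or get, which is the kind of hook described above.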

rb

[1]: I think there's some confusion around the protobuf part. I'm saying we should use standard protobuf so we can reuse existing libraries.
[2]: We also know that a Date can be incremented, for example, because an int can be.


On 05/13/2014 02:33 PM, Nick Dimiduk wrote:

Breaking off hackathon thread.

The conversation around HBASE-8089 concluded with two points:
  - HBase should provide support for order-preserving encodings while
not dropping support for the existing encoding formats.
  - HBase is not in the business of schema management; that is a
responsibility left to application developers.

To handle the first point, OrderedBytes is provided. To support
the second, the DataType API is introduced. By introducing this layer
above specific encoding formats, it gives us a hook for plugging in
different implementations and for helper utilities to ship with HBase,
such as HBASE-10091.

Things get fuzzy around complex data types: pojos, compound rowkeys (a
special case of pojo), maps/dicts, and lists/arrays. These types are
composed of other types and have different requirements based on where
in the schema they're used. Again, by falling back on the DataType API,
we give application developers an "out" for doing what makes the most
sense for them.

For compound rowkeys, the Struct class is designed to fill in this gap,
sitting between data encoding and schema expression. It gives the
application implementer, the person managing the schema, enough
flexibility to express the key encoding in terms of the component types.
These components are not limited to the simple primitives already
defined, but any DataType implementation. Order preservation is likely
important here.
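A minimal Python sketch of why composition has to be done carefully to preserve order in a compound key. It is not the Struct implementation. The escaping scheme (0x00 written as 0x00 0x01, field terminated by 0x00 0x00) is one illustrative choice, the function names are hypothetical, and nulls are left out.

```python
import struct


def encode_int32(value: int) -> bytes:
    """Order-preserving signed 32-bit int (sign bit flipped, big-endian)."""
    return struct.pack(">I", value + (1 << 31))


def encode_bytes(value: bytes) -> bytes:
    """Variable-length bytes that stay ordered when followed by other fields.

    Each 0x00 is escaped as 0x00 0x01 and the field ends with 0x00 0x00,
    so a shorter value always sorts before any longer value it prefixes.
    """
    return value.replace(b"\x00", b"\x00\x01") + b"\x00\x00"


def encode_key(name: bytes, timestamp: int) -> bytes:
    """Compound key: component encodings concatenated in field order."""
    return encode_bytes(name) + encode_int32(timestamp)
```

Sorting the encoded keys bytewise then gives the same order as sorting the (name, timestamp) pairs field by field, which is what a compound rowkey needs for range scans.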

For arrays/lists, there's no implementation yet, but you can see how it
might be done if you have a look at struct. Order preservation may or
may not be important for arrays/lists.

The situation for maps/dicts is similar to arrays/lists. The one
complication is the case where you want to map to a column family. How
can these APIs support this thing?

Pojos are a little more complicated. Probably Struct is sufficient for
basic cases, but it doesn't support nice features like versioning --
these are sacrificed in favor of order preservation. Luckily, there are
plenty of tools out there for this already: Avro, MessagePack, Protobuf,
Thrift, &c. There's no need to reinvent the wheel here. Application
developers can implement the DataType API backed by their management
tool of choice. I created HBASE-11161 and will post a patch shortly.

Specific comments about the Hackathon notes inline.

Thanks,
Nick



--
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: [common type encoding breakout] Re: HBase Hackathon @ Salesforce 05/06/2014 notes
