Hi Nick,

Thanks for taking the time to look closely at this. It's great to see this discussion happening in depth.

I think there's a little confusion about what we are trying to accomplish. What I want to do is write a minimal specification for how to store a set of types. I'm not trying to leave much flexibility; what I want is clarity and simplicity.

This is similar to the OrderedBytes work, but a subset of it. A good example is that while it's possible to use different encodings (avro, protobuf, thrift, ...), it isn't practical for an application to support all of those encodings. So for interoperability between Kite, Phoenix, and others, I want a set of requirements that is as small as possible.

To make the requirements small, I used off-the-shelf protobuf [1] plus a small set of memcmp encodings: ints, floats, and binary. That way, we don't have to talk about how to make a memcmp Date in bytes, for example. A Date is an int, which we know how to encode, and we can agree separately on how a Date is represented (e.g., Julian vs unix epoch). [2] The same applies to binary, where the encoding handles sorting and nulls, but not charsets.
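
To illustrate the memcmp point, here is a minimal Python sketch of one common way to make a signed 64-bit int sort correctly under bytewise comparison: offset the value by 2^63 (equivalent to flipping the sign bit) and write it big-endian. This is not the encoding from the spec or from OrderedBytes. The function names are hypothetical, and representing a Date as days since the unix epoch is only an assumption, since that choice is exactly what would be agreed on separately.

```python
import struct
from datetime import date

_OFFSET = 1 << 63  # maps the signed 64-bit range onto the unsigned range


def encode_int64(value: int) -> bytes:
    """Encode a signed 64-bit int so that bytewise order matches numeric order."""
    if not -_OFFSET <= value < _OFFSET:
        raise ValueError("value does not fit in a signed 64-bit int")
    # Adding 2^63 is the same as flipping the two's-complement sign bit;
    # big-endian output puts the most significant byte first for memcmp.
    return struct.pack(">Q", value + _OFFSET)


def decode_int64(data: bytes) -> int:
    """Inverse of encode_int64."""
    (raw,) = struct.unpack(">Q", data)
    return raw - _OFFSET


def encode_date(d: date) -> bytes:
    """Hypothetical Date encoding: days since the unix epoch (an assumption),
    stored with the int encoding above."""
    return encode_int64((d - date(1970, 1, 1)).days)


values = [42, -1, 0, -(2**63), 2**63 - 1]
# Python compares bytes lexicographically as unsigned values, like memcmp.
assert sorted(values, key=encode_int64) == sorted(values)
```

Because the Date is just an int underneath, its sort order comes directly from the int encoding, and nothing Date-specific has to be defined at the byte level.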

This is the largest reason why I didn't include OrderedBytes directly in the spec. For example, OB includes a varint that I don't think is needed. I don't object to its inclusion in OB, but I think it isn't a necessary requirement for implementing this spec.

I think there are 3 things to clear up:
1. What types from OB are not included, and why?
2. Why not use OB-style structs?
3. Why choose protobuf for complex records?

Does that sound like a reasonable direction to head with this discussion?

As for the DataType API, I think it works great with what I'm trying to do. We'd build a DataType implementation for the encoding, and the API will let applications handle the underlying encoding. Other encoding strategies can be swapped in as well, if we want to address shortcomings in this one or have another for a different use case.

rb

[1]: I think there's some confusion around the protobuf part. I'm saying we should use standard protobuf so we can reuse existing libraries.
[2]: We also know that a Date can be incremented, for example, because an int can be.

On 05/13/2014 02:33 PM, Nick Dimiduk wrote:
Breaking off hackathon thread.

The conversation around HBASE-8089 concluded with two points:
  - HBase should provide support for order-preserving encodings while
not dropping support for the existing encoding formats.
  - HBase is not in the business of schema management; that is a
responsibility left to application developers.

To handle the first point, OrderedBytes is provided. To support
the second, the DataType API is introduced. By introducing this layer
above specific encoding formats, it gives us a hook for plugging in
different implementations and for helper utilities to ship with HBase,
such as HBASE-10091.

Things get fuzzy around complex data types: pojos, compound rowkeys (a
special case of pojo), maps/dicts, and lists/arrays. These types are
composed of other types and have different requirements based on where
in the schema they're used. Again, by falling back on the DataType API,
we give application developers an "out" for doing what makes the most
sense for them.

For compound rowkeys, the Struct class is designed to fill in this gap,
sitting between data encoding and schema expression. It gives the
application implementer, the person managing the schema, enough
flexibility to express the key encoding in terms of the component types.
These components are not limited to the simple primitives already
defined, but any DataType implementation. Order preservation is likely
important here.

For arrays/lists, there's no implementation yet, but you can see how it
might be done if you have a look at struct. Order preservation may or
may not be important for arrays/lists.

The situation for maps/dicts is similar to arrays/lists. The one
complication is the case where you want to map to a column family. How
can these APIs support this thing?

Pojos are a little more complicated. Probably Struct is sufficient for
basic cases, but it doesn't support nice features like versioning --
these are sacrificed in favor of order preservation. Luckily, there's
plenty of tools out there for this already: Avro, MessagePack, Protobuf,
Thrift, &c. There's no need to reinvent the wheel here. Application
developers can implement the DataType API backed by their management
tool of choice. I created HBASE-11161 and will post a patch shortly.

Specific comments about the Hackathon notes inline.

Thanks,
Nick


--
Ryan Blue
Software Engineer
Cloudera, Inc.
