Here are a few more specific responses.
Hopefully this clears up some remaining points in the context of my last
post.
Why not use protobuf directly instead of reimplementing a slight
variation of their format?
I intend to use protobuf directly for compound values. It isn't
practical right now for keys because protobuf doesn't have value
encodings that are memcmp, nor are its tags memcmp for fields > 16.
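To make the memcmp point concrete, here is a small sketch of the
base-128 varint that protobuf uses for integers and field tags. The
helper names (varint, tag) are mine, not protobuf API calls; the point
is only that comparing the encoded bytes does not match numeric order.

```python
def varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf-style base-128 varint."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """Field tag: (field_number << 3) | wire_type, written as a varint."""
    return varint((field_number << 3) | wire_type)


# Python compares bytes lexicographically as unsigned values, like memcmp.
print(varint(255) < varint(256))  # False: b'\xff\x01' sorts after b'\x80\x02'
print(tag(17, 0) < tag(32, 0))    # False: b'\x88\x01' sorts after b'\x80\x02'
```

Because the low-order seven bits come first, a larger number can start
with a smaller byte, so neither values nor multi-byte tags sort
correctly under a bytewise comparison.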
* memcmp encodings for primitives in cells desired for phoenix (2ndary
indices?)
This sounds like a Phoenix-specific decision.
I think it's okay for the spec to optimize for certain patterns. Using
the memcmp encodings in primitive cells allows us to do value comparison
on encoded bytes and speed up scans. I was under the impression that
this is something Phoenix does to speed up results, so we included it.
If we want to optimize for something else instead, what should we choose?
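As an illustration of comparing values on encoded bytes, here is a
minimal sketch of one common order-preserving encoding for signed
64-bit integers (bias by 2^63, then write big-endian). This is a
generic technique, not necessarily the byte layout the spec uses, and
the function names are made up for the example.

```python
def encode_int64(v: int) -> bytes:
    """Map a signed 64-bit int to 8 bytes whose bytewise order matches
    numeric order: shift the range to 0..2**64-1, then go big-endian."""
    return (v + (1 << 63)).to_bytes(8, "big")


def decode_int64(b: bytes) -> int:
    """Inverse of encode_int64."""
    return int.from_bytes(b, "big") - (1 << 63)


values = [-5, 3, -1, 0, 2**40, -2**63, 2**63 - 1]

# Sort once by value and once by encoded bytes; the two orders agree.
by_value = sorted(values)
by_bytes = [decode_int64(b) for b in sorted(encode_int64(v) for v in values)]
print(by_value == by_bytes)  # True
```

With an encoding like this, a filter or scan can compare cell values
without decoding them first.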
OrderedBytes implements a bit-shifting strategy for this.
{FixedLength,Terminated}Wrapper are provided to add flexibility. Ryan
has suggested a variation of run-length encoding as another alternative,
something we could add if there's sufficient need.
We went with the run-length encoding variant because in most cases, it
decreases the size of the data or doesn't increase it too much. It
increases the size only when there are single null bytes, in which case
it adds a byte for each single null. Size is the same or reduced with
two or more null bytes.
The reason for choosing this over the OB type is to support null bytes,
and because OB adds ceil(size / 7) + 1 bytes to each value, and requires
bit shifts to encode and decode.
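To put numbers on the size comparison, here is a rough sketch. The
byte layout is an assumption made for illustration: each run of 1 to
255 null bytes becomes a 0x00 marker followed by a count byte. It only
reproduces the size behavior described above (one extra byte per
single null, same or smaller for longer runs); it does not address
terminators or sort order. ob_overhead simply restates the
ceil(size / 7) + 1 figure quoted above.

```python
def escape_null_runs(data: bytes) -> bytes:
    """Assumed layout: a run of n nulls (1..255) -> 0x00 followed by n."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] != 0:
            out.append(data[i])
            i += 1
            continue
        run = 0
        while i < len(data) and data[i] == 0 and run < 255:
            run += 1
            i += 1
        out += bytes((0, run))
    return bytes(out)


def unescape_null_runs(encoded: bytes) -> bytes:
    """Inverse of escape_null_runs."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == 0:
            out += bytes(encoded[i + 1])  # bytes(n) is n zero bytes
            i += 2
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)


def ob_overhead(size: int) -> int:
    """Extra bytes per value using the ceil(size / 7) + 1 figure."""
    return -(-size // 7) + 1


for raw in (b"a\x00b", b"a\x00\x00b", b"a\x00\x00\x00b"):
    print(len(raw), len(escape_null_runs(raw)), ob_overhead(len(raw)))
# prints:
# 3 4 2
# 4 4 2
# 5 4 2
```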
* do we include 1 byte and 2 byte ints?
Following the initial commit of HBASE-8201, these were requested in HBASE-9369.
+1 for small ints
The above date question is a perfect example of why I think it's
important that we have the DataType interface. Having the interface
means an application can implement its own types when its needs are
too specialized to commit to HBase. Other applications can still use that
implementation by including the relevant application jars. They enjoy
interoperability by agreeing on the DataType implementation, not on
something provided out of the box by a particular HBase version.
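As a rough picture of the interface argument, here is a sketch in
Python rather than Java. The DataType class below is a simplified
stand-in with only encode/decode, not the actual HBase interface, and
EpochDayType is a hypothetical application-defined date type.

```python
from abc import ABC, abstractmethod
from datetime import date, timedelta


class DataType(ABC):
    """Simplified stand-in for a pluggable type interface (assumption)."""

    @abstractmethod
    def encode(self, value) -> bytes: ...

    @abstractmethod
    def decode(self, data: bytes): ...


class EpochDayType(DataType):
    """Hypothetical application type: dates as biased days since
    1970-01-01, big-endian, so encoded bytes sort in date order."""

    _EPOCH = date(1970, 1, 1)
    _BIAS = 1 << 31  # shifts negative day counts into the unsigned range

    def encode(self, value: date) -> bytes:
        days = (value - self._EPOCH).days
        return (days + self._BIAS).to_bytes(4, "big")

    def decode(self, data: bytes) -> date:
        days = int.from_bytes(data, "big") - self._BIAS
        return self._EPOCH + timedelta(days=days)


codec = EpochDayType()
stored = codec.encode(date(2013, 9, 1))
print(codec.decode(stored))  # 2013-09-01
```

Two applications that both ship this class can read each other's
dates, whatever types a given HBase release provides out of the box.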
I think this spec would be a stronger interop guarantee. We should
discuss whether we can support this spec along with existing data,
although I suspect we probably can't.
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.