2014 notes

Ryan Blue Tue, 13 May 2014 16:29:31 -0700

Here are a few more specific responses.

Hopefully this clears up some remaining points in the context of my last post.

Why not use protobuf directly instead of reimplementing a slight
variation of their format?

I intend to use protobuf directly for compound values. It isn't practical right now for keys because protobuf doesn't have value encodings that are memcmp, nor are its tags memcmp for fields > 16.
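To illustrate the tag problem, here is a minimal sketch that encodes protobuf field tags by hand, following the documented wire format: a tag is the varint of (field_number << 3) | wire_type. The class and method names are hypothetical and nothing here uses the protobuf library itself. Because varints put the low-order 7 bits first, two-byte tags stop sorting by field number under a byte-wise comparison.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of protobuf or HBase.
final class ProtobufTagOrder {
    /** Encodes a field tag as the varint of (fieldNumber << 3) | wireType. */
    static byte[] encodeTag(int fieldNumber, int wireType) {
        int value = (fieldNumber << 3) | wireType;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Varint: emit 7 bits at a time, least significant group first,
        // setting the high bit on every byte except the last.
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] field17 = encodeTag(17, 0); // bytes 0x88 0x01
        byte[] field32 = encodeTag(32, 0); // bytes 0x80 0x02
        // Byte-wise (memcmp-style) comparison puts field 32 before field 17.
        System.out.println(Arrays.compareUnsigned(field17, field32) > 0); // prints true
    }
}
```

Single-byte tags (fields 1 through 15) do sort by field number, which is why the problem only shows up for larger field numbers.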

    * memcmp encodings for primitives in cells desired for phoenix (2ndary
    indices?)

This sounds like a Phoenix-specific decision.

I think it's okay for the spec to optimize for certain patterns. Using the memcmp encodings in primitive cells allows us to do value comparison on encoded bytes and speed up scans. I was under the impression that this is something Phoenix does to speed up results, so we included it.
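As a rough illustration of what a memcmp encoding buys, here is a sketch of one common technique for signed 32-bit ints: write the value big-endian with the sign bit flipped, so that unsigned byte-wise comparison matches numeric order. This is an assumption chosen for illustration, not necessarily the encoding the spec or Phoenix uses, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of an order-preserving int encoding; not the spec's definition.
final class MemcmpInt {
    /** Big-endian bytes with the sign bit flipped, so negatives sort first. */
    static byte[] encode(int v) {
        // ByteBuffer is big-endian by default.
        return ByteBuffer.allocate(4).putInt(v ^ Integer.MIN_VALUE).array();
    }

    /** Reverses encode() by flipping the sign bit back. */
    static int decode(byte[] b) {
        return ByteBuffer.wrap(b).getInt() ^ Integer.MIN_VALUE;
    }

    /** Compares two encoded values without decoding them, like memcmp. */
    static int compareEncoded(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }
}
```

With an encoding like this, a scan filter can compare a cell's raw bytes against an encoded bound and never has to deserialize the value.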


If we want to optimize for something else instead, what should we choose?

OrderedBytes implements a bit-shifting strategy for this.
{FixedLength,Terminated}Wrapper are provided to add flexibility. Ryan
has suggested a variation of run-length encoding as another alternative,
something we could add if there's sufficient need.

We went with the run-length encoding variant because in most cases, it decreases the size of the data or doesn't increase it too much. It increases the size only when there are single null bytes, in which case it adds a byte for each single null. Size is the same or reduced with two or more null bytes.

The reason for choosing this over the OB type is to support null bytes, and because OB adds ceil(size / 7) + 1 bytes to each value, and requires bit shifts to encode and decode.
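The size trade-off can be put into numbers with a small sketch. It only restates the two figures given above: OB adds ceil(size / 7) + 1 bytes, and the run-length variant adds at most one byte per isolated null. It does not implement either encoding, and the class and method names are hypothetical.

```java
// Hypothetical helper that compares overhead using the figures stated in this message.
final class EncodingOverhead {
    /** Bytes added by the OB type for a value of the given size: ceil(size / 7) + 1. */
    static int orderedBytesOverhead(int size) {
        return (size + 6) / 7 + 1;
    }

    /**
     * Counts null bytes that are not adjacent to another null. Each one costs
     * an extra byte in the run-length variant, so this is the worst-case growth;
     * runs of two or more nulls stay the same size or shrink.
     */
    static int isolatedNulls(byte[] value) {
        int count = 0;
        int i = 0;
        while (i < value.length) {
            if (value[i] != 0) {
                i++;
                continue;
            }
            int start = i;
            while (i < value.length && value[i] == 0) {
                i++;
            }
            if (i - start == 1) {
                count++;
            }
        }
        return count;
    }
}
```

For example, a 70-byte value with no nulls grows by 11 bytes under the OB figure and not at all under the run-length figure; the run-length variant only loses when roughly one byte in seven is an isolated null.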

    * do we include 1 byte and 2 byte ints?

Following the initial commit of HBASE-8201, these were requested in HBASE-9369.


+1 for small ints

The above date question is a perfect example of why I think it's
important that we have the DataType interface. Having the interface
means an application can implement its own types when their needs are
too unique for commit to HBase. Other applications can still use that
implementation by including the relevant application jars. They enjoy
interoperability by agreeing on the DataType implementation, not on
something provided out of the box by a particular HBase version.

I think this spec would be a stronger interop guarantee. We should discuss whether we can support this spec along with existing data, although I suspect we probably can't.


rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: [common type encoding breakout] Re: HBase Hackathon @ Salesforce 05/06/2014 notes
