Re: HBase Types: Explicit Null Support

Nick Dimiduk Thu, 04 Apr 2013 19:55:45 -0700

On Thu, Apr 4, 2013 at 7:18 PM, James Taylor <jtay...@salesforce.com> wrote:


> Would it make sense to clean up the APIs a bit and post just the type
> system code somewhere to give us something to poke holes at?
>

That could be useful. I've been experimenting with implementations as I
update the spec doc and pushing as I go to
https://github.com/ndimiduk/serialization-play. I can make you a
collaborator or you can host your own repository, as you prefer.

> On 04/04/2013 06:49 PM, Nick Dimiduk wrote:
>
>> On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtay...@salesforce.com>
>> wrote:
>>
>>> Maybe if we can keep nullability separate from the
>>> serialization/deserialization, we can come up with a solution that works?
>>>
>>
>> I think implied null could work, but let's build out the matrix. I see two
>> kinds of types: fixed- and variable-width. These types are used in two
>> scenarios: on their own or as part of a compound type.
>>
>> A fixed-width type used standalone can infer null from the absence of a
>> value. When used in a compound type, absence isn't enough to indicate null
>> unless it's the last value in the sequence. To support a null field in the
>> middle of the compound type, the compound type is forced to explicitly mark
>> the field as null. The only solution I can think of (without sacrificing
>> the full value range, per my original question) is to write the full type
>> width bytes, followed by an isNull byte. Thus, for example, the INT type
>> consumes 4 bytes when serialized stand-alone, but 5 bytes when composed.
>>
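To make the 4-versus-5 byte accounting above concrete, here is a minimal Java sketch. NullableIntSketch and its method names are hypothetical (they are not taken from HBase, Phoenix, or Orderly), and the sketch ignores sort order entirely. It only shows absence-as-null for the standalone case and a trailing isNull byte for the composed case.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical illustration only; not an HBase, Phoenix, or Orderly API. */
final class NullableIntSketch {
    static final int STANDALONE_WIDTH = 4;
    static final int COMPOSED_WIDTH = 5; // 4 value bytes + 1 isNull byte

    /** Standalone: null is the absence of a value, so it encodes to zero bytes. */
    static byte[] encodeStandalone(Integer v) {
        if (v == null) return new byte[0];
        return ByteBuffer.allocate(STANDALONE_WIDTH).putInt(v).array();
    }

    static Integer decodeStandalone(byte[] b) {
        return b.length == 0 ? null : ByteBuffer.wrap(b).getInt();
    }

    /** Composed: always 5 bytes, so later fields stay at predictable offsets. */
    static byte[] encodeComposed(Integer v) {
        ByteBuffer buf = ByteBuffer.allocate(COMPOSED_WIDTH);
        buf.putInt(v == null ? 0 : v);        // placeholder bytes when null
        buf.put((byte) (v == null ? 1 : 0));  // isNull flag
        return buf.array();
    }

    static Integer decodeComposed(byte[] b, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(b, offset, COMPOSED_WIDTH);
        int value = buf.getInt();
        boolean isNull = buf.get() != 0;
        return isNull ? null : value;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeStandalone(7)));
        System.out.println(Arrays.toString(encodeComposed(null)));
    }
}
```

Because the flag lives outside the four value bytes, the full int range (including Integer.MIN_VALUE) stays representable in the composed form.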
>> James, how does Phoenix handle a null fixed-width rowkey component? I
>> don't see that implemented in the PDataType enum.
>>
>> Variable-width types used standalone are simple enough because HBase
>> handles arbitrary-length byte[]'s everywhere. Variable-width in a composite
>> is a problem. Phoenix forces these values to appear only in the last
>> position of the composite, as I understand it. Orderly provides explicit
>> null and termination bytes by taking advantage of a feature of UTF-8
>> encoding. Support for bytes is equally ugly (but clever) in that byte
>> digits are encoded in BCD. Both of these approaches slightly bloat the
>> serialized representation over the natural representation, but they allow
>> the variable-length types to be used anywhere within the compound type. As
>> an added bonus regarding code maintainability, their serialization is
>> entirely self-contained within the type. That's in contrast to the
>> fixed-width type implementation described above, where null is explicitly
>> encoded by the compound type.
>>
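One way to read the UTF-8 trick: well-formed UTF-8 never uses byte values above 0xF4, so every byte can be shifted up by 2, which frees 0x00 for null and 0x01 for a terminator. The Java sketch below follows that reading. Whether it matches Orderly's actual byte layout is an assumption, and TerminatedStringSketch and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration only; the byte layout is an assumption. */
final class TerminatedStringSketch {
    static final byte NULL = 0x00;       // sorts before everything else
    static final byte TERMINATOR = 0x01; // sorts before any shifted content byte

    static byte[] encode(String s) {
        if (s == null) return new byte[] { NULL };
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[utf8.length + 1];
        for (int i = 0; i < utf8.length; i++) {
            out[i] = (byte) (utf8[i] + 2); // 0x00..0xF4 becomes 0x02..0xF6
        }
        out[utf8.length] = TERMINATOR;
        return out;
    }

    /** Number of bytes the field starting at offset occupies. */
    static int encodedLength(byte[] b, int offset) {
        if (b[offset] == NULL) return 1;
        int end = offset;
        while (b[end] != TERMINATOR) end++; // assumes well-formed input
        return end - offset + 1;
    }

    static String decode(byte[] b, int offset) {
        if (b[offset] == NULL) return null;
        int len = encodedLength(b, offset) - 1; // drop the terminator
        byte[] utf8 = new byte[len];
        for (int i = 0; i < len; i++) utf8[i] = (byte) (b[offset + i] - 2);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Unsigned lexicographic comparison, the order HBase applies to byte[]. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("ab")));
        System.out.println(Arrays.toString(encode(null)));
    }
}
```

Since the field carries its own end marker, it can sit anywhere in a composite, at the cost of one extra byte per value.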
>> My opinion is that the computational and storage overhead imposed by
>> Orderly's implementation is worth the trade-off for flexibility in user
>> consumption. Correct me if I'm wrong, James, but you're saying, from your
>> experience with Phoenix, users are willing to work within that constraint?
>>
>> Thanks,
>> Nick
>>
>>> On 04/01/2013 11:29 PM, Jesse Yates wrote:
>>>
>>>> Actually, that isn't all that far-fetched of a format Matt - pretty
>>>> common anytime anyone wants to do sortable lat/long (*cough* three
>>>> letter agencies *cough*).
>>>>
>>>> Wouldn't we get the same by providing a simple set of libraries (ala
>>>> orderly + other HBase useful things) and then still give access to the
>>>> underlying byte array? Perhaps a nullable key type in that lib makes
>>>> sense if lots of people need it and it would be nice to have standard
>>>> libraries so tools could interop much more easily.
>>>> -------------------
>>>> Jesse Yates
>>>> @jesse_yates
>>>> jyates.github.com
>>>>
>>>>
>>>> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mcor...@hotpads.com>
>>>> wrote:
>>>>
>>>>> Ah, I didn't even realize SQL allowed null key parts.  Maybe a goal of
>>>>> the interfaces should be to provide first-class support for custom user
>>>>> types in addition to the standard ones included.  Part of the power of
>>>>> HBase's plain byte[] keys is that users can concoct the perfect key for
>>>>> their data type.  For example, I have a lot of geographic data where I
>>>>> interleave latitude/longitude bits into a sortable 64 bit value that
>>>>> would probably never be included in a standard library.
>>>>>
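For readers unfamiliar with the interleaving idea, a minimal Java sketch of a Z-order (Morton) key follows. ZOrderSketch, quantize, and the choice to put latitude in the higher bit of each pair are all assumptions for illustration; this is not the hotpads implementation.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration only; not the hotpads data-tools code. */
final class ZOrderSketch {
    /** Maps value in [min, max] onto the unsigned 32-bit range. */
    static int quantize(double value, double min, double max) {
        double fraction = (value - min) / (max - min);
        return (int) (long) (fraction * 0xFFFFFFFFL);
    }

    /** Spreads the 32 bits of v into the even bit positions of a long. */
    static long spread(int v) {
        long x = v & 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2))  & 0x3333333333333333L;
        x = (x | (x << 1))  & 0x5555555555555555L;
        return x;
    }

    /** Interleaves bits: latitude in odd positions, longitude in even ones. */
    static long interleave(int lat, int lon) {
        return (spread(lat) << 1) | spread(lon);
    }

    /** Big-endian bytes, so unsigned byte[] order follows the interleaved value. */
    static byte[] toKey(double latDeg, double lonDeg) {
        long z = interleave(quantize(latDeg, -90.0, 90.0),
                            quantize(lonDeg, -180.0, 180.0));
        return ByteBuffer.allocate(8).putLong(z).array();
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(interleave(3, 3)));
        System.out.println(toKey(37.77, -122.42).length);
    }
}
```

Note that every one of the 2^64 bit patterns is a valid key here, which is exactly why a custom type like this has no spare encoding for null.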
>>>>>
>>>>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <enis....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think having Int32, and NullableInt32 would support minimum
>>>>>> overhead, as well as allowing SQL semantics.
>>>>>>
>>>>>>
>>>>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <ndimi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Furthermore, is it more important to support null values than to
>>>>>>> squeeze all representations into minimum size (4-bytes for int32, &c.)?
>>>>>>
>>>>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <ndimi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtay...@salesforce.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> From the SQL perspective, handling null is important.
>>>>>>>>
>>>>>>>> From your perspective, it is critical to support NULLs, even at the
>>>>>>>> expense of fixed-width encodings at all or supporting representation
>>>>>>>> of a full range of values. That is, you'd rather be able to represent
>>>>>>>> NULL than -2^31?
>>>>>>>>
>>>>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the thoughtful response (and code!).
>>>>>>>>>>
>>>>>>>>>> I'm thinking I will press forward with a base implementation that
>>>>>>>>>> does not support nulls. The idea is to provide an extensible set of
>>>>>>>>>> interfaces, so I think this will not box us into a corner later.
>>>>>>>>>> That is, a mirroring package could be implemented that supports
>>>>>>>>>> null values and accepts the relevant trade-offs.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcor...@hotpads.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I spent some time this weekend extracting bits of our
>>>>>>>>>>> serialization code to a public github repo at
>>>>>>>>>>> http://github.com/hotpads/data-tools .
>>>>>>>>>>> Contributions are welcome - I'm sure we all have this stuff laying
>>>>>>>>>>> around.
>>>>>>>>>>>
>>>>>>>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>>>>>>>> *
>>>>>>>>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>>>>>>>> *
>>>>>>>>>>> https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>>>>>>>
>>>>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>>>>>>>> nullability as the rule since it can cause unexpected behavior and
>>>>>>>>>>> confusion.  It's cleaner to provide a wrapper class (so both
>>>>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines
>>>>>>>>>>> the behavior, and costs a little more in performance.  If the user
>>>>>>>>>>> can't find a pre-made wrapper class, it's not very difficult for
>>>>>>>>>>> each user to provide their own interpretation of null and check for
>>>>>>>>>>> it themselves.
>>>>>>>>>>>
>>>>>>>>>>> If you reject nullability, the question becomes what to do in
>>>>>>>>>>> situations where you're implementing existing interfaces that
>>>>>>>>>>> accept nullable params.  The LongArrayList above implements
>>>>>>>>>>> List<Long> which requires an add(Long) method.  In the above
>>>>>>>>>>> implementation I chose to swap nulls with Long.MIN_VALUE, however
>>>>>>>>>>> I'm now thinking it best to force the user to make that swap and
>>>>>>>>>>> then throw IllegalArgumentException if they pass null.
>>>>>>>>>>
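The "throw on null" option Matt describes can be sketched in a few lines of Java. StrictLongListSketch is a hypothetical stand-in, not the LongArrayList from the hotpads repository; it only shows a List<Long> backed by a primitive long[] whose add(Long) rejects null instead of silently substituting Long.MIN_VALUE.

```java
import java.util.AbstractList;
import java.util.Arrays;

/** Hypothetical illustration only; not the hotpads LongArrayList. */
final class StrictLongListSketch extends AbstractList<Long> {
    private long[] values = new long[8];
    private int size;

    @Override
    public boolean add(Long v) {
        if (v == null) {
            // The caller must choose a sentinel explicitly if one is wanted.
            throw new IllegalArgumentException("null is not supported");
        }
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2); // grow the backing array
        }
        values[size++] = v; // unboxed into the primitive array
        modCount++;
        return true;
    }

    @Override
    public Long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return values[index];
    }

    @Override
    public int size() {
        return size;
    }

    public static void main(String[] args) {
        StrictLongListSketch list = new StrictLongListSketch();
        list.add(Long.MIN_VALUE); // a legal value, no longer confused with null
        System.out.println(list);
    }
}
```

With this shape, Long.MIN_VALUE stays an ordinary value, and a nullable variant can live in a separate wrapper class that pays for its own bookkeeping.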
>>>>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil
>>>>>>>>>>> <doug.m...@explorysmedical.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hmmm... good question.
>>>>>>>>>>>>
>>>>>>>>>>>> I think that fixed width support is important for a great many
>>>>>>>>>>>> rowkey construct cases, so I'd rather see something like losing
>>>>>>>>>>>> MIN_VALUE and keeping fixed width.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <ndimi...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Heya,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thinking about data types and serialization. I think null support
>>>>>>>>>>>>> is an important characteristic for the serialized
>>>>>>>>>>>>> representations, especially when considering the compound type.
>>>>>>>>>>>>> However, doing so is directly incompatible with fixed-width
>>>>>>>>>>>>> representations for numerics. For instance, if we want to have a
>>>>>>>>>>>>> fixed-width signed long stored on 8-bytes, where do you put null?
>>>>>>>>>>>>> float and double types can cheat a little by folding negative and
>>>>>>>>>>>>> positive NaN's into a single representation (this isn't strictly
>>>>>>>>>>>>> correct!), leaving a place to represent null. In the long example
>>>>>>>>>>>>> case, the obvious choice is to reduce MAX_VALUE or increase
>>>>>>>>>>>>> MIN_VALUE by one. This will allocate an additional encoding which
>>>>>>>>>>>>> can be used for null. My experience working with scientific data,
>>>>>>>>>>>>> however, makes me wince at the idea.
>>>>>>>>>>>>>
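The NaN-folding idea can be sketched with the JDK alone: Double.doubleToLongBits already collapses every NaN to the canonical pattern 0x7ff8000000000000L, so any other NaN bit pattern is free to stand for null. NullableDoubleSketch and the particular NULL_BITS value are hypothetical choices, and the sketch makes no attempt at order preservation.

```java
/** Hypothetical illustration only; NULL_BITS is an arbitrary spare NaN pattern. */
final class NullableDoubleSketch {
    /** A NaN bit pattern that doubleToLongBits never returns. */
    static final long NULL_BITS = 0x7ff8000000000001L;

    /** Still 8 bytes wide: null borrows one of the folded-away NaN encodings. */
    static long encode(Double v) {
        // doubleToLongBits canonicalizes all NaNs, so no value collides with NULL_BITS.
        return v == null ? NULL_BITS : Double.doubleToLongBits(v);
    }

    static Double decode(long bits) {
        return bits == NULL_BITS ? null : Double.longBitsToDouble(bits);
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(encode(Double.NaN)));
        System.out.println(Long.toHexString(encode(null)));
        System.out.println(decode(encode(1.5)));
    }
}
```

The cost is the one noted above: distinct NaN payloads and signs are no longer distinguishable after a round trip.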
>>>>>>>>>>>>> The variable-width encodings have it a little easier. There's
>>>>>>>>>>>>> already enough going on that it's simpler to make room.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Remember, the final goal is to support order-preserving
>>>>>>>>>>>>> serialization. This imposes some limitations on our encoding
>>>>>>>>>>>>> strategies. For instance, it's not enough to simply encode null,
>>>>>>>>>>>>> it really needs to be encoded as 0x00 so as to sort
>>>>>>>>>>>>> lexicographically earlier than any other value.
>>>>>>>>>>>>
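A minimal Java sketch of the "increase MIN_VALUE by one" option combined with the ordering requirement: flipping the sign bit makes big-endian bytes sort in numeric order under unsigned comparison, and it maps Long.MIN_VALUE to eight zero bytes, which can then be reserved for null. OrderedNullableLongSketch is a hypothetical name, and rejecting Long.MIN_VALUE is the trade-off being assumed here.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration only; gives up Long.MIN_VALUE to make room for null. */
final class OrderedNullableLongSketch {
    /** Always 8 bytes; null is all zeros and therefore sorts first. */
    static byte[] encode(Long v) {
        if (v == null) return new byte[8];
        if (v == Long.MIN_VALUE) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for null");
        }
        // Flipping the sign bit makes negative values sort before positive ones.
        return ByteBuffer.allocate(8).putLong(v ^ Long.MIN_VALUE).array();
    }

    static Long decode(byte[] b) {
        long raw = ByteBuffer.wrap(b).getLong();
        return raw == 0L ? null : raw ^ Long.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison, the order HBase applies to byte[]. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println(compare(encode(null), encode(-1L)) < 0);
        System.out.println(compare(encode(-1L), encode(1L)) < 0);
    }
}
```

The width stays fixed at 8 bytes; what is lost is exactly one value at the bottom of the range.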
>>>>>>>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>
