Re: HBase Types: Explicit Null Support

Jesse Yates Mon, 01 Apr 2013 23:30:49 -0700

Actually, that isn't all that far-fetched a format, Matt. It's pretty common
anytime anyone wants to do sortable lat/long (*cough* three-letter agencies
*cough*).
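
For readers unfamiliar with the trick, here is a minimal sketch of the kind of
interleaved lat/long key Matt describes in the quoted message. It is not his
code: the class name, method names, and the 32-bits-per-axis quantization are
all assumptions made for illustration. Each coordinate is scaled to an unsigned
32-bit integer, the bits are interleaved (a Z-order or Morton code), and the
result is written big-endian so that unsigned byte-wise comparison follows the
interleaved value.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of HBase or of Matt's data-tools repo.
final class GeoKeySketch {

    // Scale a coordinate in [min, max] to an unsigned 32-bit integer held in a long.
    static long quantize(double value, double min, double max) {
        double scaled = (value - min) / (max - min);   // 0.0 .. 1.0
        return (long) (scaled * 0xFFFFFFFFL);          // 0 .. 2^32 - 1
    }

    // Spread the low 32 bits of x so that bit i moves to bit 2*i.
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2))  & 0x3333333333333333L;
        x = (x | (x << 1))  & 0x5555555555555555L;
        return x;
    }

    // Latitude bits take the odd positions, longitude bits the even ones.
    static long interleave(double lat, double lon) {
        long latBits = quantize(lat, -90.0, 90.0);
        long lonBits = quantize(lon, -180.0, 180.0);
        return (spread(latBits) << 1) | spread(lonBits);
    }

    // Big-endian bytes: unsigned lexicographic order matches the unsigned 64-bit value.
    static byte[] toKey(double lat, double lon) {
        return ByteBuffer.allocate(8).putLong(interleave(lat, lon)).array();
    }
}
```

Calling GeoKeySketch.toKey(lat, lon) yields an 8-byte row key. Points that are
close on the map tend to share key prefixes, which is what makes the format
attractive for range scans.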


Wouldn't we get the same by providing a simple set of libraries (à la
orderly, plus other HBase-useful things) while still giving access to the
underlying byte array? Perhaps a nullable key type in that lib makes sense
if lots of people need it, and it would be nice to have standard libraries
so tools could interop much more easily.
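
To make the trade-off concrete, here is a rough sketch of what a plain and a
nullable int32 key type could look like in such a library. Nothing here is an
existing HBase or orderly API; the class, the method names, and the 0x00/0x01
marker bytes are assumptions for illustration. The plain codec stays at 4 bytes
and flips the sign bit so that unsigned byte order matches signed integer
order. The nullable codec spends one extra leading byte, with 0x00 for null so
that null sorts before every value.

```java
import java.nio.ByteBuffer;

// Hypothetical codecs, sketched for illustration only.
final class Int32CodecSketch {

    static final byte NULL_MARKER = 0x00;   // assumed marker: sorts first
    static final byte VALUE_MARKER = 0x01;  // assumed marker: precedes a real value

    // Fixed width, no nulls: flip the sign bit so negatives sort before positives.
    static byte[] encodeInt32(int v) {
        return ByteBuffer.allocate(4).putInt(v ^ 0x80000000).array();
    }

    static int decodeInt32(byte[] b) {
        return ByteBuffer.wrap(b).getInt() ^ 0x80000000;
    }

    // Nullable: one marker byte, followed by the 4-byte value when present.
    static byte[] encodeNullableInt32(Integer v) {
        if (v == null) {
            return new byte[] { NULL_MARKER };
        }
        return ByteBuffer.allocate(5).put(VALUE_MARKER).putInt(v ^ 0x80000000).array();
    }

    static Integer decodeNullableInt32(byte[] b) {
        if (b[0] == NULL_MARKER) {
            return null;
        }
        return ByteBuffer.wrap(b, 1, 4).getInt() ^ 0x80000000;
    }

    // Unsigned lexicographic comparison, the order HBase applies to row keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

Both encoders return the raw byte[], so callers keep full access to the
underlying bytes. The nullable form keeps the full int range but is no longer
fixed width (1 byte for null, 5 otherwise), which is the cost being debated in
the quoted messages.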
-------------------
Jesse Yates
@jesse_yates
jyates.github.com


On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <[email protected]> wrote:

> Ah, I didn't even realize SQL allowed null key parts.  Maybe a goal of the
> interfaces should be to provide first-class support for custom user types
> in addition to the standard ones included.  Part of the power of HBase's
> plain byte[] keys is that users can concoct the perfect key for their data
> type.  For example, I have a lot of geographic data where I interleave
> latitude/longitude bits into a sortable 64-bit value that would probably
> never be included in a standard library.
>
>
> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <[email protected]> wrote:
>
> > I think having Int32 and NullableInt32 would support minimum overhead, as
> > well as allowing SQL semantics.
> >
> >
> > On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <[email protected]> wrote:
> >
> > > Furthermore, is it more important to support null values than to squeeze
> > > all representations into minimum size (4 bytes for int32, etc.)?
> > > On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <[email protected]> wrote:
> > >
> > > > On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <[email protected]> wrote:
> > > >
> > > >> From the SQL perspective, handling null is important.
> > > >
> > > >
> > > > > From your perspective, it is critical to support NULLs, even at the
> > > > > expense of fixed-width encodings or of representing the full range
> > > > > of values. That is, you'd rather be able to represent NULL than
> > > > > -2^31?
> > > >
> > > > On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
> > > >>
> > > >>> Thanks for the thoughtful response (and code!).
> > > >>>
> > > > >>> I'm thinking I will press forward with a base implementation that
> > > > >>> does not support nulls. The idea is to provide an extensible set of
> > > > >>> interfaces, so I think this will not box us into a corner later. That
> > > > >>> is, a mirroring package could be implemented that supports null values
> > > > >>> and accepts the relevant trade-offs.
> > > >>>
> > > >>> Thanks,
> > > >>> Nick
> > > >>>
> > > >>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <[email protected]>
> > > >>> wrote:
> > > >>>
> > > >>>  I spent some time this weekend extracting bits of our
> serialization
> > > >>>> code to
> > > >>>> a public github repo at http://github.com/hotpads/**data-tools<
> > > http://github.com/hotpads/data-tools>
> > > >>>> .
> > > >>>>   Contributions are welcome - i'm sure we all have this stuff
> laying
> > > >>>> around.
> > > >>>>
> > > > >>>> You can see I've bumped into the NULL problem in a few places:
> > > > >>>>
> > > > >>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
> > > > >>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
> > > >>>>
> > > > >>>> Looking back, I think my latest opinion on the topic is to reject
> > > > >>>> nullability as the rule since it can cause unexpected behavior and
> > > > >>>> confusion.  It's cleaner to provide a wrapper class (so both
> > > > >>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
> > > > >>>> behavior, and costs a little more in performance.  If the user can't
> > > > >>>> find a pre-made wrapper class, it's not very difficult for each user
> > > > >>>> to provide their own interpretation of null and check for it
> > > > >>>> themselves.
> > > >>>>
> > > > >>>> If you reject nullability, the question becomes what to do in
> > > > >>>> situations where you're implementing existing interfaces that accept
> > > > >>>> nullable params.  The LongArrayList above implements List<Long> which
> > > > >>>> requires an add(Long) method.  In the above implementation I chose to
> > > > >>>> swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
> > > > >>>> force the user to make that swap and then throw
> > > > >>>> IllegalArgumentException if they pass null.
> > > >>>>
> > > >>>>
> > > > >>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <[email protected]> wrote:
> > > > >>>>
> > > > >>>>> Hmmm... good question.
> > > >>>>>
> > > > >>>>> I think that fixed-width support is important for a great many
> > > > >>>>> rowkey constructs, so I'd rather see something like losing MIN_VALUE
> > > > >>>>> and keeping fixed width.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <[email protected]> wrote:
> > > >>>>>
> > > >>>>>  Heya,
> > > >>>>>>
> > > >>>>>> Thinking about data types and serialization. I think null
> support
> > is
> > > >>>>>> an
> > > >>>>>> important characteristic for the serialized representations,
> > > >>>>>> especially
> > > >>>>>> when considering the compound type. However, doing so in
> directly
> > > >>>>>> incompatible with fixed-width representations for numerics. For
> > > >>>>>>
> > > >>>>> instance,
> > > >>>>
> > > >>>>> if we want to have a fixed-width signed long stored on 8-bytes,
> > where
> > > >>>>>> do
> > > >>>>>> you put null? float and double types can cheat a little by
> folding
> > > >>>>>> negative
> > > >>>>>> and positive NaN's into a single representation (this isn't
> > strictly
> > > >>>>>> correct!), leaving a place to represent null. In the long
> example
> > > >>>>>> case,
> > > >>>>>> the
> > > >>>>>> obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by
> > one.
> > > >>>>>> This
> > > >>>>>> will allocate an additional encoding which can be used for null.
> > My
> > > >>>>>> experience working with scientific data, however, makes me wince
> > at
> > > >>>>>> the
> > > >>>>>> idea.
> > > >>>>>>
> > > > >>>>>> The variable-width encodings have it a little easier. There's already
> > > > >>>>>> enough going on that it's simpler to make room.
> > > >>>>>>
> > > > >>>>>> Remember, the final goal is to support order-preserving
> > > > >>>>>> serialization. This imposes some limitations on our encoding
> > > > >>>>>> strategies. For instance, it's not enough to simply encode null, it
> > > > >>>>>> really needs to be encoded as 0x00 so as to sort lexicographically
> > > > >>>>>> earlier than any other value.
> > > >>>>>>
> > > >>>>>> What do you think? Any ideas, experiences, etc?
> > > >>>>>>
> > > >>>>>> Thanks,
> > > >>>>>> Nick
> > > >>>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>
> > > >
> > >
> >
>
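
For reference, the fixed-width option discussed in the quoted thread (give up
MIN_VALUE, keep 8 bytes) can be sketched roughly as follows. This is an
illustration under stated assumptions, not code from HBase or from the
data-tools repo: the class and method names are made up, and the choice to
throw when a caller passes Long.MIN_VALUE follows Matt's suggestion of forcing
the caller to deal with the reserved value explicitly.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-width codec: 8 bytes always, null takes the slot of Long.MIN_VALUE.
final class FixedWidthLongSketch {

    // With the sign bit flipped, Long.MIN_VALUE would encode as eight 0x00 bytes.
    // This sketch reserves that pattern for null, so null sorts before every value.
    static byte[] encode(Long v) {
        if (v == null) {
            return new byte[8];   // all zeros
        }
        if (v == Long.MIN_VALUE) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for null");
        }
        return ByteBuffer.allocate(8).putLong(v ^ Long.MIN_VALUE).array();
    }

    static Long decode(byte[] b) {
        long raw = ByteBuffer.wrap(b).getLong();
        if (raw == 0L) {
            return null;          // the reserved all-zero pattern
        }
        return raw ^ Long.MIN_VALUE;
    }
}
```

The smallest representable value, Long.MIN_VALUE + 1, encodes with a final
byte of 0x01, so it still sorts after the all-zero null. The price is exactly
the one Nick winces at: one legitimate value of the range can no longer be
stored.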
