On Thu, Apr 4, 2013 at 7:18 PM, James Taylor <jtay...@salesforce.com> wrote:
> Would it make sense to clean up the APIs a bit and post just the type
> system code somewhere to give us something to poke holes at?

That could be useful. I've been experimenting with implementations as I
update the spec doc and pushing as I go to
https://github.com/ndimiduk/serialization-play. I can make you a
collaborator or you can host your own repository, as you prefer.

> On 04/04/2013 06:49 PM, Nick Dimiduk wrote:
>
>> On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtay...@salesforce.com>
>> wrote:
>>
>>> Maybe if we can keep nullability separate from the
>>> serialization/deserialization, we can come up with a solution that works?
>>
>> I think implied null could work, but let's build out the matrix. I see two
>> kinds of types: fixed- and variable-width. These types are used in two
>> scenarios: on their own or as part of a compound type.
>>
>> A fixed-width type used standalone can infer null from absence of a value.
>> When used in a compound type, absence isn't enough to indicate null unless
>> it's the last value in the sequence. To support a null field in the middle
>> of the compound type, the compound type is forced to explicitly mark the
>> field as null. The only solution I can think of (without sacrificing the
>> full value range, per my original question) is to write the full type width
>> bytes, followed by an isNull byte. Thus, for example, the INT type consumes
>> 4 bytes when serialized stand-alone, but 5 bytes when composed.
>>
>> James, how does Phoenix handle a null fixed-width rowkey component? I don't
>> see that implemented in PDataType enum.
>>
>> Variable-width types used standalone are simple enough because HBase handles
>> arbitrary length byte[]'s everywhere. Variable-width in composite is a
>> problem. Phoenix forces these values to only appear as the last position in
>> the composite, as I understand it. Orderly provides explicit null and
>> termination bytes by taking advantage of a feature of UTF-8 encoding.
>> Support for bytes is equally ugly (but clever) in that byte digits are
>> encoded in BCD. Both of these approaches slightly bloat the serialized
>> representation over the natural representation, but they allow the
>> variable-length types to be used anywhere within the compound type. As an
>> added bonus regarding code maintainability, their serialization is entirely
>> self-contained within the type. That's in contrast to the fixed-width type
>> implementation described above, where null is explicitly encoded by the
>> compound type.
>>
>> My opinion is the computational and storage overhead imposed by Orderly's
>> implementation is worth the trade-off in flexibility in user consumption.
>> Correct me if I'm wrong James, but you're saying, from your experience with
>> Phoenix, users are willing to work within that constraint?
>>
>> Thanks,
>> Nick
>>
>>> On 04/01/2013 11:29 PM, Jesse Yates wrote:
>>>
>>>> Actually, that isn't all that far-fetched of a format Matt - pretty common
>>>> anytime anyone wants to do sortable lat/long (*cough* three letter
>>>> agencies *cough*).
>>>>
>>>> Wouldn't we get the same by providing a simple set of libraries (ala
>>>> orderly + other HBase useful things) and then still give access to the
>>>> underlying byte array? Perhaps a nullable key type in that lib makes sense
>>>> if lots of people need it and it would be nice to have standard libraries
>>>> so tools could interop much more easily.
>>>> -------------------
>>>> Jesse Yates
>>>> @jesse_yates
>>>> jyates.github.com
>>>>
>>>> On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan <mcor...@hotpads.com> wrote:
>>>>
>>>>> Ah, I didn't even realize sql allowed null key parts. Maybe a goal of the
>>>>> interfaces should be to provide first-class support for custom user types
>>>>> in addition to the standard ones included.
>>>>> Part of the power of hbase's plain byte[] keys is that users can concoct
>>>>> the perfect key for their data type. For example, I have a lot of
>>>>> geographic data where I interleave latitude/longitude bits into a
>>>>> sortable 64 bit value that would probably never be included in a
>>>>> standard library.
>>>>>
>>>>> On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <enis....@gmail.com> wrote:
>>>>>
>>>>>> I think having Int32, and NullableInt32 would support minimum overhead,
>>>>>> as well as allowing SQL semantics.
>>>>>>
>>>>>> On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>>>>
>>>>>>> Furthermore, is it more important to support null values than squeeze
>>>>>>> all representations into minimum size (4-bytes for int32, &c.)?
>>>>>>>
>>>>>>> On Apr 1, 2013 4:41 PM, "Nick Dimiduk" <ndimi...@gmail.com> wrote:
>>>>>>>
>>>>>>>> On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtay...@salesforce.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> From the SQL perspective, handling null is important.
>>>>>>>>
>>>>>>>> From your perspective, it is critical to support NULLs, even at the
>>>>>>>> expense of fixed-width encodings at all or supporting representation
>>>>>>>> of a full range of values. That is, you'd rather be able to represent
>>>>>>>> NULL than -2^31?
>>>>>>>>
>>>>>>>>> On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the thoughtful response (and code!).
>>>>>>>>>>
>>>>>>>>>> I'm thinking I will press forward with a base implementation that does
>>>>>>>>>> not support nulls. The idea is to provide an extensible set of
>>>>>>>>>> interfaces, so I think this will not box us into a corner later.
>>>>>>>>>> That is, a mirroring package could be implemented that supports null
>>>>>>>>>> values and accepts the relevant trade-offs.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcor...@hotpads.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I spent some time this weekend extracting bits of our serialization
>>>>>>>>>>> code to a public github repo at http://github.com/hotpads/data-tools .
>>>>>>>>>>> Contributions are welcome - I'm sure we all have this stuff laying
>>>>>>>>>>> around.
>>>>>>>>>>>
>>>>>>>>>>> You can see I've bumped into the NULL problem in a few places:
>>>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
>>>>>>>>>>> * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
>>>>>>>>>>>
>>>>>>>>>>> Looking back, I think my latest opinion on the topic is to reject
>>>>>>>>>>> nullability as the rule since it can cause unexpected behavior and
>>>>>>>>>>> confusion. It's cleaner to provide a wrapper class (so both
>>>>>>>>>>> LongArrayList plus NullableLongArrayList) that explicitly defines the
>>>>>>>>>>> behavior, and costs a little more in performance. If the user can't
>>>>>>>>>>> find a pre-made wrapper class, it's not very difficult for each user
>>>>>>>>>>> to provide their own interpretation of null and check for it
>>>>>>>>>>> themselves.
>>>>>>>>>>>
>>>>>>>>>>> If you reject nullability, the question becomes what to do in
>>>>>>>>>>> situations where you're implementing existing interfaces that accept
>>>>>>>>>>> nullable params. The LongArrayList above implements List<Long> which
>>>>>>>>>>> requires an add(Long) method. In the above implementation I chose to
>>>>>>>>>>> swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
>>>>>>>>>>> force the user to make that swap and then throw
>>>>>>>>>>> IllegalArgumentException if they pass null.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil
>>>>>>>>>>> <doug.m...@explorysmedical.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hmmm... good question.
>>>>>>>>>>>>
>>>>>>>>>>>> I think that fixed width support is important for a great many rowkey
>>>>>>>>>>>> constructs cases, so I'd rather see something like losing MIN_VALUE
>>>>>>>>>>>> and keeping fixed width.
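
The option Doug describes above (lose MIN_VALUE, keep fixed width) might look roughly like the sketch below. This is only an illustration under stated assumptions, not code from data-tools, Orderly, or Phoenix: the class NullableFixedLong and its methods are hypothetical names, the sign-bit flip is one common way to make big-endian bytes sort like signed longs, and rejecting the reserved value with IllegalArgumentException is a choice made here for the example.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration; not an existing HBase or Phoenix API. */
final class NullableFixedLong {
    static final int WIDTH = Long.BYTES; // every encoding is exactly 8 bytes

    /** Encodes a nullable long; null takes over the slot of Long.MIN_VALUE. */
    static byte[] encode(Long value) {
        if (value != null && value == Long.MIN_VALUE) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for null");
        }
        long v = (value == null) ? Long.MIN_VALUE : value;
        // Flipping the sign bit makes the big-endian bytes sort like signed longs.
        // Long.MIN_VALUE turns into eight 0x00 bytes, so null sorts first.
        return ByteBuffer.allocate(WIDTH).putLong(v ^ Long.MIN_VALUE).array();
    }

    /** Reverses encode(); the reserved slot decodes back to null. */
    static Long decode(byte[] bytes) {
        long v = ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
        return (v == Long.MIN_VALUE) ? null : v;
    }

    /** Unsigned lexicographic comparison, the order applied to byte[] rowkeys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

With this layout the width never changes and null sorts ahead of every value, at the cost of one unrepresentable long, which is exactly the trade-off being debated in the thread.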
>>>>>>>>>>>>
>>>>>>>>>>>> On 4/1/13 2:00 PM, "Nick Dimiduk" <ndimi...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Heya,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thinking about data types and serialization. I think null support is
>>>>>>>>>>>>> an important characteristic for the serialized representations,
>>>>>>>>>>>>> especially when considering the compound type. However, doing so is
>>>>>>>>>>>>> directly incompatible with fixed-width representations for numerics.
>>>>>>>>>>>>> For instance, if we want to have a fixed-width signed long stored on
>>>>>>>>>>>>> 8-bytes, where do you put null? float and double types can cheat a
>>>>>>>>>>>>> little by folding negative and positive NaN's into a single
>>>>>>>>>>>>> representation (this isn't strictly correct!), leaving a place to
>>>>>>>>>>>>> represent null. In the long example case, the obvious choice is to
>>>>>>>>>>>>> reduce MAX_VALUE or increase MIN_VALUE by one. This will allocate an
>>>>>>>>>>>>> additional encoding which can be used for null. My experience
>>>>>>>>>>>>> working with scientific data, however, makes me wince at the idea.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The variable-width encodings have it a little easier. There's already
>>>>>>>>>>>>> enough going on that it's simpler to make room.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Remember, the final goal is to support order-preserving
>>>>>>>>>>>>> serialization.
>>>>>>>>>>>>> This imposes some limitations on our encoding strategies. For
>>>>>>>>>>>>> instance, it's not enough to simply encode null, it really needs to
>>>>>>>>>>>>> be encoded as 0x00 so as to sort lexicographically earlier than any
>>>>>>>>>>>>> other value.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you think? Any ideas, experiences, etc?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Nick
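
The requirement in the original message, that null be encoded as 0x00 so it sorts before every other value, can be illustrated with a marker-byte layout that keeps the full long range at the cost of one extra byte. This is a sketch under assumptions rather than anything proposed verbatim in the thread: MarkedLong, NULL_MARKER and VALUE_MARKER are hypothetical names, and putting the marker in front of the value (instead of behind it) is a choice made for this example so that the marker decides the ordering.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch; the names and byte layout are assumptions, not an existing API. */
final class MarkedLong {
    static final byte NULL_MARKER = 0x00;  // sorts before any present value
    static final byte VALUE_MARKER = 0x01; // prefixes every non-null value

    /** Null becomes the single byte 0x00; a value becomes 0x01 plus 8 bytes. */
    static byte[] encode(Long value) {
        if (value == null) {
            return new byte[] { NULL_MARKER };
        }
        return ByteBuffer.allocate(1 + Long.BYTES)
                .put(VALUE_MARKER)
                .putLong(value ^ Long.MIN_VALUE) // sign-bit flip keeps signed order
                .array();
    }

    /** Reverses encode(), including Long.MIN_VALUE, which stays representable. */
    static Long decode(byte[] bytes) {
        if (bytes[0] == NULL_MARKER) {
            return null;
        }
        return ByteBuffer.wrap(bytes, 1, Long.BYTES).getLong() ^ Long.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison of two encoded values. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

Here -2^63 remains a legal value and null still sorts first, but the encoding is no longer the minimum 8 bytes, which is the other side of the question raised in the thread.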