My two bits:

1) I support making 64-byte alignment the default. We can always
retrofit the metadata later with a different alignment type, but in
the absence of such metadata, 512 bits can be assumed. I realize this
will have bad optics with small arrays (a lot of unused bytes), but
that's okay.

2) I also support using int8_t for the union ordinal type number.
Since we permit unions-of-unions, I expect very large unions will be
specialized enough that creating union "namespaces" to increase the
number of effective union types will be an acceptable compromise (as a
storage and performance win for the vast majority of use cases). We
could always add a "LARGE_UNION" primitive type, later, if it becomes
enough of a problem.

On Sat, Apr 9, 2016 at 9:29 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> An additional data-point, it looks like Apache Hive also uses one byte
> for unions: 
> https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryUnion.java
>
> On Fri, Apr 8, 2016 at 8:21 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>> I think one of Arrow's initial design goals should be simplicity of
>> implementation of the spec.   We can always make things more
>> complicated in the future.
>>
>> This leads me to prefer a fixed size.   Wes (or anyone else) in
>> practice have you seen a union of structs with more then 127 members?
>>
>> I would vote for int8_t for the types array for unions and letting
>> consumers of Arrow nest Unions at the application layer if they need
>> more slots.
>>
>>
>> On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote:
>>> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>>>
>>>>>
>>>>> > I believe this choice was primarily about simplifying the code (similar
>>>>> to why we have a n+1
>>>>> > offsets instead of just n in the list/varchar representations (even
>>>>> though n=0 is always 0)). In both
>>>>> > situations, you don't have to worry about writing special code (and a
>>>>> condition) for the boundary
>>>>> > condition inside tight loops (e.g. the last few bytes need to be handled
>>>>> differently since they
>>>>> > aren't word width).
>>>>>
>>>>> Sounds reasonable.  It might be worth illustrating this with a
>>>>> concrete example.  One scenario that this scheme seems useful for is a
>>>>> creating a new bitmap based on evaluating a predicate (i.e. all
>>>>> elements >X).  In this case would it make sense to make it a multiple
>>>>> of 16, so we can consistently use SIMD instructions for the logical
>>>>> "and" operation?
>>>>>
>>>>
>>>> Hmm... interesting thought. I'd have to look but I also recall some of the
>>>> newer stuff supporting even wider widths. What do others think?
>>>>
>>>>
>>>>> I think the spec is slightly inconsistent.  It says there is 6 bytes
>>>>> of overhead per entry but then follows: "with the smallest byte width
>>>>> capable of representing the number of types in the union."  I'm
>>>>> perfectly happy to say it is always 1, always 2, or always capped at
>>>>> 2.  I agree 32K/64K+ types is a very unlikely scenario.  We just need
>>>>> to clear up the ambiguity.
>>>>>
>>>>
>>>> Agreed. Do you want to propose an approach & patch to clarify?
>>>
>>> I can also take responsibility for the ambiguity here. My preference
>>> is to use int16_t for the types array (memory suitably aligned), but
>>> as 1 byte will be sufficient nearly all of the time, it's a slight
>>> trade-off in memory use vs. code complexity, e.g.
>>>
>>> if (children_.size() < 128) {
>>>   // types is only 1 byte
>>> } else {
>>>   // types is 2 bytes
>>> }
>>>
>>> Realistically there won't be that many affected code paths, so I'm
>>> comfortable with either choice (2-bytes always, or 1 or 2 bytes
>>> depending on the size of the union).

Reply via email to