My two bits: 1) I support making 64-byte alignment the default. We can always retrofit the metadata later with a different alignment type, but in the absence of such metadata, 512 bits can be assumed. I realize this will have bad optics with small arrays (a lot of unused bytes), but that's okay.
2) I also support using int8_t for the union ordinal type number. Since we permit unions-of-unions, I expect very large unions will be specialized enough that creating union "namespaces" to increase the number of effective union types will be an acceptable compromise (as a storage and performance win for the vast majority of use cases). We could always add a "LARGE_UNION" primitive type later if it becomes enough of a problem.

On Sat, Apr 9, 2016 at 9:29 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> An additional data point: it looks like Apache Hive also uses one byte
> for unions:
> https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryUnion.java
>
> On Fri, Apr 8, 2016 at 8:21 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>> I think one of Arrow's initial design goals should be simplicity of
>> implementation of the spec. We can always make things more
>> complicated in the future.
>>
>> This leads me to prefer a fixed size. Wes (or anyone else), in
>> practice have you seen a union of structs with more than 127 members?
>>
>> I would vote for int8_t for the types array for unions and letting
>> consumers of Arrow nest Unions at the application layer if they need
>> more slots.
>>
>>
>> On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote:
>>> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>>>
>>>>>
>>>>> > I believe this choice was primarily about simplifying the code
>>>>> > (similar to why we have n+1 offsets instead of just n in the
>>>>> > list/varchar representations (even though n=0 is always 0)). In
>>>>> > both situations, you don't have to worry about writing special
>>>>> > code (and a condition) for the boundary condition inside tight
>>>>> > loops (e.g. the last few bytes need to be handled differently
>>>>> > since they aren't word width).
>>>>>
>>>>> Sounds reasonable. It might be worth illustrating this with a
>>>>> concrete example. One scenario that this scheme seems useful for is
>>>>> creating a new bitmap based on evaluating a predicate (i.e. all
>>>>> elements >X). In this case would it make sense to make it a multiple
>>>>> of 16, so we can consistently use SIMD instructions for the logical
>>>>> "and" operation?
>>>>>
>>>>
>>>> Hmm... interesting thought. I'd have to look but I also recall some of the
>>>> newer stuff supporting even wider widths. What do others think?
>>>>
>>>>
>>>>> I think the spec is slightly inconsistent. It says there are 6 bytes
>>>>> of overhead per entry but then follows: "with the smallest byte width
>>>>> capable of representing the number of types in the union." I'm
>>>>> perfectly happy to say it is always 1, always 2, or always capped at
>>>>> 2. I agree 32K/64K+ types is a very unlikely scenario. We just need
>>>>> to clear up the ambiguity.
>>>>>
>>>>
>>>> Agreed. Do you want to propose an approach & patch to clarify?
>>>
>>> I can also take responsibility for the ambiguity here. My preference
>>> is to use int16_t for the types array (memory suitably aligned), but
>>> as 1 byte will be sufficient nearly all of the time, it's a slight
>>> trade-off in memory use vs. code complexity, e.g.
>>>
>>> if (children_.size() < 128) {
>>>   // types is only 1 byte
>>> } else {
>>>   // types is 2 bytes
>>> }
>>>
>>> Realistically there won't be that many affected code paths, so I'm
>>> comfortable with either choice (2-bytes always, or 1 or 2 bytes
>>> depending on the size of the union).
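
To make the alignment trade-off from point 1 and the bitmap sizing question in the quoted thread concrete, here is a minimal C++ sketch. The helper names (PaddedSize, BitmapBytes, WastedBytes) are hypothetical and not taken from the Arrow codebase, and the sketch assumes buffers are padded up to a multiple of the alignment, which the thread implies but does not spell out.

```cpp
#include <cstddef>

// Hypothetical helper (an assumption for illustration, not Arrow API):
// round nbytes up to the next multiple of alignment.
// alignment must be a power of two for the bit mask to be valid.
constexpr std::size_t PaddedSize(std::size_t nbytes, std::size_t alignment = 64) {
  return (nbytes + alignment - 1) & ~(alignment - 1);
}

// Bytes needed for a bitmap holding one bit per element, before any padding.
constexpr std::size_t BitmapBytes(std::size_t length) {
  return (length + 7) / 8;
}

// Unused trailing bytes once a buffer of nbytes is padded to the alignment.
constexpr std::size_t WastedBytes(std::size_t nbytes, std::size_t alignment = 64) {
  return PaddedSize(nbytes, alignment) - nbytes;
}
```

Under these assumptions, a 10-element bitmap needs 2 bytes but occupies 64 after padding, which is the "bad optics" case for small arrays. Padding to a multiple of 16 instead would waste at most 15 bytes per buffer rather than 63, at the cost of ruling out wider SIMD loads without a tail case.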