Wes, Jacques, others... Any thoughts on this? Let me know if you would like to clarify something, I think I was a little long winded. It would be good to come to a consensus one way or another.
Thanks, Micah On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <emkornfi...@gmail.com> wrote: > Hi Wes and Jacques, > > Thanks for the thorough analysis. I agree that Strings should be easy to > work with. I'm just trying to understand how making a distinct string type > defined in the memory layout spec [1] brings a lot of additional utility. > > I think of there being two distinct concerns with Arrow: > > 1. Layout - What metadata and data elements are required to represent a > specific type in a flat address space. > > 2. Manipulation - How we build interfaces for working with the memory > layout. > > With respect to Memory Layout, introducing a new string type seems to add > redundancy. As Wes noted, List<uint8 [not-null]> is sufficient to > represent the layout for strings. So the main benefits for introducing a > new memory layout for a string type is an optimization. By introducing the > new type we avoid invalid string construction (having uint_t elements > marked as null in the nested array) and to save a few bytes/extra function > call when "(de)serializing" a string column. > > With respect to manipulation, I agree, that having the right API/modeling > to treat strings as first class objects makes a lot of sense. But I don't > think that the specification needs to explicitly make allowances for it. > Once you have constructed a Java/C++ wrapper around the memory layout you > can choose to expose the right convenience APIs through OO abstraction. > The construction of the correct object wrapper is governed by Metadata > defined in [2] and an understanding of how the logical type maps to the > appropriate memory layout. At the moment metadata doesn't specify any sort > of class hierarchy which I believe is the correct thing to do from a > specification perspective. > > The C++ implementation currently has StringArrays inheriting from a > ListArrays which was an implementation convenience and something we should > revisit (I agree with Wes's point on not relying on C++'s type system for > casting). > The primary argument for changing the existing implementation seems to be > that strings should be considered "non-nested" types. Whether strings are > nested or not seems to fall squarely into the manipulation concern (except > for the optimizations mentioned above) and therefore, IMO, an > implementation detail. When thinking about how this plays out in code > I imagine a visitor pattern. I've provided some pseudo-code below for two > possible visitor classes make StringArrays first class objects but wouldn't > require updates to the specification. > > I've tried to think where testing a particular object for "nested"-ness > makes sense by itself and couldn't come up with something off the top of my > head. It seems once you determine an Array is non-nested you still want to > test for exact primitive type you are dealing with. > > Given these points I'm still ambivalent about adding a new string/binary > type to the spec. It would be an improvement but it seems like a somewhat > minor improvement. If people can provide stronger use-cases for adding the > new type I'd be less ambivalent, but at the moment this seems like more of > an implementation concern. > > Thanks, > Micah > > // Visitor patterns for arrays, that do not require any updates to the > memory layout. > class ClassVisitor { > void visit(Int32Array ); > void visit(UInt32Array ); > void visit(DoubleArray ); > void visit(ListArray ); > void visit(StringArray ); // if we changed the hierarchy, this would > be sufficient to treat strings as a first class type > // Other types elided > } > > or > > class BufferVisitor { // type disambiguation happens by calling the > correctly > // overloaded method > void visit_numeric(TypeMedata, null_bitmap, value_buffer); > void visit_list(TypeMedata, null_bitmap, offset_buffer, Array > nested_type); > void visit_string(TypeMetadata, null_bitmap, offset_buffer, > byte_buffer); // sufficient for treating string types as non-nested. > // Other types elided. > } > > [1] https://github.com/apache/arrow/blob/master/format/Layout.md > [2] https://github.com/apache/arrow/blob/master/format/Message.fbs > >