Hi, Agreed. To paraphrase/complement what has been said: The types in format/Message.fbs [1] are "Logical types" or "user facing types", close to SQL types (they include String, Timestamp, Decimal, ...) and are related to Parquet's logical types [2][3]. For each of those types there's a corresponding physical layout that is formally specified (example discussed here: String => List<UInt8-not null>). I'm going to open a couple of JIRA's to finalise the types and clarify the layout.
[1] https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L63 [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md [3] https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L48 Julien On Tue, Aug 9, 2016 at 4:20 PM, Wes McKinney <wesmck...@gmail.com> wrote: > hi Micah > > I'm sorry for dropping the ball on this discussion. copying Julien as > he's been looking at the metadata recently. > > My thinking is that we should indicate in the format document that the > String and Binary logical types, as a matter of cross-implementation > convention, will have List<UInt8-not null> memory layout. > > In the C++ library at least, we can collapse the class structure to > make BinaryArray and StringArray not a subclass of ListArray, > factoring out common code that can be reused into helper inline > functions. > > Class hierarchy aside the main impact is adding entries to the Type > union in the Flatbuffers metadata > https://github.com/apache/arrow/blob/master/format/Message.fbs#L63 > > Accordingly, in the metadata and in RPC/IPC scenarios, binary/string > would be a single array unit in the buffer stream and flattened Field > metadata rather than nested types (2 array units as they are > presently). > > Separately, I am very interested in discussing a form of logical > Binary/StringArray in the C++ implementation that is internally > dictionary encoded. I'm proposing this as a possible new UTF-8 > representation for pandas in the future: > https://wesm.github.io/pandas2-design/strings.html# > possible-solution-new-non-numpy-string-memory-layout > > Hopefully this isn't too incoherent, but it would be good to arrive at > some conclusion in this discussion if we need to implement the > changes. > > Thanks > Wes > > On Tue, Jul 26, 2016 at 10:09 PM, Micah Kornfield <emkornfi...@gmail.com> > wrote: > > Wes, Jacques, others... > > > > Any thoughts on this? Let me know if you would like to clarify > something, > > I think I was a little long winded. It would be good to come to a > > consensus one way or another. > > > > Thanks, > > Micah > > > > On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <emkornfi...@gmail.com> > > wrote: > > > >> Hi Wes and Jacques, > >> > >> Thanks for the thorough analysis. I agree that Strings should be easy > to > >> work with. I'm just trying to understand how making a distinct string > type > >> defined in the memory layout spec [1] brings a lot of additional > utility. > >> > >> I think of there being two distinct concerns with Arrow: > >> > >> 1. Layout - What metadata and data elements are required to represent a > >> specific type in a flat address space. > >> > >> 2. Manipulation - How we build interfaces for working with the memory > >> layout. > >> > >> With respect to Memory Layout, introducing a new string type seems to > add > >> redundancy. As Wes noted, List<uint8 [not-null]> is sufficient to > >> represent the layout for strings. So the main benefits for introducing > a > >> new memory layout for a string type is an optimization. By introducing > the > >> new type we avoid invalid string construction (having uint_t elements > >> marked as null in the nested array) and to save a few bytes/extra > function > >> call when "(de)serializing" a string column. > >> > >> With respect to manipulation, I agree, that having the right > API/modeling > >> to treat strings as first class objects makes a lot of sense. But I > don't > >> think that the specification needs to explicitly make allowances for it. > >> Once you have constructed a Java/C++ wrapper around the memory layout > you > >> can choose to expose the right convenience APIs through OO abstraction. > >> The construction of the correct object wrapper is governed by Metadata > >> defined in [2] and an understanding of how the logical type maps to the > >> appropriate memory layout. At the moment metadata doesn't specify any > sort > >> of class hierarchy which I believe is the correct thing to do from a > >> specification perspective. > >> > >> The C++ implementation currently has StringArrays inheriting from a > >> ListArrays which was an implementation convenience and something we > should > >> revisit (I agree with Wes's point on not relying on C++'s type system > for > >> casting). > >> The primary argument for changing the existing implementation seems to > be > >> that strings should be considered "non-nested" types. Whether strings > are > >> nested or not seems to fall squarely into the manipulation concern > (except > >> for the optimizations mentioned above) and therefore, IMO, an > >> implementation detail. When thinking about how this plays out in > code > >> I imagine a visitor pattern. I've provided some pseudo-code below for > two > >> possible visitor classes make StringArrays first class objects but > wouldn't > >> require updates to the specification. > >> > >> I've tried to think where testing a particular object for "nested"-ness > >> makes sense by itself and couldn't come up with something off the top > of my > >> head. It seems once you determine an Array is non-nested you still > want to > >> test for exact primitive type you are dealing with. > >> > >> Given these points I'm still ambivalent about adding a new string/binary > >> type to the spec. It would be an improvement but it seems like a > somewhat > >> minor improvement. If people can provide stronger use-cases for adding > the > >> new type I'd be less ambivalent, but at the moment this seems like more > of > >> an implementation concern. > >> > >> Thanks, > >> Micah > >> > >> // Visitor patterns for arrays, that do not require any updates to the > >> memory layout. > >> class ClassVisitor { > >> void visit(Int32Array ); > >> void visit(UInt32Array ); > >> void visit(DoubleArray ); > >> void visit(ListArray ); > >> void visit(StringArray ); // if we changed the hierarchy, this would > >> be sufficient to treat strings as a first class type > >> // Other types elided > >> } > >> > >> or > >> > >> class BufferVisitor { // type disambiguation happens by calling the > >> correctly > >> // overloaded method > >> void visit_numeric(TypeMedata, null_bitmap, value_buffer); > >> void visit_list(TypeMedata, null_bitmap, offset_buffer, Array > >> nested_type); > >> void visit_string(TypeMetadata, null_bitmap, offset_buffer, > >> byte_buffer); // sufficient for treating string types as non-nested. > >> // Other types elided. > >> } > >> > >> [1] https://github.com/apache/arrow/blob/master/format/Layout.md > >> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs > >> > >> > -- Julien