Hi Wes and Jacques, Thanks for the thorough analysis. I agree that Strings should be easy to work with. I'm just trying to understand how making a distinct string type defined in the memory layout spec [1] brings a lot of additional utility.
I think of there being two distinct concerns with Arrow: 1. Layout - What metadata and data elements are required to represent a specific type in a flat address space. 2. Manipulation - How we build interfaces for working with the memory layout. With respect to Memory Layout, introducing a new string type seems to add redundancy. As Wes noted, List<uint8 [not-null]> is sufficient to represent the layout for strings. So the main benefits for introducing a new memory layout for a string type is an optimization. By introducing the new type we avoid invalid string construction (having uint_t elements marked as null in the nested array) and to save a few bytes/extra function call when "(de)serializing" a string column. With respect to manipulation, I agree, that having the right API/modeling to treat strings as first class objects makes a lot of sense. But I don't think that the specification needs to explicitly make allowances for it. Once you have constructed a Java/C++ wrapper around the memory layout you can choose to expose the right convenience APIs through OO abstraction. The construction of the correct object wrapper is governed by Metadata defined in [2] and an understanding of how the logical type maps to the appropriate memory layout. At the moment metadata doesn't specify any sort of class hierarchy which I believe is the correct thing to do from a specification perspective. The C++ implementation currently has StringArrays inheriting from a ListArrays which was an implementation convenience and something we should revisit (I agree with Wes's point on not relying on C++'s type system for casting). The primary argument for changing the existing implementation seems to be that strings should be considered "non-nested" types. Whether strings are nested or not seems to fall squarely into the manipulation concern (except for the optimizations mentioned above) and therefore, IMO, an implementation detail. When thinking about how this plays out in code I imagine a visitor pattern. I've provided some pseudo-code below for two possible visitor classes make StringArrays first class objects but wouldn't require updates to the specification. I've tried to think where testing a particular object for "nested"-ness makes sense by itself and couldn't come up with something off the top of my head. It seems once you determine an Array is non-nested you still want to test for exact primitive type you are dealing with. Given these points I'm still ambivalent about adding a new string/binary type to the spec. It would be an improvement but it seems like a somewhat minor improvement. If people can provide stronger use-cases for adding the new type I'd be less ambivalent, but at the moment this seems like more of an implementation concern. Thanks, Micah // Visitor patterns for arrays, that do not require any updates to the memory layout. class ClassVisitor { void visit(Int32Array ); void visit(UInt32Array ); void visit(DoubleArray ); void visit(ListArray ); void visit(StringArray ); // if we changed the hierarchy, this would be sufficient to treat strings as a first class type // Other types elided } or class BufferVisitor { // type disambiguation happens by calling the correctly // overloaded method void visit_numeric(TypeMedata, null_bitmap, value_buffer); void visit_list(TypeMedata, null_bitmap, offset_buffer, Array nested_type); void visit_string(TypeMetadata, null_bitmap, offset_buffer, byte_buffer); // sufficient for treating string types as non-nested. // Other types elided. } [1] https://github.com/apache/arrow/blob/master/format/Layout.md [2] https://github.com/apache/arrow/blob/master/format/Message.fbs