Hi Wes and Jacques,

Thanks for the thorough analysis.  I agree that Strings should be easy to
work with.  I'm just trying to understand how making a distinct string type
defined in the memory layout spec [1] brings a lot of additional utility.

I think of there being two distinct concerns with Arrow:

1.  Layout - What metadata and data elements are required to represent a
specific type in a flat address space.

2.  Manipulation - How we build interfaces for working with the memory
layout.

With respect to Memory Layout, introducing a new string type seems to add
redundancy.  As Wes noted, List<uint8 [not-null]> is sufficient to
represent the layout for strings.  So the main benefits for introducing a
new memory layout for a string type is an optimization.  By introducing the
new type we avoid invalid string construction (having uint_t elements
marked as null in the nested array) and to save a few bytes/extra function
call when "(de)serializing" a string column.

With respect to manipulation, I agree, that having the right API/modeling
to treat strings as first class objects makes a lot of sense.   But I don't
think that the specification needs to explicitly make allowances for it.
Once you have constructed a Java/C++ wrapper around the memory layout you
can choose to expose the right convenience APIs through OO abstraction.
The construction of the correct object wrapper is governed by Metadata
defined in [2] and an understanding of how the logical type maps to the
appropriate memory layout.  At the moment metadata doesn't specify any sort
of class hierarchy which I believe is the correct thing to do from a
specification perspective.

The C++ implementation currently has StringArrays inheriting from a
ListArrays which was an implementation convenience and something we should
revisit (I agree with Wes's point on not relying on  C++'s type system for
casting).
The primary argument for changing the existing implementation seems to be
that strings should be considered "non-nested" types.  Whether strings are
nested or not seems to fall squarely into the manipulation concern (except
for the optimizations mentioned above) and therefore, IMO, an
implementation detail.     When thinking about how this plays out in code I
imagine a visitor pattern.  I've provided some pseudo-code below for two
possible visitor classes make StringArrays first class objects but wouldn't
require updates to the specification.

I've tried to think where testing a particular object for "nested"-ness
makes sense by itself and couldn't come up with something off the top of my
head.  It seems once you determine an Array is non-nested you still want to
test for exact primitive type you are dealing with.

Given these points I'm still ambivalent about adding a new string/binary
type to the spec. It would be an improvement but it seems like a somewhat
minor improvement.  If people can provide stronger use-cases for adding the
new type I'd be less ambivalent, but at the moment this seems like more of
an implementation concern.

Thanks,
Micah

// Visitor patterns for arrays, that do not require any updates to the
memory layout.
class ClassVisitor {
    void visit(Int32Array );
    void visit(UInt32Array );
    void visit(DoubleArray );
    void visit(ListArray );
    void visit(StringArray ); // if we changed the hierarchy, this would be
sufficient to treat strings as a first class type
    // Other types elided
}

or

class BufferVisitor { // type disambiguation happens by calling the
correctly
                                // overloaded method
    void visit_numeric(TypeMedata, null_bitmap, value_buffer);
    void visit_list(TypeMedata, null_bitmap, offset_buffer, Array
nested_type);
    void visit_string(TypeMetadata, null_bitmap, offset_buffer,
byte_buffer); // sufficient for treating string types as non-nested.
    // Other types elided.
}

[1] https://github.com/apache/arrow/blob/master/format/Layout.md
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs

Reply via email to