hi Micah

I'm sorry for dropping the ball on this discussion. copying Julien as
he's been looking at the metadata recently.

My thinking is that we should indicate in the format document that the
String and Binary logical types, as a matter of cross-implementation
convention, will have List<UInt8-not null> memory layout.

In the C++ library at least, we can collapse the class structure to
make BinaryArray and StringArray not a subclass of ListArray,
factoring out common code that can be reused into helper inline
functions.

Class hierarchy aside the main impact is adding entries to the Type
union in the Flatbuffers metadata
https://github.com/apache/arrow/blob/master/format/Message.fbs#L63

Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
would be a single array unit in the buffer stream and flattened Field
metadata rather than nested types (2 array units as they are
presently).

Separately, I am very interested in discussing a form of logical
Binary/StringArray in the C++ implementation that is internally
dictionary encoded. I'm proposing this as a possible new UTF-8
representation for pandas in the future:
https://wesm.github.io/pandas2-design/strings.html#possible-solution-new-non-numpy-string-memory-layout

Hopefully this isn't too incoherent, but it would be good to arrive at
some conclusion in this discussion if we need to implement the
changes.

Thanks
Wes

On Tue, Jul 26, 2016 at 10:09 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> Wes, Jacques, others...
>
> Any thoughts on this?   Let me know if you would like to clarify something,
> I think I was a little long winded.  It would be good to come to a
> consensus one way or another.
>
> Thanks,
> Micah
>
> On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Wes and Jacques,
>>
>> Thanks for the thorough analysis.  I agree that Strings should be easy to
>> work with.  I'm just trying to understand how making a distinct string type
>> defined in the memory layout spec [1] brings a lot of additional utility.
>>
>> I think of there being two distinct concerns with Arrow:
>>
>> 1.  Layout - What metadata and data elements are required to represent a
>> specific type in a flat address space.
>>
>> 2.  Manipulation - How we build interfaces for working with the memory
>> layout.
>>
>> With respect to Memory Layout, introducing a new string type seems to add
>> redundancy.  As Wes noted, List<uint8 [not-null]> is sufficient to
>> represent the layout for strings.  So the main benefits for introducing a
>> new memory layout for a string type is an optimization.  By introducing the
>> new type we avoid invalid string construction (having uint_t elements
>> marked as null in the nested array) and to save a few bytes/extra function
>> call when "(de)serializing" a string column.
>>
>> With respect to manipulation, I agree, that having the right API/modeling
>> to treat strings as first class objects makes a lot of sense.   But I don't
>> think that the specification needs to explicitly make allowances for it.
>> Once you have constructed a Java/C++ wrapper around the memory layout you
>> can choose to expose the right convenience APIs through OO abstraction.
>> The construction of the correct object wrapper is governed by Metadata
>> defined in [2] and an understanding of how the logical type maps to the
>> appropriate memory layout.  At the moment metadata doesn't specify any sort
>> of class hierarchy which I believe is the correct thing to do from a
>> specification perspective.
>>
>> The C++ implementation currently has StringArrays inheriting from a
>> ListArrays which was an implementation convenience and something we should
>> revisit (I agree with Wes's point on not relying on  C++'s type system for
>> casting).
>> The primary argument for changing the existing implementation seems to be
>> that strings should be considered "non-nested" types.  Whether strings are
>> nested or not seems to fall squarely into the manipulation concern (except
>> for the optimizations mentioned above) and therefore, IMO, an
>> implementation detail.     When thinking about how this plays out in code
>> I imagine a visitor pattern.  I've provided some pseudo-code below for two
>> possible visitor classes make StringArrays first class objects but wouldn't
>> require updates to the specification.
>>
>> I've tried to think where testing a particular object for "nested"-ness
>> makes sense by itself and couldn't come up with something off the top of my
>> head.  It seems once you determine an Array is non-nested you still want to
>> test for exact primitive type you are dealing with.
>>
>> Given these points I'm still ambivalent about adding a new string/binary
>> type to the spec. It would be an improvement but it seems like a somewhat
>> minor improvement.  If people can provide stronger use-cases for adding the
>> new type I'd be less ambivalent, but at the moment this seems like more of
>> an implementation concern.
>>
>> Thanks,
>> Micah
>>
>> // Visitor patterns for arrays, that do not require any updates to the
>> memory layout.
>> class ClassVisitor {
>>     void visit(Int32Array );
>>     void visit(UInt32Array );
>>     void visit(DoubleArray );
>>     void visit(ListArray );
>>     void visit(StringArray ); // if we changed the hierarchy, this would
>> be sufficient to treat strings as a first class type
>>     // Other types elided
>> }
>>
>> or
>>
>> class BufferVisitor { // type disambiguation happens by calling the
>> correctly
>>                                 // overloaded method
>>     void visit_numeric(TypeMedata, null_bitmap, value_buffer);
>>     void visit_list(TypeMedata, null_bitmap, offset_buffer, Array
>> nested_type);
>>     void visit_string(TypeMetadata, null_bitmap, offset_buffer,
>> byte_buffer); // sufficient for treating string types as non-nested.
>>     // Other types elided.
>> }
>>
>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>
>>

Reply via email to