Re: Discussion: Should we make string/binary types first class Arrow Array types?

Julien Le Dem Wed, 10 Aug 2016 10:55:31 -0700

Hi,
Agreed.
To paraphrase/complement what has been said:
The types in format/Message.fbs [1] are "Logical types" or "user facing
types", close to SQL types (they include String, Timestamp, Decimal, ...)
and are related to Parquet's logical types [2][3].
For each of those types there's a corresponding physical layout that is
formally specified (example discussed here: String => List<UInt8-not null>).
I'm going to open a couple of JIRAs to finalise the types and clarify the
layout.
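
To make the String => List<UInt8-not null> mapping concrete, here is a
minimal C++ sketch of that layout: one offsets buffer with length + 1
entries and one contiguous buffer of non-null bytes. StringColumn,
BuildStringColumn, IsNull and GetString are hypothetical names used only
for illustration, not Arrow classes, and the validity flags are kept in a
std::vector<bool> here rather than in a packed bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical illustration type, not an Arrow class.
struct StringColumn {
  std::vector<bool> valid;       // one flag per slot (a packed bitmap in practice)
  std::vector<int32_t> offsets;  // length + 1 entries; slot i spans [offsets[i], offsets[i+1])
  std::vector<uint8_t> bytes;    // child values: UInt8, never null
};

// Builds the column from C strings; a nullptr entry marks a null slot.
StringColumn BuildStringColumn(const std::vector<const char*>& values) {
  StringColumn col;
  col.offsets.push_back(0);
  for (const char* v : values) {
    col.valid.push_back(v != nullptr);
    if (v != nullptr) {
      col.bytes.insert(col.bytes.end(), v, v + std::strlen(v));
    }
    // A null slot adds no bytes, so its offset repeats the previous one.
    col.offsets.push_back(static_cast<int32_t>(col.bytes.size()));
  }
  return col;
}

bool IsNull(const StringColumn& col, std::size_t i) { return !col.valid[i]; }

// Returns a copy of slot i; a null slot is returned as an empty string.
std::string GetString(const StringColumn& col, std::size_t i) {
  if (IsNull(col, i)) return std::string();
  const int32_t begin = col.offsets[i];
  const int32_t end = col.offsets[i + 1];
  return std::string(col.bytes.begin() + begin, col.bytes.begin() + end);
}
```

For the values "joe", null, "mark" this yields offsets 0, 3, 3, 7 and the
seven bytes of "joemark"; the null slot takes no space in the byte buffer.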


[1]
https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L63
[2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
[3]
https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L48

Julien

On Tue, Aug 9, 2016 at 4:20 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah
>
> I'm sorry for dropping the ball on this discussion. Copying Julien as
> he's been looking at the metadata recently.
>
> My thinking is that we should indicate in the format document that the
> String and Binary logical types, as a matter of cross-implementation
> convention, will have List<UInt8-not null> memory layout.
>
> In the C++ library at least, we can collapse the class structure so that
> BinaryArray and StringArray are no longer subclasses of ListArray,
> factoring out common code that can be reused into inline helper
> functions.
>
> Class hierarchy aside, the main impact is adding entries to the Type
> union in the Flatbuffers metadata:
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L63
>
> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
> would be a single array unit in the buffer stream and flattened Field
> metadata rather than nested types (2 array units as they are
> presently).
>
> Separately, I am very interested in discussing a form of logical
> Binary/StringArray in the C++ implementation that is internally
> dictionary encoded. I'm proposing this as a possible new UTF-8
> representation for pandas in the future:
> https://wesm.github.io/pandas2-design/strings.html#possible-solution-new-non-numpy-string-memory-layout
>
> Hopefully this isn't too incoherent, but it would be good to arrive at
> some conclusion in this discussion if we need to implement the
> changes.
>
> Thanks
> Wes
>
> On Tue, Jul 26, 2016 at 10:09 PM, Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> > Wes, Jacques, others...
> >
> > Any thoughts on this?  Let me know if you would like me to clarify
> > something; I think I was a little long-winded.  It would be good to come
> > to a consensus one way or another.
> >
> > Thanks,
> > Micah
> >
> > On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> Hi Wes and Jacques,
> >>
> >> Thanks for the thorough analysis.  I agree that Strings should be easy to
> >> work with.  I'm just trying to understand how making a distinct string
> >> type defined in the memory layout spec [1] brings a lot of additional
> >> utility.
> >>
> >> I think of there being two distinct concerns with Arrow:
> >>
> >> 1.  Layout - What metadata and data elements are required to represent a
> >> specific type in a flat address space.
> >>
> >> 2.  Manipulation - How we build interfaces for working with the memory
> >> layout.
> >>
> >> With respect to memory layout, introducing a new string type seems to add
> >> redundancy.  As Wes noted, List<uint8 [not-null]> is sufficient to
> >> represent the layout for strings.  So the main benefit of introducing a
> >> new memory layout for a string type is optimization.  By introducing the
> >> new type we avoid invalid string construction (having uint8_t elements
> >> marked as null in the nested array) and save a few bytes and an extra
> >> function call when "(de)serializing" a string column.
> >>
> >> With respect to manipulation, I agree that having the right API/modeling
> >> to treat strings as first class objects makes a lot of sense.  But I don't
> >> think that the specification needs to explicitly make allowances for it.
> >> Once you have constructed a Java/C++ wrapper around the memory layout you
> >> can choose to expose the right convenience APIs through OO abstraction.
> >> The construction of the correct object wrapper is governed by the metadata
> >> defined in [2] and an understanding of how the logical type maps to the
> >> appropriate memory layout.  At the moment the metadata doesn't specify any
> >> sort of class hierarchy, which I believe is the correct thing to do from a
> >> specification perspective.
> >>
> >> The C++ implementation currently has StringArray inheriting from
> >> ListArray, which was an implementation convenience and something we should
> >> revisit (I agree with Wes's point on not relying on C++'s type system for
> >> casting).
> >>
> >> The primary argument for changing the existing implementation seems to be
> >> that strings should be considered "non-nested" types.  Whether strings are
> >> nested or not seems to fall squarely into the manipulation concern (except
> >> for the optimizations mentioned above) and is therefore, IMO, an
> >> implementation detail.  When thinking about how this plays out in code
> >> I imagine a visitor pattern.  I've provided some pseudo-code below for two
> >> possible visitor classes that make StringArrays first class objects but
> >> wouldn't require updates to the specification.
> >>
> >> I've tried to think of where testing a particular object for "nested"-ness
> >> makes sense by itself and couldn't come up with something off the top of
> >> my head.  It seems that once you determine an Array is non-nested you
> >> still want to test for the exact primitive type you are dealing with.
> >>
> >> Given these points I'm still ambivalent about adding a new string/binary
> >> type to the spec.  It would be an improvement but it seems like a somewhat
> >> minor improvement.  If people can provide stronger use-cases for adding the
> >> new type I'd be less ambivalent, but at the moment this seems like more of
> >> an implementation concern.
> >>
> >> Thanks,
> >> Micah
> >>
> >> // Visitor patterns for arrays that do not require any updates to the
> >> // memory layout.
> >> class ClassVisitor {
> >>     void visit(Int32Array);
> >>     void visit(UInt32Array);
> >>     void visit(DoubleArray);
> >>     void visit(ListArray);
> >>     // If we changed the hierarchy, this would be sufficient to treat
> >>     // strings as a first class type.
> >>     void visit(StringArray);
> >>     // Other types elided.
> >> }
> >>
> >> or
> >>
> >> // Type disambiguation happens by calling the correctly overloaded method.
> >> class BufferVisitor {
> >>     void visit_numeric(TypeMetadata, null_bitmap, value_buffer);
> >>     void visit_list(TypeMetadata, null_bitmap, offset_buffer,
> >>                     Array nested_type);
> >>     // Sufficient for treating string types as non-nested.
> >>     void visit_string(TypeMetadata, null_bitmap, offset_buffer,
> >>                       byte_buffer);
> >>     // Other types elided.
> >> }
> >>
> >> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
> >> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
> >>
> >>
>
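
For reference, the ClassVisitor idea in the quoted pseudo-code can be
sketched as compilable C++ using double dispatch. This is only an
illustration under assumptions: Array, ArrayVisitor, Accept and
TypeNameVisitor are hypothetical names, the array classes carry no data,
and none of this is the actual Arrow C++ API. The point it shows is that
StringArray can derive directly from the base class (not from ListArray)
and still be handled as its own first class type.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for array classes; not the Arrow C++ API.
class Int32Array;
class ListArray;
class StringArray;

class ArrayVisitor {
 public:
  virtual ~ArrayVisitor() = default;
  virtual void Visit(const Int32Array&) = 0;
  virtual void Visit(const ListArray&) = 0;
  virtual void Visit(const StringArray&) = 0;  // strings get their own overload
};

class Array {
 public:
  virtual ~Array() = default;
  // Double dispatch: each concrete array calls the matching Visit overload.
  virtual void Accept(ArrayVisitor* visitor) const = 0;
};

class Int32Array : public Array {
 public:
  void Accept(ArrayVisitor* visitor) const override { visitor->Visit(*this); }
};

class ListArray : public Array {
 public:
  void Accept(ArrayVisitor* visitor) const override { visitor->Visit(*this); }
};

// Derives from Array directly rather than from ListArray.
class StringArray : public Array {
 public:
  void Accept(ArrayVisitor* visitor) const override { visitor->Visit(*this); }
};

// Example visitor that records which overload was selected for each array.
class TypeNameVisitor : public ArrayVisitor {
 public:
  void Visit(const Int32Array&) override { names.push_back("int32"); }
  void Visit(const ListArray&) override { names.push_back("list"); }
  void Visit(const StringArray&) override { names.push_back("string"); }
  std::vector<std::string> names;
};
```

A caller holding only const Array* pointers invokes Accept on each one and
the visitor receives the concrete type, with no casts and no change to the
memory layout specification.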



-- 
Julien
