Yes On Wed, Aug 10, 2016 at 11:24 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> I see the primary point of discussion on this to be whether String/Binary > have the same layout on the wire as List<uint8-not null> (i.e. one > Field/Node in the type tree versus two). I think what we are working > towards is a single field rather than a List field and an Int field (bit > width 8). > > On Aug 10, 2016, at 10:46 AM, Julien Le Dem <jul...@dremio.com> wrote: > > Hi, > Agreed. > To paraphrase/complement what has been said: > The types in format/Message.fbs [1] are "Logical types" or "user facing > types", close to SQL types (they include String, Timestamp, Decimal, ...) > and are related to Parquet's logical types [2][3]. > For each of those types there's a corresponding physical layout that is > formally specified (example discussed here: String => List<UInt8-not null>). > I'm going to open a couple of JIRA's to finalise the types and clarify the > layout. > > [1] https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf00 > 45532d6607/format/Message.fbs#L63 > [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md > [3] https://github.com/apache/parquet-format/blob/ > 66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/ > parquet.thrift#L48 > > Julien > > On Tue, Aug 9, 2016 at 4:20 PM, Wes McKinney <wesmck...@gmail.com> wrote: > >> hi Micah >> >> I'm sorry for dropping the ball on this discussion. copying Julien as >> he's been looking at the metadata recently. >> >> My thinking is that we should indicate in the format document that the >> String and Binary logical types, as a matter of cross-implementation >> convention, will have List<UInt8-not null> memory layout. >> >> In the C++ library at least, we can collapse the class structure to >> make BinaryArray and StringArray not a subclass of ListArray, >> factoring out common code that can be reused into helper inline >> functions. >> >> Class hierarchy aside the main impact is adding entries to the Type >> union in the Flatbuffers metadata >> https://github.com/apache/arrow/blob/master/format/Message.fbs#L63 >> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string >> would be a single array unit in the buffer stream and flattened Field >> metadata rather than nested types (2 array units as they are >> presently). >> >> Separately, I am very interested in discussing a form of logical >> Binary/StringArray in the C++ implementation that is internally >> dictionary encoded. I'm proposing this as a possible new UTF-8 >> representation for pandas in the future: >> https://wesm.github.io/pandas2-design/strings.html#possible- >> solution-new-non-numpy-string-memory-layout >> >> Hopefully this isn't too incoherent, but it would be good to arrive at >> some conclusion in this discussion if we need to implement the >> changes. >> >> Thanks >> Wes >> >> On Tue, Jul 26, 2016 at 10:09 PM, Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> > Wes, Jacques, others... >> > >> > Any thoughts on this? Let me know if you would like to clarify >> something, >> > I think I was a little long winded. It would be good to come to a >> > consensus one way or another. >> > >> > Thanks, >> > Micah >> > >> > On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <emkornfi...@gmail.com >> > >> > wrote: >> > >> >> Hi Wes and Jacques, >> >> >> >> Thanks for the thorough analysis. I agree that Strings should be easy >> to >> >> work with. I'm just trying to understand how making a distinct string >> type >> >> defined in the memory layout spec [1] brings a lot of additional >> utility. >> >> >> >> I think of there being two distinct concerns with Arrow: >> >> >> >> 1. Layout - What metadata and data elements are required to represent >> a >> >> specific type in a flat address space. >> >> >> >> 2. Manipulation - How we build interfaces for working with the memory >> >> layout. >> >> >> >> With respect to Memory Layout, introducing a new string type seems to >> add >> >> redundancy. As Wes noted, List<uint8 [not-null]> is sufficient to >> >> represent the layout for strings. So the main benefits for >> introducing a >> >> new memory layout for a string type is an optimization. By >> introducing the >> >> new type we avoid invalid string construction (having uint_t elements >> >> marked as null in the nested array) and to save a few bytes/extra >> function >> >> call when "(de)serializing" a string column. >> >> >> >> With respect to manipulation, I agree, that having the right >> API/modeling >> >> to treat strings as first class objects makes a lot of sense. But I >> don't >> >> think that the specification needs to explicitly make allowances for >> it. >> >> Once you have constructed a Java/C++ wrapper around the memory layout >> you >> >> can choose to expose the right convenience APIs through OO abstraction. >> >> The construction of the correct object wrapper is governed by Metadata >> >> defined in [2] and an understanding of how the logical type maps to the >> >> appropriate memory layout. At the moment metadata doesn't specify any >> sort >> >> of class hierarchy which I believe is the correct thing to do from a >> >> specification perspective. >> >> >> >> The C++ implementation currently has StringArrays inheriting from a >> >> ListArrays which was an implementation convenience and something we >> should >> >> revisit (I agree with Wes's point on not relying on C++'s type system >> for >> >> casting). >> >> The primary argument for changing the existing implementation seems to >> be >> >> that strings should be considered "non-nested" types. Whether strings >> are >> >> nested or not seems to fall squarely into the manipulation concern >> (except >> >> for the optimizations mentioned above) and therefore, IMO, an >> >> implementation detail. When thinking about how this plays out in >> code >> >> I imagine a visitor pattern. I've provided some pseudo-code below for >> two >> >> possible visitor classes make StringArrays first class objects but >> wouldn't >> >> require updates to the specification. >> >> >> >> I've tried to think where testing a particular object for "nested"-ness >> >> makes sense by itself and couldn't come up with something off the top >> of my >> >> head. It seems once you determine an Array is non-nested you still >> want to >> >> test for exact primitive type you are dealing with. >> >> >> >> Given these points I'm still ambivalent about adding a new >> string/binary >> >> type to the spec. It would be an improvement but it seems like a >> somewhat >> >> minor improvement. If people can provide stronger use-cases for >> adding the >> >> new type I'd be less ambivalent, but at the moment this seems like >> more of >> >> an implementation concern. >> >> >> >> Thanks, >> >> Micah >> >> >> >> // Visitor patterns for arrays, that do not require any updates to the >> >> memory layout. >> >> class ClassVisitor { >> >> void visit(Int32Array ); >> >> void visit(UInt32Array ); >> >> void visit(DoubleArray ); >> >> void visit(ListArray ); >> >> void visit(StringArray ); // if we changed the hierarchy, this >> would >> >> be sufficient to treat strings as a first class type >> >> // Other types elided >> >> } >> >> >> >> or >> >> >> >> class BufferVisitor { // type disambiguation happens by calling the >> >> correctly >> >> // overloaded method >> >> void visit_numeric(TypeMedata, null_bitmap, value_buffer); >> >> void visit_list(TypeMedata, null_bitmap, offset_buffer, Array >> >> nested_type); >> >> void visit_string(TypeMetadata, null_bitmap, offset_buffer, >> >> byte_buffer); // sufficient for treating string types as non-nested. >> >> // Other types elided. >> >> } >> >> >> >> [1] https://github.com/apache/arrow/blob/master/format/Layout.md >> >> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs >> >> >> >> >> > > > > -- > Julien > > -- Julien