See ARROW-262 On Mon, Aug 15, 2016 at 3:38 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> These IPC details we should definitely document outside of the code. > > For the String/Binary type question, I want to start a document that > explains the logical data types in Message.fbs in terms of > > - what Arrow memory layout they use (for example: Int32 uses fixed bit > width 32 bits), and String uses List<UInt8> (with the restriction that > the inner buffer must not have any nulls, and the validity bitmap is > omitted) > > - any type-specific custom logic around deconstructing or > reconstructing an Arrow container in an IPC/RPC setting. What we have > been debating in this e-mail thread is altering the appearance of > String/Binary representation in a record batch > (https://github.com/apache/arrow/blob/master/format/Message.fbs#L147) > from being identical to List<UInt8> (4 buffers in the flattened buffer > list -- 2 for the list node, 2 for the UInt8 node) to its own > "collapsed" form (3 buffers: bitmap, offsets, data). This means that > any code that is sending/receiving a record batch will need separate > code paths to handle List and String respectively (in the C++ code, we > are currently using the same code path for both) > > For example, changes like ARROW-253 > (https://github.com/apache/arrow/commit/dc01f099d966b92f4de7679b4a1caf > 97c363e08e) > would be documented outside of the code and message IDL. > > I will open one or more JIRAs and write a patch to try to close the > loop on this. > > - Wes > > On Tue, Aug 16, 2016 at 6:04 AM, Julien Le Dem <jul...@dremio.com> wrote: > > There's ARROW-258 which is about clarifying difference (if any) in > metadata > > across RPC (sockets), IPC (shared memory) and files. > > The vector layout is the same except in RPC or files they get > concatenated > > together when copied over. > > The metadata should be mostly the same (ideally the same). Buffer offsets > > are relative to the beginning of the body in the context of RPC and file > > start in files. In the context of IPC it looks like we need an extra > page id > > (from Message.fbs). Is this correct? > > > > On Mon, Aug 15, 2016 at 12:01 PM, Micah Kornfield <emkornfi...@gmail.com > > > > wrote: > >> > >> Thanks Wes, > >> This makes sense. +1 on the "Logical Types / IPC layout > >> document" is there a JIRA open for this? > >> > >> I'll open a JIRA item to change the inheritance of string/binary in the > >> C++ code base. > >> > >> Thanks, > >> Micah > >> > >> On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney <wesmck...@gmail.com> > >> wrote: > >>> > >>> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield < > emkornfi...@gmail.com> > >>> wrote: > >>> > Sorry for the late reply. > >>> > > >>> > This all sounds reasonable to me. But I'm not sure I understand > >>> > exactly > >>> > what you mean by > >>> > > >>> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string > >>> >> would be a single array unit in the buffer stream and flattened > Field > >>> >> metadata rather than nested types (2 array units as they are > >>> >> presently). > >>> > > >>> > > >>> > The way I read it this seems to me to contradict the > >>> > cross-implementation as > >>> > "List<UInt8-not null>"? > >>> > > >>> > Thanks, > >>> > Micah > >>> > > >>> > >>> I think we can resolve this by starting a "Logical Types and IPC/RPC > >>> layout" specification document. > >>> > >>> The schema metadata > >>> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is, > >>> as I understand it, strictly the domain of logical types. I believe > >>> there is some minor conflation of the notions of primitive physical > >>> types and primitive logical types. > >>> > >>> While String / Binary have identical physical layouts to List<UInt8 > >>> not null>, in the domain of logical types and IPC, what we are saying > >>> is that these types are: > >>> > >>> - logically speaking: primitive, non-nested types > >>> - their IPC layout is the flattened version of the nested List<UInt8> > >>> counterpart -- a single Field node having String type (with a null > >>> count, etc.), and 3 memory buffers: validity bitmap, offsets, and > >>> data. Structurally on the wire / in shared memory (compared with > >>> List<UInt8 not null>) the only difference is the Field metadata (since > >>> if null count is 0 for the inner UInt8 values, then there is only a > >>> single buffer) -- one node versus two > >>> > >>> Let me know if this does not make sense. > >>> > >>> To move this forward I propose to begin a Logical Types / IPC layout > >>> document and begin to document the mapping between logical types and > >>> their physical in-memory representation and layout on the wire. > >>> > >>> - Wes > >> > >> > > > > > > > > -- > > Julien >