On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> Sorry for the late reply.
>
> This all sounds reasonable to me. But I'm not sure I understand exactly
> what you mean by
>
>> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
>> would be a single array unit in the buffer stream and flattened Field
>> metadata rather than nested types (2 array units as they are
>> presently).
>
>
> The way I read it this seems to me to contradict the cross-implementation as
> "List<UInt8-not null>"?
>
> Thanks,
> Micah
>
I think we can resolve this by starting a "Logical Types and IPC/RPC
layout" specification document. The schema metadata
(https://github.com/apache/arrow/blob/master/format/Message.fbs)
is, as I understand it, strictly the domain of logical types.

I believe there is some minor conflation of the notions of primitive
physical types and primitive logical types. While String / Binary have
identical physical layouts to List<UInt8 not null>, in the domain of
logical types and IPC, what we are saying is that these types are:

- logically speaking: primitive, non-nested types
- their IPC layout is the flattened version of the nested List<UInt8>
  counterpart -- a single Field node having String type (with a null
  count, etc.), and 3 memory buffers: validity bitmap, offsets, and
  data.

Structurally on the wire / in shared memory (compared with List<UInt8
not null>) the only difference is the Field metadata (since if null
count is 0 for the inner UInt8 values, then there is only a single
buffer) -- one node versus two.

Let me know if this does not make sense. To move this forward I propose
to begin a Logical Types / IPC layout document and begin to document the
mapping between logical types and their physical in-memory
representation and layout on the wire.

- Wes
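
To make the "one node versus two" point concrete, here is a minimal sketch in plain Python of the layout described above. It is not the Arrow implementation or the Message.fbs schema: the helper `build_string_buffers` and the dict-based "field nodes" are hypothetical names used only for illustration, and buffer padding/alignment is left out. It assumes 32-bit little-endian offsets and a validity bitmap with least-significant-bit numbering.

```python
import struct


def build_string_buffers(values):
    """Hypothetical helper: build (validity, offsets, data) for str-or-None values."""
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            # Set bit i (least-significant-bit numbering) to mark a non-null slot.
            validity[i // 8] |= 1 << (i % 8)
            data += value.encode("utf-8")
        # A null slot repeats the previous offset, so its length is zero.
        offsets.append(len(data))
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    return bytes(validity), offsets_buf, bytes(data)


values = ["foo", None, "barbaz"]
validity, offsets, data = build_string_buffers(values)
null_count = sum(v is None for v in values)

# Flattened String: a single field node and three buffers.
flat_nodes = [{"type": "String", "length": len(values), "null_count": null_count}]
flat_buffers = [validity, offsets, data]

# Nested List<UInt8 not null>: two field nodes over the same bytes.
# The child has null count 0, so it contributes only its data buffer here.
nested_nodes = [
    {"type": "List", "length": len(values), "null_count": null_count},
    {"type": "UInt8", "length": len(data), "null_count": 0},
]
nested_buffers = [validity, offsets, data]
```

In this sketch the buffers are identical in both cases; only the number of field nodes in the metadata changes, which is the distinction the proposed document would need to spell out.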