[
https://issues.apache.org/jira/browse/ARROW-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205177#comment-16205177
]
ASF GitHub Bot commented on ARROW-1674:
---------------------------------------
Github user wesm commented on the issue:
https://github.com/apache/arrow/pull/1201
@jacques-n @julienledem @kou could you give your thoughts on this
particular issue?
What we are running into on the C++ / Python side at least is that Arrow is
becoming effectively a "platform for in-memory data management". So we have
some overlapping pieces of tech:
* The Arrow columnar format -- i.e. the memory that can be described by a
RecordBatch, and that we document in the Markdown documents in
https://github.com/apache/arrow/tree/master/format
* The C++ libraries, which handle shared memory transport, zero copy reads,
and metadata conversions to other runtimes
Zero-copy transport and full-fidelity metadata are important when dealing with
other runtimes. So if a system hands us a 1GB buffer to transport to and from
shared memory, along with some metadata describing its contents, it would be
good to maximize what we can reasonably describe with the Arrow metadata, to
minimize the need for copying and conversions. I believe that we need to be
able to expand our ability to represent diverse scalar type metadata without
necessarily expanding what the "Arrow columnar format" means.
In this particular case, other frameworks which may give memory to the
Arrow libraries represent boolean data as a uint8/int8 array of 1's and 0's. So
somehow we have to be able to distinguish uint8-as-boolean vs. uint8-as-uint8
(and you can do a zero-copy cast from boolean-uint8 to plain uint8, of course).
As this is a metadata-only concern, I do not believe it expands the definition
of what constitutes boolean data according to the Arrow columnar format.
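To make the distinction concrete, here is a minimal sketch using only the
Python standard library, not the Arrow libraries. The helper `pack_bits` is a
hypothetical name introduced for illustration. It assumes the byte-per-value
data holds only 0's and 1's, and that the 1-bit representation is packed least
significant bit first. Viewing the byte-per-value buffer as plain uint8 shares
the memory, while producing the 1-bit form needs a new buffer:

```python
def pack_bits(byte_bools):
    """Hypothetical helper: pack byte-per-value booleans into a bitmap.

    Assumes least-significant-bit-first ordering within each byte.
    """
    out = bytearray((len(byte_bools) + 7) // 8)
    for i, value in enumerate(byte_bools):
        if value:
            out[i // 8] |= 1 << (i % 8)
    return out


# Byte-per-value boolean data, as another framework might hand it over.
buf = bytearray([1, 0, 1, 1, 0])

# Zero-copy: the same memory, simply read as plain uint8 values.
as_uint8 = memoryview(buf)

# Not zero-copy: the 1-bit representation lives in a newly allocated buffer.
packed = pack_bits(buf)
```

Only the second conversion allocates, which is the copy that keeping the bit
width in the metadata would let us avoid.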
This particular implementation does require that Arrow implementations
check whether the bit width of received boolean data is 1. If it is not, and
they have no special handling for 8-bit boolean, they could simply treat the
data as uint8/int8, possibly with some loss of metadata. As a result, I would
not expect to have to integration test this type, as it would not necessarily
be reasonable to expect other Arrow libraries to support extra metadata falling
outside the primary Arrow specification.
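The check a reader would perform could look roughly like the sketch below.
Everything in it is hypothetical: the function name, the string labels, and
the `supports_byte_bool` flag are illustrative only and are not part of any
Arrow library API. It also assumes the fallback is uint8; an implementation
could equally choose int8.

```python
def resolve_bool_storage(bit_width, supports_byte_bool):
    """Hypothetical sketch: decide how to treat received boolean data.

    bit_width          -- bit width carried in the Bool metadata
    supports_byte_bool -- whether this reader special-cases 8-bit booleans
    """
    if bit_width == 1:
        # Standard bit-packed boolean from the columnar format.
        return "bool"
    if bit_width == 8:
        if supports_byte_bool:
            # Keep the boolean meaning of the byte-per-value data.
            return "bool8"
        # Fallback: plain integers; the "this is boolean" metadata is lost.
        return "uint8"
    raise ValueError(f"unsupported boolean bit width: {bit_width}")


print(resolve_bool_storage(8, supports_byte_bool=False))
```

A reader without 8-bit boolean support still gets usable data this way, at
the cost of no longer knowing the values were booleans.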
> [Format] Add bit width metadata to Bool logical type
> ----------------------------------------------------
>
> Key: ARROW-1674
> URL: https://issues.apache.org/jira/browse/ARROW-1674
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Format
> Reporter: Wes McKinney
> Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Some libraries represent boolean data with a single byte per value, as a
> vector of int8/uint8 1's and 0's. It would be useful to be able to retain this
> metadata as an optional field on the {{Bool}} table in {{Schema.fbs}}.