[ 
https://issues.apache.org/jira/browse/ARROW-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205177#comment-16205177
 ] 

ASF GitHub Bot commented on ARROW-1674:
---------------------------------------

Github user wesm commented on the issue:

    https://github.com/apache/arrow/pull/1201
  
    @jacques-n @julienledem @kou could you give your thoughts on this 
particular issue?
    
    What we are running into on the C++ / Python side at least is that Arrow is 
becoming effectively a "platform for in-memory data management". So we have 
some overlapping pieces of tech:
    
    * The Arrow columnar format -- i.e. the memory that can be described by a 
RecordBatch, and that we document in the Markdown documents in 
https://github.com/apache/arrow/tree/master/format
    * The C++ libraries, which handle shared memory transport, zero copy reads, 
and metadata conversions to other runtimes
    
    Zero copy transport and full-fidelity metadata when dealing with other 
runtimes are important. So if a system hands us a 1GB buffer that is to be 
transported to and from shared memory with some metadata to describe the 
contents, it would be good to maximize what we can reasonably describe with the 
Arrow metadata to minimize the need for copying and conversions. I believe that 
we need to be able to expand our ability to represent diverse scalar type 
metadata without necessarily expanding what the "Arrow columnar format" means.
    
    In this particular case, other frameworks which may give memory to the 
Arrow libraries represent boolean data as a uint8/int8 array of 1's and 0's. So 
somehow we have to be able to distinguish uint8-as-boolean vs. uint8-as-uint8 
(and you can do a zero-copy cast from boolean-uint8 to plain uint8, of course). 
As this is a metadata-only concern I do not believe it constitutes expanding 
the definition of what constitutes boolean data according to the Arrow columnar 
format.
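    
    As a rough illustration of the distinction above (a sketch only, using 
plain Python buffers rather than any Arrow API; `pack_bits` is a hypothetical 
helper name), reinterpreting byte-per-value booleans as uint8 shares the same 
memory, whereas producing a 1-bit-per-value bitmap has to allocate and copy:

```python
def pack_bits(byte_bools) -> bytes:
    """Pack one-byte-per-value booleans into an LSB-first bitmap.

    Hypothetical helper for illustration: this path allocates a new
    buffer, so it is NOT zero-copy.
    """
    out = bytearray((len(byte_bools) + 7) // 8)
    for i, value in enumerate(byte_bools):
        if value:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


# Nine booleans as another framework might hand them over: one byte each.
source = bytearray([1, 0, 1, 1, 0, 0, 0, 1, 1])

# "Cast" boolean-uint8 to plain uint8: only the interpretation changes,
# the memoryview shares the underlying buffer with `source`.
as_uint8 = memoryview(source)

# Converting to a bit-packed representation requires a copy.
bitmap = pack_bits(source)
```

    Here `as_uint8` reflects any later write to `source`, while `bitmap` is 
an independent 2-byte buffer.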
    
    This particular implementation does require that Arrow implementations 
check whether the bit width of received boolean data is 1. If it is not, and 
they have no special handling for 8-bit boolean, they could simply treat the 
data as uint8/int8, possibly with some loss of metadata. As a result of this, I 
would not expect to integration test this type, as it would not necessarily be 
reasonable to expect other Arrow libraries to support extra metadata falling 
outside the primary Arrow specification.
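    
    The reader-side check described above might look roughly like the 
following. This is a sketch under stated assumptions, not the code in the pull 
request: the function name, the `supports_bool8` flag, the returned labels, and 
the default bit width of 1 when the metadata is absent are all hypothetical.

```python
def resolve_bool_storage(bit_width: int = 1, supports_bool8: bool = False) -> str:
    """Decide how a reader interprets a Bool field (hypothetical sketch).

    bit_width      -- assumed optional metadata; defaulting to 1 is an assumption
    supports_bool8 -- whether this reader has special handling for 8-bit boolean
    """
    if bit_width == 1:
        # Ordinary bit-packed boolean as defined by the columnar format.
        return "bool"
    if bit_width == 8:
        if supports_bool8:
            return "bool8"
        # Fallback: keep the bytes as-is and treat them as integers,
        # losing the "this is boolean" metadata.
        return "uint8"
    raise ValueError(f"unsupported boolean bit width: {bit_width}")
```

    A reader without 8-bit boolean support would get `"uint8"` from 
`resolve_bool_storage(8)`, which matches the "treat the data as uint8/int8" 
fallback.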


> [Format] Add bit width metadata to Bool logical type
> ----------------------------------------------------
>
>                 Key: ARROW-1674
>                 URL: https://issues.apache.org/jira/browse/ARROW-1674
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Format
>            Reporter: Wes McKinney
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> Some libraries represent boolean data as a single byte per value as a vector 
> of int8/uint8 1's and 0's. It would be useful to be able to retain this 
> metadata as an optional field on the {{Bool}} table in {{Schema.fbs}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
