[ 
https://issues.apache.org/jira/browse/ARROW-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931812#comment-16931812
 ] 

Wes McKinney commented on ARROW-3263:
-------------------------------------

Note that in Python no memory copying is required when converting pandas data 
that uses sentinel values for nulls to the Arrow format. Only a validity (null) 
bitmap has to be allocated. 

Here's an example:

{code}
In [1]: arr = np.random.randn(1000000)

In [2]: arr[::2] = np.nan

In [3]: arrow_arr = pa.array(arr, from_pandas=True)

In [4]: arrow_arr.null_count
Out[4]: 500000

In [5]: pa.total_allocated_bytes()
Out[5]: 125056
{code}

Here {{arrow_arr}} has two buffers for its memory layout: 

* validity bitmap
* values buffer

The values buffer is a zero-copy reference to the NumPy array that came from 
pandas. The validity bitmap must be allocated and populated according to the 
Arrow format, so only ~125 KB of memory has to be allocated, rather than the 
8 MB or more needed to create a new double array with 1e6 values.

> [R] Use R sentinel values for missingness in addition to bitmask
> ----------------------------------------------------------------
>
>                 Key: ARROW-3263
>                 URL: https://issues.apache.org/jira/browse/ARROW-3263
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Format, R
>            Reporter: Gabriel Becker
>            Priority: Major
>
> R uses sentinel values to indicate missingness within atomic vectors (read: 
> arrays in Arrow parlance, AFAIK). 
> According to [~wesmckinn], the value stored in the array's memory is currently 
> undefined wherever the bitmap marks an element as missing. 
> This will force R to copy and modify data whenever it adopts Arrow data 
> containing missing values as a native vector.
> If the relevant sentinel values (INT_MIN for 32-bit integers, and NaN with 
> payload 1954 for double-precision floats) were written _in addition to_ the 
> bitmask, then R would be able to use Arrow as intended without breaking any 
> other systems.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
