[
https://issues.apache.org/jira/browse/ARROW-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619562#comment-16619562
]
Gabriel Becker commented on ARROW-3263:
---------------------------------------
{quote}
I would suggest defining optional metadata to indicate that a field's null
values use the R sentinel value conventions.
{quote}
This could be OK; see below.
{quote}
That way an R consumer, if they see the custom metadata, do not have to examine
the valid bits and simply memcpy the values buffer for numbers. R, for its
part, could roundtrip data to Arrow format with less serialization work
{quote}
Small clarification here: with ALTREP, R would be able to operate in a read-only
manner on Arrow data with *zero* copies, not with a single copy. That is what
we want, I think.
{quote}
I don't think that using a specific value for null value slots is a good idea,
since it would introduce brittleness into implementations, as there are many
ways that a value could end up null. If you had to make a pass over the memory
to "sanitize" the null slots to use a particular value, then that would require
extra computing work in many cases.
{quote}
Well, if it is optional, the question becomes twofold in my mind:
# What is the default? Is Arrow going to produce R-compatible data unless an
option is turned off (for cases where people don't care about R and want the
extra speed), or is it going to be incompatible by default?
# Will the core machinery either automate or offer tools to do this sanitizing
pass, or will people be forced to write their own?
If the answer to 2 is that this is left to application owners, the practical
result would be that the vast majority of Arrow data would not be R-compatible,
which I suspect would dramatically curtail R users' interest in, and ability
to use, the Arrow ecosystem.
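To make the cost of the sanitizing pass in question 2 concrete, here is a
minimal sketch in Python of what such a helper might do for a 32-bit integer
column. The names `is_valid` and `sanitize_int32_nulls` and the plain
`bytearray` buffers are hypothetical and not part of any Arrow API. The sketch
assumes Arrow's validity bitmap convention (least-significant-bit ordering, a
set bit means the slot is valid) and a little-endian values buffer.

```python
import struct

R_NA_INTEGER = -2**31  # INT_MIN, R's sentinel for a missing 32-bit integer


def is_valid(validity: bytes, i: int) -> bool:
    """Assumed Arrow convention: LSB-ordered bitmap, set bit = valid slot."""
    return (validity[i // 8] >> (i % 8)) & 1 == 1


def sanitize_int32_nulls(values: bytearray, validity: bytes, length: int) -> int:
    """Overwrite every null slot of a little-endian int32 buffer with INT_MIN.

    Hypothetical helper for illustration only. Returns the number of slots
    that were rewritten.
    """
    rewritten = 0
    for i in range(length):
        if not is_valid(validity, i):
            struct.pack_into("<i", values, 4 * i, R_NA_INTEGER)
            rewritten += 1
    return rewritten


# Four slots; slots 1 and 3 are null and hold arbitrary leftover values.
values = bytearray(struct.pack("<4i", 10, 999, 30, -7))
validity = bytes([0b00000101])  # bits 0 and 2 set: slots 0 and 2 are valid

count = sanitize_int32_nulls(values, validity, 4)
result = list(struct.unpack("<4i", values))
```

The pass touches every slot once, which is the extra work the quoted objection
refers to. Whether that work is done by the core machinery or by each
application is exactly the open question.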
> Use R sentinel values for missingness in addition to bitmask
> ------------------------------------------------------------
>
> Key: ARROW-3263
> URL: https://issues.apache.org/jira/browse/ARROW-3263
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Format
> Reporter: Gabriel Becker
> Priority: Major
>
> R uses sentinel values to indicate missingness within atomic vectors (read:
> arrays in Arrow parlance, AFAIK).
> According to [~wesmckinn], the value stored in the array's memory is currently
> undefined for any slot that the validity bitmap marks as missing.
> This will force R to copy and modify the data whenever it adopts, as a native
> vector, Arrow data that contains missing values.
> If the slots were instead set to the relevant sentinel values (INT_MIN for
> 32-bit integers, and NaN with payload 1954 for double-precision floats) _in
> addition to_ the bit mask, then R would be able to use Arrow as intended while
> not breaking any other systems.
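For readers unfamiliar with the two sentinels named in the issue, the
following Python sketch shows their bit patterns. It works on raw bytes rather
than Python floats so that the NaN payload cannot be altered in transit. The
exact 64-bit pattern used for the double sentinel (high word 0x7FF00000, low
word 1954) is my understanding of R's sources and should be treated as an
assumption; the helper name `is_r_na_real` is hypothetical.

```python
import struct

INT32_NA = -2**31  # INT_MIN, the sentinel for 32-bit integers

# Assumed layout of R's double sentinel: exponent bits all ones,
# low 32-bit word equal to 1954 (0x7A2).
R_NA_REAL_BITS = 0x7FF0_0000_0000_07A2


def is_r_na_real(raw8: bytes) -> bool:
    """True if 8 little-endian bytes hold a NaN whose low 32-bit word is 1954."""
    (bits,) = struct.unpack("<Q", raw8)
    exponent_all_ones = (bits >> 52) & 0x7FF == 0x7FF
    mantissa_nonzero = bits & ((1 << 52) - 1) != 0
    low_word_is_1954 = bits & 0xFFFFFFFF == 1954
    return exponent_all_ones and mantissa_nonzero and low_word_is_1954


na_bytes = struct.pack("<Q", R_NA_REAL_BITS)
plain_nan_bytes = struct.pack("<d", float("nan"))
int_na_bytes = struct.pack("<i", INT32_NA)
```

An ordinary NaN is still a NaN to every other consumer, so writing the
payload-carrying pattern into null slots would only add information for R.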