[jira] [Updated] (ARROW-8899) [R] Add R metadata like pandas metadata for round-trip fidelity

Neal Richardson (Jira) Wed, 10 Jun 2020 09:39:24 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Neal Richardson updated ARROW-8899:
-----------------------------------
    Description: 
Arrow Schema and Field objects have custom_metadata fields to store arbitrary 
strings in a key-value store. Pandas stores JSON in a "pandas" key and uses 
that to improve the fidelity of round-tripping data to Arrow/Parquet/Feather 
and back. 
https://pandas.pydata.org/docs/dev/development/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 describes this a bit.

You can see this pandas metadata in the sample Parquet file:

{code:r}
tab <- read_parquet(system.file("v0.7.1.parquet", package = "arrow"),
                    as_data_frame = FALSE)
tab

# Table
# 10 rows x 11 columns
# $carat <double>
# $cut <string>
# $color <string>
# $clarity <string>
# $depth <double>
# $table <double>
# $price <int64>
# $x <double>
# $y <double>
# $z <double>
# $__index_level_0__ <int64>

tab$metadata

# $pandas
# [1] "{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": 
[{\"name\": null, \"pandas_type\": \"string\", \"numpy_type\": \"object\", 
\"metadata\": null}], \"columns\": [{\"name\": \"carat\", \"pandas_type\": 
\"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": 
\"cut\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", 
\"metadata\": null}, {\"name\": \"color\", \"pandas_type\": \"unicode\", 
\"numpy_type\": \"object\", \"metadata\": null}, {\"name\": \"clarity\", 
\"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, 
{\"name\": \"depth\", \"pandas_type\": \"float64\", \"numpy_type\": 
\"float64\", \"metadata\": null}, {\"name\": \"table\", \"pandas_type\": 
\"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": 
\"price\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": 
null}, {\"name\": \"x\", \"pandas_type\": \"float64\", \"numpy_type\": 
\"float64\", \"metadata\": null}, {\"name\": \"y\", \"pandas_type\": 
\"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": 
\"z\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": 
null}, {\"name\": \"__index_level_0__\", \"pandas_type\": \"int64\", 
\"numpy_type\": \"int64\", \"metadata\": null}], \"pandas_version\": 
\"0.20.1\"}"
{code}

We should do something similar in R: store the "attributes" for each column in 
a data.frame when we convert to Arrow, and restore those attributes when we 
read from Arrow. 

Since ARROW-8703, you could naively do this all in R, something like:

{code:r}
tab$metadata$r <- lapply(df, attributes)
{code}

on the conversion to Arrow, and in as.data.frame(), do

{code:r}
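if (!is.null(tab$metadata$r)) {
  df[] <- mapply(function(col, meta) {
    # Reattach the saved attributes, then return the column itself
    # (otherwise the function returns the value of the assignment, meta)
    attributes(col) <- meta
    col
  }, col = df, meta = tab$metadata$r, SIMPLIFY = FALSE)
}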
{code}
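
To try the capture and restore steps end to end without Arrow, here is a minimal base R sketch. It assumes a plain list ({{saved}}, a hypothetical name) stands in for {{tab$metadata$r}}, and it strips the column attributes by hand to mimic columns that came back from Arrow without them. The example data is made up for illustration.

{code:r}
# Example data frame whose columns carry attributes (hypothetical data)
df <- data.frame(
  when = as.Date(c("2020-06-09", "2020-06-10")),
  size = factor(c("small", "large"), levels = c("small", "large"))
)
attr(df$when, "label") <- "Observation date"

# Capture: one list of attributes per column
saved <- lapply(df, attributes)

# Mimic a lossy round trip by dropping every column's attributes
bare <- df
bare[] <- lapply(df, function(col) {
  attributes(col) <- NULL
  col
})

# Restore: reattach the saved attributes and return each column
restored <- bare
restored[] <- mapply(function(col, meta) {
  attributes(col) <- meta
  col
}, col = bare, meta = saved, SIMPLIFY = FALSE)

# Inspect the result; the Date and factor classes should be back
str(restored)
{code}

{{SIMPLIFY = FALSE}} keeps {{mapply()}} from collapsing the result into a matrix, so it can be assigned back column by column with {{restored[] <-}}.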

However, it's trickier than this because:

* {{tab$metadata$r}} needs to be serialized to a string and deserialized on the 
way back. Pandas uses JSON, but the arrow R package doesn't currently have a 
JSON dependency. We could {{dput()}} the R attributes, but that introduces 
risk because consuming it means parsing and evaluating code. My best idea at 
the moment is {{rawToChar(serialize(x, NULL, ascii = TRUE))}} on the way out 
({{ascii = TRUE}} controls the output representation; it doesn't require ASCII 
inputs) and {{unserialize(charToRaw(x))}} on the way back. But maybe there's 
some lower-level way to do this better.
* We'll need to do the same for all places where Tables and RecordBatches are 
created/converted
* We'll need to make sure that nested types (structs) get the same coverage
* This metadata is attached only to Schemas, so Arrays/ChunkedArrays have no 
place to store extra metadata. We probably want to add a metadata/attributes 
field to the R6 (Chunked)Array objects so that converting an R vector to an 
array, or extracting an array from a record batch, doesn't lose the 
attributes.
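
The serialization idea in the first bullet can be tried in base R alone. The sketch below assumes the attributes are held in a plain list ({{meta}}, a hypothetical name with made-up contents) and passes {{connection = NULL}} so that {{serialize()}} returns a raw vector. How the resulting string would be written into Arrow's key-value metadata is not shown.

{code:r}
# Attributes to preserve, shaped like the result of lapply(df, attributes)
meta <- list(
  when = list(class = "Date"),
  size = list(levels = c("small", "large"), class = "factor"),
  note = list(label = "caf\u00e9")  # a non-ASCII value, to exercise ascii = TRUE
)

# Out: serialize to a raw vector in the ASCII representation,
# then convert the bytes to a single character string
encoded <- rawToChar(serialize(meta, connection = NULL, ascii = TRUE))

# Back: convert the string to raw bytes and unserialize
decoded <- unserialize(charToRaw(encoded))

# Check that the list survived the trip through a string
identical(decoded, meta)
{code}

The encoded string is specific to R's serialization format, so other Arrow implementations would have to treat it as an opaque value.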

Doing this should resolve ARROW-4390 and make ARROW-8867 trivial as well.

Finally, a note about this custom metadata vs. extension types. Extension types 
can be defined by [adding metadata to a 
Field|https://arrow.apache.org/docs/format/Columnar.html#extension-types] (in a 
Schema). I think this is out of scope here because we're only concerned with R 
roundtrip fidelity. If there were a type that (for example) R and Pandas both 
had that Arrow did not, we could define an extension type so that we could 
share that across the implementations. But unless/until there is value in 
establishing that extension type standard, let's not worry about it. (In other 
words, in R we should ignore pandas metadata; if there's anything that pandas 
wants to share with R, it will define it somewhere else.)

  was:
Arrow Schema and Field objects have custom_metadata fields to store arbitrary 
strings in a key-value store. Pandas stores JSON in a "pandas" key and uses 
that to improve the fidelity of round-tripping data to Arrow/Parquet/Feather 
and back. 
https://pandas.pydata.org/docs/dev/development/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
 describes this a bit.

You can see this pandas metadata in the sample Parquet file:

{code:r}
tab <- read_parquet(system.file("v0.7.1.parquet", package="arrow"), 
as_data_frame = FALSE)
tab

# Table
# 10 rows x 11 columns
# $carat <double>
# $cut <string>
# $color <string>
# $clarity <string>
# $depth <double>
# $table <double>
# $price <int64>
# $x <double>
# $y <double>
# $z <double>
# $__index_level_0__ <int64>

tab$metadata

# $pandas
# [1] "{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": 
[{\"name\": null, \"pandas_type\": \"string\", \"numpy_type\": \"object\", 
\"metadata\": null}], \"columns\": [{\"name\": \"carat\", \"pandas_type\": 
\"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": 
\"cut\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", 
\"metadata\": null}, {\"name\": \"color\", \"pandas_type\": \"unicode\", 
\"numpy_type\": \"object\", \"metadata\": null}, {\"name\": \"clarity\", 
\"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null}, 
{\"name\": \"depth\", \"pandas_type\": \"float64\", \"numpy_type\": 
\"float64\", \"metadata\": null}, {\"name\": \"table\", \"pandas_type\": 
\"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": 
\"price\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": 
null}, {\"name\": \"x\", \"pandas_type\": \"float64\", \"numpy_type\": 
\"float64\", \"metadata\": null}, {\"name\": \"y\", \"pandas_type\": 
\"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\": 
\"z\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\": 
null}, {\"name\": \"__index_level_0__\", \"pandas_type\": \"int64\", 
\"numpy_type\": \"int64\", \"metadata\": null}], \"pandas_version\": 
\"0.20.1\"}"
{code}

We should do something similar in R: store the "attributes" for each column in 
a data.frame when we convert to Arrow, and restore those attributes when we 
read from Arrow. 

Since ARROW-8703, you could naively do this all in R, something like:

{code:r}
tab$metadata$r <- lapply(df, attributes)
{code}

on the conversion to Arrow, and in as.data.frame(), do

{code:r}
if (!is.null(tab$metadata$r)) {
  df[] <- mapply(function(col, meta) {
    attributes(col) <- meta
  }, col = df, meta = tab$metadata$r)
}
{code}

However, it's trickier than this because:

* {{tab$metadata$r}} needs to be serialized to string and deserialized on the 
way back. Pandas uses JSON but arrow doesn't currently have a JSON R 
dependency. The C++ build does include rapidjson, maybe we could tap into that? 
Alternatively, we could {{dput()}} to dump the R attributes, which might have 
higher fidelity in addition to zero dependencies, but there are tradeoffs.
* We'll need to do the same for all places where Tables and RecordBatches are 
created/converted
* We'll need to make sure that nested types (structs) get the same coverage
* This metadata only is attached to Schemas, meaning that Arrays/ChunkedArrays 
don't have a place to store extra metadata. So we probably want to attach to 
the R6 (Chunked)Array objects a metadata/attributes field so that if we convert 
an R vector to array, or if we extract an array out of a record batch, we don't 
lose the attributes.

Doing this should resolve ARROW-4390 and make ARROW-8867 trivial as well.

Finally, a note about this custom metadata vs. extension types. Extension types 
can be defined by [adding metadata to a 
Field|https://arrow.apache.org/docs/format/Columnar.html#extension-types] (in a 
Schema). I think this is out of scope here because we're only concerned with R 
roundtrip fidelity. If there were a type that (for example) R and Pandas both 
had that Arrow did not, we could define an extension type so that we could 
share that across the implementations. But unless/until there is value in 
establishing that extension type standard, let's not worry with it. (In other 
words, in R we should ignore pandas metadata; if there's anything that pandas 
wants to share with R, it will define it somewhere else.)


> [R] Add R metadata like pandas metadata for round-trip fidelity
> ---------------------------------------------------------------
>
>                 Key: ARROW-8899
>                 URL: https://issues.apache.org/jira/browse/ARROW-8899
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Critical
>             Fix For: 1.0.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
