Neal Richardson created ARROW-8899:
--------------------------------------
Summary: [R] Add R metadata like pandas metadata for round-trip
fidelity
Key: ARROW-8899
URL: https://issues.apache.org/jira/browse/ARROW-8899
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Neal Richardson
Fix For: 1.0.0
Arrow Schema and Field objects have custom_metadata fields to store arbitrary
strings in a key-value store. Pandas stores JSON in a "pandas" key and uses
that to improve the fidelity of round-tripping data to Arrow/Parquet/Feather
and back.
https://pandas.pydata.org/docs/dev/development/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
describes this a bit.
You can see this pandas metadata in the sample Parquet file:
{code:r}
library(arrow)
tab <- read_parquet(system.file("v0.7.1.parquet", package = "arrow"),
                    as_data_frame = FALSE)
tab
# Table
# 10 rows x 11 columns
# $carat <double>
# $cut <string>
# $color <string>
# $clarity <string>
# $depth <double>
# $table <double>
# $price <int64>
# $x <double>
# $y <double>
# $z <double>
# $__index_level_0__ <int64>
tab$metadata
# $pandas
# [1] "{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\":
# [{\"name\": null, \"pandas_type\": \"string\", \"numpy_type\": \"object\",
# \"metadata\": null}], \"columns\": [{\"name\": \"carat\", \"pandas_type\":
# \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\":
# \"cut\", \"pandas_type\": \"unicode\", \"numpy_type\": \"object\",
# \"metadata\": null}, {\"name\": \"color\", \"pandas_type\": \"unicode\",
# \"numpy_type\": \"object\", \"metadata\": null}, {\"name\": \"clarity\",
# \"pandas_type\": \"unicode\", \"numpy_type\": \"object\", \"metadata\": null},
# {\"name\": \"depth\", \"pandas_type\": \"float64\", \"numpy_type\":
# \"float64\", \"metadata\": null}, {\"name\": \"table\", \"pandas_type\":
# \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\":
# \"price\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\":
# null}, {\"name\": \"x\", \"pandas_type\": \"float64\", \"numpy_type\":
# \"float64\", \"metadata\": null}, {\"name\": \"y\", \"pandas_type\":
# \"float64\", \"numpy_type\": \"float64\", \"metadata\": null}, {\"name\":
# \"z\", \"pandas_type\": \"float64\", \"numpy_type\": \"float64\", \"metadata\":
# null}, {\"name\": \"__index_level_0__\", \"pandas_type\": \"int64\",
# \"numpy_type\": \"int64\", \"metadata\": null}], \"pandas_version\":
# \"0.20.1\"}"
{code}
We should do something similar in R: store the "attributes" for each column in
a data.frame when we convert to Arrow, and restore those attributes when we
read from Arrow.
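As a rough illustration of what would need to be preserved, the base R snippet below lists the attributes each column of a data.frame carries. It makes no arrow calls, and the data frame {{df}} and its columns are made up for the example.
{code:r}
# Hypothetical example data; not taken from the arrow package.
df <- data.frame(
  grade = factor(c("b", "a", "b"), levels = c("a", "b", "c")),
  when = as.POSIXct(c("2020-05-21 12:00:00", "2020-05-21 12:01:00",
                      "2020-05-21 12:02:00"), tz = "UTC"),
  n = 1:3
)

# One list entry per column; NULL for columns that carry no attributes.
col_attrs <- lapply(df, attributes)

names(col_attrs$grade)  # "levels" "class"
col_attrs$when$tzone    # "UTC"
is.null(col_attrs$n)    # TRUE: a plain integer vector has no attributes
{code}
Factor levels, classes, and time zones are the kind of per-column information that a round trip through Arrow would presumably have to carry in the metadata.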
Since ARROW-8703, you could naively do this all in R, something like:
{code:r}
tab$metadata$r <- lapply(df, attributes)
{code}
when converting to Arrow, and then in {{as.data.frame()}}, do
{code:r}
if (!is.null(tab$metadata$r)) {
  df[] <- mapply(function(col, meta) {
    attributes(col) <- meta
    col  # return the modified column, not the value of the assignment
  }, col = df, meta = tab$metadata$r, SIMPLIFY = FALSE)
}
{code}
However, it's trickier than this because:
* {{tab$metadata$r}} needs to be serialized to a string and deserialized on the
way back. Pandas uses JSON, but arrow doesn't currently have a JSON R
dependency. The C++ build does include rapidjson; maybe we could tap into that?
Alternatively, we could {{dput()}} to dump the R attributes, which might have
higher fidelity in addition to zero dependencies, but there are tradeoffs.
* We'll need to do the same in all places where Tables and RecordBatches are
created/converted.
* We'll need to make sure that nested types (structs) get the same coverage.
* This metadata is attached only to Schemas, meaning that Arrays/ChunkedArrays
don't have a place to store extra metadata. So we probably want to attach a
metadata/attributes field to the R6 (Chunked)Array objects so that if we convert
an R vector to an array, or if we extract an array out of a record batch, we don't
lose the attributes.
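To make the {{dput()}} option from the first bullet concrete, here is a minimal base R sketch of serializing an attribute list to a single string and reading it back. The helper names {{serialize_attrs()}} and {{deserialize_attrs()}} are hypothetical and not part of arrow. It uses {{deparse()}}, which produces the kind of text {{dput()}} prints.
{code:r}
# Hypothetical helpers; the names are illustrative, not part of arrow.
serialize_attrs <- function(x) {
  # deparse() returns R source text as a character vector; join it into
  # one string so it fits in a key-value metadata store.
  paste(deparse(x), collapse = "\n")
}

deserialize_attrs <- function(s) {
  # Evaluating parsed text can run arbitrary code, so this is only
  # appropriate for trusted input.
  eval(parse(text = s))
}

# Attributes of a made-up factor column, plus a column with no attributes.
meta <- list(
  grade = attributes(factor(c("b", "a"), levels = c("a", "b", "c"))),
  n = NULL
)

s <- serialize_attrs(meta)
restored <- deserialize_attrs(s)

# Re-attach the restored attributes to bare integer codes.
x <- c(2L, 1L)
attributes(x) <- restored$grade
is.factor(x)
{code}
One tradeoff is visible in the sketch: reading the string back means evaluating R code, which a JSON-based format would avoid.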
Doing this should resolve ARROW-4390 and make ARROW-8867 trivial as well.
Finally, a note about this custom metadata vs. extension types. Extension types
can be defined by [adding metadata to a
Field|https://arrow.apache.org/docs/format/Columnar.html#extension-types] (in a
Schema). I think this is out of scope here because we're only concerned with R
roundtrip fidelity. If there were a type that (for example) R and Pandas both
had that Arrow did not, we could define an extension type so that we could
share that across the implementations. But unless/until there is value in
establishing that extension type standard, let's not worry about it. (In other
words, in R we should ignore pandas metadata; if there's anything that pandas
wants to share with R, it will define it somewhere else.)