[
https://issues.apache.org/jira/browse/ARROW-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487247#comment-17487247
]
Dewey Dunnington commented on ARROW-15471:
------------------------------------------
Thanks...super relevant! I'm definitely not familiar with the details here.
In the process of playing with this, I found that R does preserve the
field-level metadata when handled as part of a RecordBatch just as you noted.
My comment above referred to a line where I noticed that a user can't specify
metadata for a field (although importing it via the C API works fine).
Implementing the ExtensionArray/ExtensionType and the registration mechanism is
what I hope to do with this ticket!
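As a rough illustration of what such a registration mechanism involves, here is a minimal, library-independent Python sketch. It is not Arrow or pyarrow API: `register_extension`, `resolve_field`, the `example.geometry` name, and the JSON metadata format are all hypothetical and chosen only for illustration. Only the two `ARROW:extension:*` metadata keys come from the issue itself.

```python
import json
from typing import Callable, Dict, Optional, Tuple

# Metadata keys used in the reprex in the issue description.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical registry: extension name -> function that deserializes
# the raw metadata string into something more useful.
_registry: Dict[str, Callable[[str], object]] = {}


def register_extension(name: str, deserialize: Callable[[str], object]) -> None:
    """Register a deserializer for one extension name (hypothetical helper)."""
    if name in _registry:
        raise KeyError(f"extension type already registered: {name}")
    _registry[name] = deserialize


def resolve_field(metadata: Dict[str, str]) -> Tuple[Optional[str], object]:
    """Return (extension name, metadata) for a field's key/value metadata.

    Unregistered extensions keep their name and raw metadata string
    instead of being dropped; registered ones get parsed metadata.
    """
    name = metadata.get(EXT_NAME_KEY)
    if name is None:
        return None, None  # plain field, no extension information
    raw = metadata.get(EXT_META_KEY, "")
    deserialize = _registry.get(name)
    if deserialize is None:
        return name, raw  # minimum behaviour: preserve, do not interpret
    return name, deserialize(raw)


# Register a made-up extension whose metadata is assumed to be JSON.
register_extension("example.geometry", json.loads)

name, params = resolve_field({
    EXT_NAME_KEY: "example.geometry",
    EXT_META_KEY: '{"crs": "EPSG:4326"}',
})

# An extension nobody registered still round-trips its name and metadata.
unknown = resolve_field({
    EXT_NAME_KEY: "extension name!",
    EXT_META_KEY: "bananas",
})
```

The two branches in `resolve_field` correspond to the two goals in the ticket: preserving the extension name and metadata at the very least, and letting a registered handler customize the result at best.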
> [R] ExtensionType support in R
> ------------------------------
>
> Key: ARROW-15471
> URL: https://issues.apache.org/jira/browse/ARROW-15471
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Dewey Dunnington
> Priority: Major
>
> In Python there is support for extension types that consists of a
> registration step that defines functions to handle metadata serialization and
> deserialization. In R, any extension name or metadata at the top level is
> currently obliterated on import. To implement geometry reading and writing to
> Parquet, IPC, and/or Feather, we will need to at the very least have the
> extension name and metadata preserved (in R), and at best provide a
> registration step to customize the behaviour of the resulting Array/DataType.
> Reprex for R:
> {code:R}
> # remotes::install_github("paleolimbot/narrow")
> library(narrow)
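> # Build a C-level array and attach extension and custom metadata to its schema
> carray <- as_narrow_array(1:5)
> carray$schema$metadata[["ARROW:extension:name"]] <- "extension name!"
> carray$schema$metadata[["ARROW:extension:metadata"]] <- "bananas"
> carray$schema$metadata[["something else"]] <- "more bananas"
> # Round-trip through an arrow::Array and back to a C-level array
> array <- from_narrow_array(carray, arrow::Array)
> carray2 <- as_narrow_array(array)
> # All three metadata entries are missing after the round trip
> carray2$schema$metadata[["ARROW:extension:name"]]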
> #> NULL
> carray2$schema$metadata[["ARROW:extension:metadata"]]
> #> NULL
> carray2$schema$metadata[["something else"]]
> #> NULL
> {code}
> There is some discussion of extension types as a solution to ARROW-14378, including an
> example of how pandas implements the 'interval' extension type (example
> contributed by [~jorisvandenbossche]).
> For the Interval example, there are some different parts living in different
> places:
> - The Arrow Extension Type definition for pandas' interval type:
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/_arrow_utils.py#L88-L136
> - The __arrow_array__ implementation (doing the conversion to arrow):
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/interval.py#L1405-L1455
> - The __from_arrow__ implementation (conversion arrow -> pandas):
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/dtypes/dtypes.py#L1227-L1255