[
https://issues.apache.org/jira/browse/ARROW-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487156#comment-17487156
]
Joris Van den Bossche commented on ARROW-15471:
-----------------------------------------------
I don't know if it is exactly relevant here, but a few notes:
- In the C++ implementation, the field-level metadata (where the extension name
and metadata are stored in serialized (IPC / C Data Interface schema) form)
lives in the {{Field}} class, so an Array object itself cannot hold this
metadata in C++. Thus, if you recreate an array by importing it from the C
Data Interface, it is "expected" that the metadata is gone. That is also the
reason why, if the array is part of a RecordBatch (which has a schema, with
fields, potentially with metadata), this metadata is preserved (in the
schema of the RecordBatch).
- The "registration" of extension types means that, e.g. while deserializing an
IPC schema message, if we encounter extension type metadata, we (meaning the
C++ implementation) create an ExtensionArray with an ExtensionType (and drop
the field metadata). If the name of the extension type is _not_ registered, we
keep the actual storage array/type and preserve the metadata in the field.
- As far as I know, ExtensionArray / ExtensionType is not yet exposed in the R
bindings? I suppose that means you basically always get the storage array/type.
(I don't know what would happen if you actually had an extension type in a C++
RecordBatch and then accessed it from R, but I suppose this is quite
difficult to achieve right now, since there are no extension types registered
by default.)
- The fact that R doesn't preserve field-level metadata in the schema seems
like a separate issue? (I mean that fixing it does not require full extension
type support in R, but it is of course relevant here because extension types
are not yet supported in R, so this information falls back to being kept in
field metadata.)
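
To make the registered / not-registered distinction from the second note
concrete, here is a toy model in plain Python. It is only a sketch and not the
Arrow API: ToyField, REGISTRY, register_extension and import_field are
hypothetical names, types are represented as plain strings, and it assumes
that only the two "ARROW:extension:*" keys are removed from the field metadata
when a registered extension type is created.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"


@dataclass
class ToyField:
    """Hypothetical stand-in for a schema field: name, type label, metadata."""
    name: str
    type: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical registry: extension name -> function that builds a type label
# from the storage type and the serialized extension metadata.
REGISTRY: Dict[str, Callable[[str, str], str]] = {}


def register_extension(name: str, factory: Callable[[str, str], str]) -> None:
    REGISTRY[name] = factory


def import_field(f: ToyField) -> ToyField:
    """Mimic what happens to one field while a schema is deserialized."""
    factory = REGISTRY.get(f.metadata.get(EXT_NAME_KEY))
    if factory is None:
        # Name not registered (or absent): keep the storage type and
        # preserve all metadata on the field.
        return ToyField(f.name, f.type, dict(f.metadata))
    # Name registered: build the extension type and drop the extension keys
    # (assumption: unrelated metadata keys stay on the field).
    remaining = {k: v for k, v in f.metadata.items()
                 if k not in (EXT_NAME_KEY, EXT_META_KEY)}
    ext_type = factory(f.type, f.metadata.get(EXT_META_KEY, ""))
    return ToyField(f.name, ext_type, remaining)


serialized = ToyField("geom", "binary", {
    EXT_NAME_KEY: "example.geometry",
    EXT_META_KEY: "crs=EPSG:4326",
})

# Before registration: storage type plus metadata.
unregistered = import_field(serialized)

# After registration: extension type, extension keys gone from the metadata.
register_extension(
    "example.geometry",
    lambda storage, meta: f"extension<example.geometry[{meta}] over {storage}>",
)
registered = import_field(serialized)
```

In this model the metadata always lives on the field and never on the array
values, which mirrors the first note: anything that only carries the array
across has nowhere to keep the extension name.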
> [R] ExtensionType support in R
> ------------------------------
>
> Key: ARROW-15471
> URL: https://issues.apache.org/jira/browse/ARROW-15471
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Dewey Dunnington
> Priority: Major
>
> In Python there is support for extension types that consists of a
> registration step that defines functions to handle metadata serialization and
> deserialization. In R, any extension name or metadata at the top level is
> currently obliterated on import. To implement geometry reading and writing to
> Parquet, IPC, and/or Feather, we will need to at the very least have the
> extension name and metadata preserved (in R), and at best provide a
> registration step to customize the behaviour of the resulting Array/DataType.
> Reprex for R:
> {code:R}
> # remotes::install_github("paleolimbot/narrow")
> library(narrow)
> carray <- as_narrow_array(1:5)
> carray$schema$metadata[["ARROW:extension:name"]] <- "extension name!"
> carray$schema$metadata[["ARROW:extension:metadata"]] <- "bananas"
> carray$schema$metadata[["something else"]] <- "more bananas"
> array <- from_narrow_array(carray, arrow::Array)
> carray2 <- as_narrow_array(array)
> carray2$schema$metadata[["ARROW:extension:name"]]
> #> NULL
> carray2$schema$metadata[["ARROW:extension:metadata"]]
> #> NULL
> carray2$schema$metadata[["something else"]]
> #> NULL
> {code}
> There is some discussion of extension types as a solution to ARROW-14378,
> including an example of how pandas implements the 'interval' extension type
> (example contributed by [~jorisvandenbossche]).
> For the Interval example, there are some different parts living in different
> places:
> - The Arrow Extension Type definition for pandas' interval type:
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/_arrow_utils.py#L88-L136
> - The __arrow_array__ implementation (conversion pandas -> arrow):
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/interval.py#L1405-L1455
> - The __from_arrow__ implementation (conversion arrow -> pandas):
> https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/dtypes/dtypes.py#L1227-L1255