stephhazlitt commented on code in PR #14514: URL: https://github.com/apache/arrow/pull/14514#discussion_r1029880019
########## r/vignettes/metadata.Rmd: ########## @@ -0,0 +1,82 @@ +--- +title: "Metadata" +description: > + Learn how Arrow uses Schemas to document structure of data objects, + and how R metadata are supported in Arrow +output: rmarkdown::html_vignette +--- + +This article describes the various data and metadata object types supplied by arrow, and documents how these objects are structured. + +```{r include=FALSE} +library(arrow, warn.conflicts = FALSE) +``` + +## Arrow metadata classes + +The arrow package defines the following classes for representing metadata: + +- A `Schema` is a list of `Field` objects used to describe the structure of a tabular data object; where +- A `Field` specifies a character string name and a `DataType`; and +- A `DataType` is an attribute controlling how values are represented + +Consider this: + +```{r} +df <- data.frame(x = 1:3, y = c("a", "b", "c")) +tb <- arrow_table(df) +tb$schema +``` + +The schema that has been automatically inferred could also be manually created: + +```{r} +schema( + field(name = "x", type = int32()), + field(name = "y", type = utf8()) +) +``` + +The `schema()` function allows the following shorthand to define fields: + +```{r} +schema(x = int32(), y = utf8()) +``` + +Sometimes it is important to specify the schema manually, particularly if you want fine-grained control over the Arrow data types: + +```{r} +arrow_table(df, schema = schema(x = int64(), y = utf8())) +arrow_table(df, schema = schema(x = float64(), y = utf8())) +``` + + +## R object attributes + +Arrow supports custom key-value metadata attached to Schemas. When we convert a `data.frame` to an Arrow Table or RecordBatch, the package stores any `attributes()` attached to the columns of the `data.frame` in the Arrow object Schema. Attributes added to objects in this fashion are stored under the `r` key, as shown below: + +```{r} +# data frame with custom metadata +df <- data.frame(x = 1:3, y = c("a", "b", "c")) +attr(df, "df_meta") <- "custom data frame metadata" +attr(df$y, "col_meta") <- "custom column metadata" + +# when converted to a Table, the metadata is preserved +tb <- arrow_table(df) +tb$metadata +``` + +It is also possible to assign additional string metadata under any other key you wish, using a command like this: + +```{r} +tb$metadata$new_key <- "new value" +``` + +Metadata attached to a Schema is preserved when writing the Table to Feather or Parquet. When reading those files into R, or when calling `as.data.frame()` on a Table or RecordBatch, the column attributes are restored to the columns of the resulting `data.frame`. This means that custom data types, including `haven::labelled`, `vctrs` annotations, and others, are preserved when doing a round-trip through Arrow. Review Comment: ```suggestion Metadata attached to a Schema is preserved when writing the Table to Arrow/Feather or Parquet formats. When reading those files into R, or when calling `as.data.frame()` on a Table or RecordBatch, the column attributes are restored to the columns of the resulting `data.frame`. This means that custom data types, including `haven::labelled`, `vctrs` annotations, and others, are preserved when doing a round-trip through Arrow. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org