[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

GitBox Wed, 16 Nov 2022 14:25:28 -0800


djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1024566747



##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+  Learn about fundamental data types in Apache Arrow and how those 
+  types are mapped onto corresponding data types in R 
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data 
types, and many data types that do not have a counterpart in R. This article 
describes the Arrow type system, compares it to R data types, and outlines the 
default mappings used when data are transferred between R and Arrow. At the end of 
the article there are two lookup tables: one describing the default "R to 
Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the 
differences between the output we obtain when we use `dplyr::glimpse()` to inspect 
the `starwars` data in its original format -- as a data frame in R -- and the 
output we obtain when we convert it to an Arrow Table first by calling 
`arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the 
data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is 
labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow 
Table 
+- `height` is labelled `<int>` (integer vector) in the data frame; it is 
labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is 
labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact 
32-bit signed integers, so the underlying data types in Arrow and R are direct 
analogs of one another. In other cases the differences are purely about the 
implementation: Arrow and R have different ways to store a vector of strings, 
but at a high level of abstraction the R character type and the Arrow string 
type can be viewed as direct analogs. In some cases, however, there are no 
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it 
does not have an analog of POSIXlt; conversely, while R can represent 32-bit 
signed integers, it does not have an equivalent of a 64-bit unsigned integer.
+
+When the `arrow` package converts between R data and Arrow data, it will first 
check to see if a Schema has been provided -- see `schema()` for more 
information -- and if none is available it will attempt to guess the 
appropriate type by following the default mappings. A complete listing of these 
mappings is provided at the end of the article, but the most common cases are 
depicted in the illustration below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./data_types.png")
+```
+
+In this image, black boxes refer to R data types and light blue boxes refer to 
Arrow data types. Directional arrows specify conversions (e.g., the 
bidirectional arrow between the logical R type and the boolean Arrow type means 
that R logicals convert to Arrow booleans and vice versa). Solid lines indicate 
that this conversion rule is always the default; dashed lines mean that it 
only sometimes applies (the rules and special cases are described below). 

Review Comment:
   Latest push makes connecting lines and arrow heads bigger, and the dashed 
lines now stay as dashed lines the whole time. It doesn't feel like an ideal 
solution even still because the dashed lines kind of go all over the place, but 
I'm not 100% sure what the right answer is here because the reality is messy. 
One possibility might be to include a dashed line only for the "most typical 
case" (e.g., int64 in arrow -> integer in R) and direct the reader to the text 
for a detailed explanation?
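
   For reference, here is a minimal sketch of the int64 case mentioned above, which is one of the conversions that would be drawn with a dashed line. It assumes the `arrow` and `bit64` packages are installed and relies on my understanding of the current default behavior (int64 values that fit in 32 bits come back as a plain R integer, otherwise as `bit64::integer64`). The variable names are only illustrative.

```r
library(arrow, warn.conflicts = FALSE)

# Values that fit in 32 bits: the int64 Array is expected to come back
# as an ordinary R integer vector (the "most typical case").
small <- Array$create(c(1, 2, 3), type = int64())
small_r <- small$as_vector()
class(small_r)

# A value outside the 32-bit range: the result is expected to be a
# bit64::integer64 vector instead, which is why the line is dashed.
big <- Array$create(2^40, type = int64())
big_r <- big$as_vector()
class(big_r)
```

   If that understanding is right, the same Arrow type lands on two different R types depending on the values, which is the kind of detail a single dashed line plus a pointer to the text could cover.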
   
   As an aside one change in the pkgdown configuration is that the vignettes 
aren't being bundled into the build anymore (I think): it's web-only now. So 
the png files shouldn't contribute to the size of the package on CRAN. That 
seems like the right thing to do if we want to simultaneously have pretty docs 
and not run afoul of the CRAN size restrictions



