[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

GitBox Tue, 15 Nov 2022 04:01:56 -0800


thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1022687152



##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+  Learn about fundamental data types in Apache Arrow and how those 
+  types are mapped onto corresponding data types in R 
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data 
types, and many data types that do not have a counterpart in R. This article 
describes the Arrow type system, compares it to R data types, and outlines the 
default mappings used when data are transferred from Arrow to R. At the end of 
the article there are two lookup tables: one describing the default "R to 
Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the 
differences between the output when obtain we use `dplyr::glimpse()` to inspect 
the `starwars` data in its original format -- as a data frame in R -- and the 
output we obtain when we convert it to an Arrow Table first by calling 
`arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the 
data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is 
labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow 
Table 
+- `height` is labelled `<int>` (integer vector) in the data frame; it is 
labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is 
labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact 
32-bit signed integers, so the underlying data types in Arrow and R are direct 
analogs of one another. In other cases the differences are purely about the 
implementation: Arrow and R have different ways to store a vector of strings, 
but at a high level of abstraction the R character type and the Arrow string 
type can be viewed as direct analogs. In some cases, however, there are no 
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it 
does not have an analog of POSIXlt; converselt, while R can represent 32 bit 
signed integers, it does not have an equivalent of a 64 bit unsigned integer.

Review Comment:
   ```suggestion
   Some of these differences are purely cosmetic: integers in R are in fact 
32-bit signed integers, so the underlying data types in Arrow and R are direct 
analogs of one another. In other cases the differences are purely about the 
implementation: Arrow and R have different ways to store a vector of strings, 
but at a high level of abstraction the R character type and the Arrow string 
type can be viewed as direct analogs. In some cases, however, there are no 
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it 
does not have an analog of POSIXlt; conversely, while R can represent 32 bit 
signed integers, it does not have an equivalent of a 64 bit unsigned integer.
   ```



##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+  Learn about fundamental data types in Apache Arrow and how those 
+  types are mapped onto corresponding data types in R 
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data 
types, and many data types that do not have a counterpart in R. This article 
describes the Arrow type system, compares it to R data types, and outlines the 
default mappings used when data are transferred from Arrow to R. At the end of 
the article there are two lookup tables: one describing the default "R to 
Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the 
differences between the output when obtain we use `dplyr::glimpse()` to inspect 
the `starwars` data in its original format -- as a data frame in R -- and the 
output we obtain when we convert it to an Arrow Table first by calling 
`arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the 
data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is 
labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow 
Table 
+- `height` is labelled `<int>` (integer vector) in the data frame; it is 
labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is 
labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact 
32-bit signed integers, so the underlying data types in Arrow and R are direct 
analogs of one another. In other cases the differences are purely about the 
implementation: Arrow and R have different ways to store a vector of strings, 
but at a high level of abstraction the R character type and the Arrow string 
type can be viewed as direct analogs. In some cases, however, there are no 
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it 
does not have an analog of POSIXlt; converselt, while R can represent 32 bit 
signed integers, it does not have an equivalent of a 64 bit unsigned integer.
+
+When the `arrow` package converts between R data and Arrow data, it will first 
check to see if a Schema has been provided -- see `schema()` for more 
information -- and if none is available it will attempt to guess the 
appropriate type by following the default mappings. A complete listing of these 
mappings is provided at the end of the article, but the most common cases are 
depicted in the illustration below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./data_types.png")
+```
+
+In this image, black boxes refer to R data types and light blue boxes refer to 
Arrow data types. Directional arrows specify conversions (e.g., the 
bidirectional arrow between the logical R type and the boolean Arrow type means 
that R logicals convert to Arrow booleans and vice versa). Solid lines indicate 
that the this conversion rule is always the default; dashed lines mean that it 
only sometimes applies (the rules and special cases are described below). 
+
+## Logical/boolean types
+
+Arrow and R both use three-valued logic. In R, logical values can be `TRUE` or 
`FALSE`, with `NA` used to represent missing data. In Arrow, the corresponding 
boolean type can take values `true`, `false`, or `null`, as shown below:
+
+```{r}
+chunked_array(c(TRUE, FALSE, NA), type = boolean()) # default
+```
+
+It is not strictly necessary to set `type = boolean()` in this example because 
the default behavior in `arrow` is to translate R logical vectors to Arrow 
booleans and vice versa. However, for the sake of clarity we will specify the 
data types explicitly throughout this article. We will likewise use 
`chunked_array()` to create Arrow data from R objects and `as.vector()` to 
create R data from Arrow objects, but similar results are obtained if we use 
other methods. 
+
+## Integer types
+
+Base R natively supports only one type of integer, using 32 bits to represent 
signed numbers between -2147483648 and 2147483647, though R can also support 64 
bit integers via the [`bit64`](https://cran.r-project.org/package=bit64) 
package. Arrow inherits signed and unsigned integer types from C++ in 8-bit, 
16-bit, 32-bit, and 64-bit versions:
+
+| Description     | Data Type Function | Smallest Value       |        Largest 
Value |
+| --------------- | -----------------: | -------------------: | 
-------------------: |
+| 8 bit unsigned  | `uint8()`          | 0                    |                
  255 |
+| 16 bit unsigned | `uint16()`         | 0                    |                
65535 |
+| 32 bit unsigned | `uint32()`         | 0                    |           
4294967295 |
+| 64 bit unsigned | `uint64()`         | 0                    | 
18446744073709551615 |
+| 8 bit signed    | `int8()`           | -128                 |                
  127 |
+| 16 bit signed   | `int16()`          | -32768               |                
32767 |
+| 32 bit signed   | `int32()`          | -2147483648          |           
2147483647 |
+| 64 bit signed   | `int64()`          | -9223372036854775808 |  
9223372036854775807 |
+
+By default, `arrow` translates R integers to the int32 type in Arrow, but you 
can override this by explicitly specifying another integer type:
+
+```{r}
+chunked_array(c(10L, 3L, 200L), type = int32()) # default
+chunked_array(c(10L, 3L, 200L), type = int64())
+```
+
+If the value in R does not fall within the permissible range for the 
corresponding Arrow type, `arrow` throws an error:
+
+```{r, error=TRUE}
+chunked_array(c(10L, 3L, 200L), type = int8())
+```
+
+When translating from Arrow to R, integer types alway translate to R integers 
unless one of the following exceptions applies:
+
+- If the value of an Arrow uint32 or uint64 falls outside the range allowed 
for R integers, the result will be a numeric vector in R 
+- If the value of an Arrow int64 variable falls outside the range allowed for 
R integers, the result will be a `bit64::integer64` vector in R
+- If the user sets `options(arrow.int64_downcast = FALSE)`, the Arrow int64 
type always yields a `bit64::integer64` vector in R regardless of the value
+
+## Floating point numeric types
+
+R has one double-precision (64-bit) numeric type, which translates to the 
Arrow 64-bit floating point type by default. Arrow supports both 
single-precision (32-bit) and double-precision (64-bit) floating point numbers, 
specified using the `float32()` and `float64()` data type functions. Both of 
these are translated to doubles in R. Examples are shown below:
+
+```{r}
+chunked_array(c(0.1, 0.2, 0.3), type = float64()) # default
+chunked_array(c(0.1, 0.2, 0.3), type = float32())
+
+arrow_double <- chunked_array(c(0.1, 0.2, 0.3), type = float64())
+as.vector(arrow_double)
+```
+
+Note that the Arrow specification also permits half-precision (16-bit) 
floating point numbers, but these have not yet been implemented. 
+
+## Fixed point decimal types
+
+Arrow also contains `decimal()` data types, in which numeric values are 
specified in decimal format rather than binary. Decimals in Arrow come in two 
varieties, a 128-bit version and a 256-bit version, but in most cases users 
should be able to use the more general `decimal()` data type function rather 
than the specific `decimal128()` and `decimal256()` functions. 
+
+The decimal types in Arrow are fixed-precision numbers (rather than 
floating-point), which means it is necessary to explicitly specify the 
`precision` and `scale` arguments:
+
+- `precision` specifies the number of significant digits to store.
+- `scale` specifies the number of digits that should be stored after the 
decimal point. If you set `scale = 2`, exactly two digits will be stored after 
the decimal point. If you set `scale = 0`, values will be rounded to the 
nearest whole number. Negative scales are also permitted (handy when dealing 
with extremely large numbers), so `scale = -2` stores the value to the nearest 
100.
+
+Because R does not have any way to create decimal types natively, the example 
below is a little circuitous. First we create some floating point numbers as 
Chunked Arrays, and then explicitly cast these to decimal types within Arrow. 
This is possible because Chunked Array objects possess a `cast()` method:
+
+```{r}
+arrow_floating <- chunked_array(c(.01, .1, 1, 10, 100))
+arrow_decimals <- arrow_floating$cast(decimal(precision = 5, scale = 2))
+arrow_decimals
+```
+
+Though not natively used in R, decimal types can be useful in situations where 
it is especially important to avoid problems that arise in floating point 
arithmetic.
+
+## String/character types
+
+R uses a single character type to represent strings whereas Arrow has two 
types. In the Arrow C++ library these types are referred to as strings and 
large_strings, but to avoid ambiguity in the `arrow` R package they are defined 
using the `utf8()` and `large_utf8()` data type functions. The distinction 
between these two Arrow types is unlikely to be important for R users, though 
the difference is discussed in the article on [data object 
layout](./developers/data_object_layout.html). 
+
+The default behavior is to translate R character vectors to the utf8/string 
type, and to translate both Arrow types to R character vectors:
+
+```{r}
+strings <- chunked_array(c("oh", "well", "whatever"))
+strings
+as.vector(strings)
+```
+
+## Factor/dictionary types
+
+The analog of R factors in Arrow is the dictionary type. Factors translate to 
dictionaries and vice versa. To illustrate this, let's create a small factor 
object in R:
+
+```{r}
+fct <- factor(c("cat", "dog", "pig", "dog"))
+fct
+```
+
+When translated to Arrow, this is the dictionary that results:
+
+```{r}
+dict <- chunked_array(fct, type = dictionary())
+dict
+```
+
+When translated back to R, we recover the original factor:
+
+```{r}
+as.vector(dict)
+```
+
+Arrow dictionaries are slightly more flexible than R factors: values in a 
dictionary do not necessarily have to be strings, but labels in a factor do. As 
a consequence, non-string values in an Arrow dictionary are coerced to strings 
when translated to R.
+
+## Date types
+
+In R, dates are typically represented using the Date class. Internally a Date 
object is a numeric type whose value counts the number of days since the 
beginning of the unix epoch (1 January 1970). Arrow supplies two data types 
that can be used to represent dates: the date32 type and the date64 type. The 
date32 type is similar to the Date class in R: internally it stores a 32-bit 
integer that counts the number of days since 1 January 1970. The default in 
`arrow` is to translate R Date objects to Arrow date32 types:

Review Comment:
   ```suggestion
   In R, dates are typically represented using the Date class. Internally a 
Date object is a numeric type whose value counts the number of days since the 
beginning of the Unix epoch (1 January 1970). Arrow supplies two data types 
that can be used to represent dates: the date32 type and the date64 type. The 
date32 type is similar to the Date class in R: internally it stores a 32-bit 
integer that counts the number of days since 1 January 1970. The default in 
`arrow` is to translate R Date objects to Arrow date32 types:
   ```



##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+  Learn about fundamental data types in Apache Arrow and how those 
+  types are mapped onto corresponding data types in R 
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data 
types, and many data types that do not have a counterpart in R. This article 
describes the Arrow type system, compares it to R data types, and outlines the 
default mappings used when data are transferred from Arrow to R. At the end of 
the article there are two lookup tables: one describing the default "R to 
Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the 
differences between the output when obtain we use `dplyr::glimpse()` to inspect 
the `starwars` data in its original format -- as a data frame in R -- and the 
output we obtain when we convert it to an Arrow Table first by calling 
`arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the 
data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is 
labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow 
Table 
+- `height` is labelled `<int>` (integer vector) in the data frame; it is 
labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is 
labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact 
32-bit signed integers, so the underlying data types in Arrow and R are direct 
analogs of one another. In other cases the differences are purely about the 
implementation: Arrow and R have different ways to store a vector of strings, 
but at a high level of abstraction the R character type and the Arrow string 
type can be viewed as direct analogs. In some cases, however, there are no 
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it 
does not have an analog of POSIXlt; converselt, while R can represent 32 bit 
signed integers, it does not have an equivalent of a 64 bit unsigned integer.
+
+When the `arrow` package converts between R data and Arrow data, it will first 
check to see if a Schema has been provided -- see `schema()` for more 
information -- and if none is available it will attempt to guess the 
appropriate type by following the default mappings. A complete listing of these 
mappings is provided at the end of the article, but the most common cases are 
depicted in the illustration below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./data_types.png")
+```
+
+In this image, black boxes refer to R data types and light blue boxes refer to 
Arrow data types. Directional arrows specify conversions (e.g., the 
bidirectional arrow between the logical R type and the boolean Arrow type means 
that R logicals convert to Arrow booleans and vice versa). Solid lines indicate 
that the this conversion rule is always the default; dashed lines mean that it 
only sometimes applies (the rules and special cases are described below). 

Review Comment:
   I don't recall what the plan was regarding the png files here, but a few 
comments on them:
   
   - the unidirectional and bidirectional arrows are really effective for 
simplifying the explanation here
   - can we make the line width and arrow head sizes larger?
   - the dashed lines seem to merge together and are hard to interpret
   



##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+  Learn about fundamental data types in Apache Arrow and how those 
+  types are mapped onto corresponding data types in R 
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data 
types, and many data types that do not have a counterpart in R. This article 
describes the Arrow type system, compares it to R data types, and outlines the 
default mappings used when data are transferred from Arrow to R. At the end of 
the article there are two lookup tables: one describing the default "R to 
Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the 
differences between the output when obtain we use `dplyr::glimpse()` to inspect 
the `starwars` data in its original format -- as a data frame in R -- and the 
output we obtain when we convert it to an Arrow Table first by calling 
`arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the 
data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is 
labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow 
Table 
+- `height` is labelled `<int>` (integer vector) in the data frame; it is 
labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is 
labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact 
32-bit signed integers, so the underlying data types in Arrow and R are direct 
analogs of one another. In other cases the differences are purely about the 
implementation: Arrow and R have different ways to store a vector of strings, 
but at a high level of abstraction the R character type and the Arrow string 
type can be viewed as direct analogs. In some cases, however, there are no 
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it 
does not have an analog of POSIXlt; converselt, while R can represent 32 bit 
signed integers, it does not have an equivalent of a 64 bit unsigned integer.

Review Comment:
   ```suggestion
   Some of these differences are purely cosmetic: integers in R are in fact 
32-bit signed integers, so the underlying data types in Arrow and R are direct 
analogs of one another. In other cases the differences are purely about the 
implementation: Arrow and R have different ways to store a vector of strings, 
but at a high level of abstraction the R character type and the Arrow string 
type can be viewed as direct analogs. In some cases, however, there are no 
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it 
does not have an analog of POSIXlt; converselt, while R can natively represent 
32 bit signed integers, it does not have an equivalent of a 64 bit unsigned 
integer.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Reply via email to