thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1022687152
##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+ Learn about fundamental data types in Apache Arrow and how those
+ types are mapped onto corresponding data types in R
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data
types, and many data types that do not have a counterpart in R. This article
describes the Arrow type system, compares it to R data types, and outlines the
default mappings used when data are transferred from Arrow to R. At the end of
the article there are two lookup tables: one describing the default "R to
Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the
differences between the output when obtain we use `dplyr::glimpse()` to inspect
the `starwars` data in its original format -- as a data frame in R -- and the
output we obtain when we convert it to an Arrow Table first by calling
`arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the
data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is
labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow
Table
+- `height` is labelled `<int>` (integer vector) in the data frame; it is
labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is
labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact
32-bit signed integers, so the underlying data types in Arrow and R are direct
analogs of one another. In other cases the differences are purely about the
implementation: Arrow and R have different ways to store a vector of strings,
but at a high level of abstraction the R character type and the Arrow string
type can be viewed as direct analogs. In some cases, however, there are no
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it
does not have an analog of POSIXlt; converselt, while R can represent 32 bit
signed integers, it does not have an equivalent of a 64 bit unsigned integer.
Review Comment:
```suggestion
Some of these differences are purely cosmetic: integers in R are in fact
32-bit signed integers, so the underlying data types in Arrow and R are direct
analogs of one another. In other cases the differences are purely about the
implementation: Arrow and R have different ways to store a vector of strings,
but at a high level of abstraction the R character type and the Arrow string
type can be viewed as direct analogs. In some cases, however, there are no
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it
does not have an analog of POSIXlt; conversely, while R can represent 32 bit
signed integers, it does not have an equivalent of a 64 bit unsigned integer.
```
##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+ Learn about fundamental data types in Apache Arrow and how those
+ types are mapped onto corresponding data types in R
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data
types, and many data types that do not have a counterpart in R. This article
describes the Arrow type system, compares it to R data types, and outlines the
default mappings used when data are transferred from Arrow to R. At the end of
the article there are two lookup tables: one describing the default "R to
Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the
differences between the output when obtain we use `dplyr::glimpse()` to inspect
the `starwars` data in its original format -- as a data frame in R -- and the
output we obtain when we convert it to an Arrow Table first by calling
`arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the
data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is
labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow
Table
+- `height` is labelled `<int>` (integer vector) in the data frame; it is
labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is
labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact
32-bit signed integers, so the underlying data types in Arrow and R are direct
analogs of one another. In other cases the differences are purely about the
implementation: Arrow and R have different ways to store a vector of strings,
but at a high level of abstraction the R character type and the Arrow string
type can be viewed as direct analogs. In some cases, however, there are no
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it
does not have an analog of POSIXlt; converselt, while R can represent 32 bit
signed integers, it does not have an equivalent of a 64 bit unsigned integer.
+
+When the `arrow` package converts between R data and Arrow data, it will first
check to see if a Schema has been provided -- see `schema()` for more
information -- and if none is available it will attempt to guess the
appropriate type by following the default mappings. A complete listing of these
mappings is provided at the end of the article, but the most common cases are
depicted in the illustration below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./data_types.png")
+```
+
+In this image, black boxes refer to R data types and light blue boxes refer to
Arrow data types. Directional arrows specify conversions (e.g., the
bidirectional arrow between the logical R type and the boolean Arrow type means
that R logicals convert to Arrow booleans and vice versa). Solid lines indicate
that the this conversion rule is always the default; dashed lines mean that it
only sometimes applies (the rules and special cases are described below).
+
+## Logical/boolean types
+
+Arrow and R both use three-valued logic. In R, logical values can be `TRUE` or
`FALSE`, with `NA` used to represent missing data. In Arrow, the corresponding
boolean type can take values `true`, `false`, or `null`, as shown below:
+
+```{r}
+chunked_array(c(TRUE, FALSE, NA), type = boolean()) # default
+```
+
+It is not strictly necessary to set `type = boolean()` in this example because
the default behavior in `arrow` is to translate R logical vectors to Arrow
booleans and vice versa. However, for the sake of clarity we will specify the
data types explicitly throughout this article. We will likewise use
`chunked_array()` to create Arrow data from R objects and `as.vector()` to
create R data from Arrow objects, but similar results are obtained if we use
other methods.
+
+## Integer types
+
+Base R natively supports only one type of integer, using 32 bits to represent
signed numbers between -2147483648 and 2147483647, though R can also support 64
bit integers via the [`bit64`](https://cran.r-project.org/package=bit64)
package. Arrow inherits signed and unsigned integer types from C++ in 8-bit,
16-bit, 32-bit, and 64-bit versions:
+
+| Description | Data Type Function | Smallest Value | Largest
Value |
+| --------------- | -----------------: | -------------------: |
-------------------: |
+| 8 bit unsigned | `uint8()` | 0 |
255 |
+| 16 bit unsigned | `uint16()` | 0 |
65535 |
+| 32 bit unsigned | `uint32()` | 0 |
4294967295 |
+| 64 bit unsigned | `uint64()` | 0 |
18446744073709551615 |
+| 8 bit signed | `int8()` | -128 |
127 |
+| 16 bit signed | `int16()` | -32768 |
32767 |
+| 32 bit signed | `int32()` | -2147483648 |
2147483647 |
+| 64 bit signed | `int64()` | -9223372036854775808 |
9223372036854775807 |
+
+By default, `arrow` translates R integers to the int32 type in Arrow, but you
can override this by explicitly specifying another integer type:
+
+```{r}
+chunked_array(c(10L, 3L, 200L), type = int32()) # default
+chunked_array(c(10L, 3L, 200L), type = int64())
+```
+
+If the value in R does not fall within the permissible range for the
corresponding Arrow type, `arrow` throws an error:
+
+```{r, error=TRUE}
+chunked_array(c(10L, 3L, 200L), type = int8())
+```
+
+When translating from Arrow to R, integer types alway translate to R integers
unless one of the following exceptions applies:
+
+- If the value of an Arrow uint32 or uint64 falls outside the range allowed
for R integers, the result will be a numeric vector in R
+- If the value of an Arrow int64 variable falls outside the range allowed for
R integers, the result will be a `bit64::integer64` vector in R
+- If the user sets `options(arrow.int64_downcast = FALSE)`, the Arrow int64
type always yields a `bit64::integer64` vector in R regardless of the value
+
+## Floating point numeric types
+
+R has one double-precision (64-bit) numeric type, which translates to the
Arrow 64-bit floating point type by default. Arrow supports both
single-precision (32-bit) and double-precision (64-bit) floating point numbers,
specified using the `float32()` and `float64()` data type functions. Both of
these are translated to doubles in R. Examples are shown below:
+
+```{r}
+chunked_array(c(0.1, 0.2, 0.3), type = float64()) # default
+chunked_array(c(0.1, 0.2, 0.3), type = float32())
+
+arrow_double <- chunked_array(c(0.1, 0.2, 0.3), type = float64())
+as.vector(arrow_double)
+```
+
+Note that the Arrow specification also permits half-precision (16-bit)
floating point numbers, but these have not yet been implemented.
+
+## Fixed point decimal types
+
+Arrow also contains `decimal()` data types, in which numeric values are
specified in decimal format rather than binary. Decimals in Arrow come in two
varieties, a 128-bit version and a 256-bit version, but in most cases users
should be able to use the more general `decimal()` data type function rather
than the specific `decimal128()` and `decimal256()` functions.
+
+The decimal types in Arrow are fixed-precision numbers (rather than
floating-point), which means it is necessary to explicitly specify the
`precision` and `scale` arguments:
+
+- `precision` specifies the number of significant digits to store.
+- `scale` specifies the number of digits that should be stored after the
decimal point. If you set `scale = 2`, exactly two digits will be stored after
the decimal point. If you set `scale = 0`, values will be rounded to the
nearest whole number. Negative scales are also permitted (handy when dealing
with extremely large numbers), so `scale = -2` stores the value to the nearest
100.
+
+Because R does not have any way to create decimal types natively, the example
below is a little circuitous. First we create some floating point numbers as
Chunked Arrays, and then explicitly cast these to decimal types within Arrow.
This is possible because Chunked Array objects possess a `cast()` method:
+
+```{r}
+arrow_floating <- chunked_array(c(.01, .1, 1, 10, 100))
+arrow_decimals <- arrow_floating$cast(decimal(precision = 5, scale = 2))
+arrow_decimals
+```
+
+Though not natively used in R, decimal types can be useful in situations where
it is especially important to avoid problems that arise in floating point
arithmetic.
+
+## String/character types
+
+R uses a single character type to represent strings whereas Arrow has two
types. In the Arrow C++ library these types are referred to as strings and
large_strings, but to avoid ambiguity in the `arrow` R package they are defined
using the `utf8()` and `large_utf8()` data type functions. The distinction
between these two Arrow types is unlikely to be important for R users, though
the difference is discussed in the article on [data object
layout](./developers/data_object_layout.html).
+
+The default behavior is to translate R character vectors to the utf8/string
type, and to translate both Arrow types to R character vectors:
+
+```{r}
+strings <- chunked_array(c("oh", "well", "whatever"))
+strings
+as.vector(strings)
+```
+
+## Factor/dictionary types
+
+The analog of R factors in Arrow is the dictionary type. Factors translate to
dictionaries and vice versa. To illustrate this, let's create a small factor
object in R:
+
+```{r}
+fct <- factor(c("cat", "dog", "pig", "dog"))
+fct
+```
+
+When translated to Arrow, this is the dictionary that results:
+
+```{r}
+dict <- chunked_array(fct, type = dictionary())
+dict
+```
+
+When translated back to R, we recover the original factor:
+
+```{r}
+as.vector(dict)
+```
+
+Arrow dictionaries are slightly more flexible than R factors: values in a
dictionary do not necessarily have to be strings, but labels in a factor do. As
a consequence, non-string values in an Arrow dictionary are coerced to strings
when translated to R.
+
+## Date types
+
+In R, dates are typically represented using the Date class. Internally a Date
object is a numeric type whose value counts the number of days since the
beginning of the unix epoch (1 January 1970). Arrow supplies two data types
that can be used to represent dates: the date32 type and the date64 type. The
date32 type is similar to the Date class in R: internally it stores a 32-bit
integer that counts the number of days since 1 January 1970. The default in
`arrow` is to translate R Date objects to Arrow date32 types:
Review Comment:
```suggestion
In R, dates are typically represented using the Date class. Internally a
Date object is a numeric type whose value counts the number of days since the
beginning of the Unix epoch (1 January 1970). Arrow supplies two data types
that can be used to represent dates: the date32 type and the date64 type. The
date32 type is similar to the Date class in R: internally it stores a 32-bit
integer that counts the number of days since 1 January 1970. The default in
`arrow` is to translate R Date objects to Arrow date32 types:
```
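For reference, the day-count representation described in this paragraph can be checked in base R. This is a minimal sketch that uses only base R and makes no `arrow` calls; the variable names are illustrative and do not come from the vignette:

```r
# Dates in R are stored as a day count relative to 1 January 1970
epoch <- as.Date("1970-01-01")
next_day <- as.Date("1970-01-02")

unclass(epoch)     # 0
unclass(next_day)  # 1

# The underlying storage is a double, even though the values are whole days
typeof(next_day)   # "double"
```

This may be why the text says "a numeric type" for R, in contrast to the 32-bit integer used by date32.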
##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+ Learn about fundamental data types in Apache Arrow and how those
+ types are mapped onto corresponding data types in R
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data
types, and many data types that do not have a counterpart in R. This article
describes the Arrow type system, compares it to R data types, and outlines the
default mappings used when data are transferred from Arrow to R. At the end of
the article there are two lookup tables: one describing the default "R to
Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the
differences between the output when obtain we use `dplyr::glimpse()` to inspect
the `starwars` data in its original format -- as a data frame in R -- and the
output we obtain when we convert it to an Arrow Table first by calling
`arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the
data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is
labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow
Table
+- `height` is labelled `<int>` (integer vector) in the data frame; it is
labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is
labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact
32-bit signed integers, so the underlying data types in Arrow and R are direct
analogs of one another. In other cases the differences are purely about the
implementation: Arrow and R have different ways to store a vector of strings,
but at a high level of abstraction the R character type and the Arrow string
type can be viewed as direct analogs. In some cases, however, there are no
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it
does not have an analog of POSIXlt; converselt, while R can represent 32 bit
signed integers, it does not have an equivalent of a 64 bit unsigned integer.
+
+When the `arrow` package converts between R data and Arrow data, it will first
check to see if a Schema has been provided -- see `schema()` for more
information -- and if none is available it will attempt to guess the
appropriate type by following the default mappings. A complete listing of these
mappings is provided at the end of the article, but the most common cases are
depicted in the illustration below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./data_types.png")
+```
+
+In this image, black boxes refer to R data types and light blue boxes refer to
Arrow data types. Directional arrows specify conversions (e.g., the
bidirectional arrow between the logical R type and the boolean Arrow type means
that R logicals convert to Arrow booleans and vice versa). Solid lines indicate
that the this conversion rule is always the default; dashed lines mean that it
only sometimes applies (the rules and special cases are described below).
Review Comment:
I don't recall what the plan was regarding the png files here, but a few
comments on them:
- the unidirectional and bidirectional arrows are really effective for
simplifying the explanation here
- can we make the line width and arrow head sizes larger?
- the dashed lines seem to merge together and are hard to interpret
##########
r/vignettes/data_types.Rmd:
##########
@@ -0,0 +1,342 @@
+---
+title: "Data types"
+description: >
+ Learn about fundamental data types in Apache Arrow and how those
+ types are mapped onto corresponding data types in R
+output: rmarkdown::html_vignette
+---
+
+Arrow has a rich data type system that includes direct analogs of many R data
types, and many data types that do not have a counterpart in R. This article
describes the Arrow type system, compares it to R data types, and outlines the
default mappings used when data are transferred from Arrow to R. At the end of
the article there are two lookup tables: one describing the default "R to
Arrow" type mappings and the other describing the "Arrow to R" mappings.
+
+## Motivating example
+
+To illustrate the conversion that needs to take place, consider the
differences between the output when obtain we use `dplyr::glimpse()` to inspect
the `starwars` data in its original format -- as a data frame in R -- and the
output we obtain when we convert it to an Arrow Table first by calling
`arrow_table()`:
+
+```{r}
+library(dplyr, warn.conflicts = FALSE)
+library(arrow, warn.conflicts = FALSE)
+
+glimpse(starwars)
+glimpse(arrow_table(starwars))
+```
+
+The data represented are essentially the same, but the descriptions of the
data types for the columns have changed. For example:
+
+- `name` is labelled `<chr>` (character vector) in the data frame; it is
labelled `<string>` (a string type, also referred to as utf8 type) in the Arrow
Table
+- `height` is labelled `<int>` (integer vector) in the data frame; it is
labelled `<int32>` (32-bit signed integer) in the Arrow Table
+- `mass` is labelled `<dbl>` (numeric vector) in the data frame; it is
labelled `<double>` (64-bit floating point number) in the Arrow Table
+
+Some of these differences are purely cosmetic: integers in R are in fact
32-bit signed integers, so the underlying data types in Arrow and R are direct
analogs of one another. In other cases the differences are purely about the
implementation: Arrow and R have different ways to store a vector of strings,
but at a high level of abstraction the R character type and the Arrow string
type can be viewed as direct analogs. In some cases, however, there are no
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it
does not have an analog of POSIXlt; converselt, while R can represent 32 bit
signed integers, it does not have an equivalent of a 64 bit unsigned integer.
Review Comment:
```suggestion
Some of these differences are purely cosmetic: integers in R are in fact
32-bit signed integers, so the underlying data types in Arrow and R are direct
analogs of one another. In other cases the differences are purely about the
implementation: Arrow and R have different ways to store a vector of strings,
but at a high level of abstraction the R character type and the Arrow string
type can be viewed as direct analogs. In some cases, however, there are no
clear analogs: while Arrow has an analog of POSIXct (the timestamp type) it
does not have an analog of POSIXlt; conversely, while R can natively represent
32 bit signed integers, it does not have an equivalent of a 64 bit unsigned
integer.
```
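To illustrate why "natively" matters here, a short base R sketch (an illustration only, not part of the vignette) shows the limit of the built-in integer type:

```r
# Largest value an R integer can hold: 2^31 - 1
int_max <- .Machine$integer.max
int_max                               # 2147483647

# Integer arithmetic past that limit yields NA (with a warning)
overflowed <- suppressWarnings(int_max + 1L)
is.na(overflowed)                     # TRUE

# Larger whole numbers are only representable as doubles in base R
typeof(2^31)                          # "double"
```

Wider integers need an add-on such as `bit64`, which the vignette mentions later, so the qualifier keeps the sentence accurate.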