thisisnic commented on a change in pull request #67:
URL: https://github.com/apache/arrow-cookbook/pull/67#discussion_r704304307
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion
of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure
interoperability with databases and data warehouse systems.
+
+## Specify data types when creating an Arrow table from an R object
+
+### Problem
+
+You want to manually specify Arrow data types when converting an object from a
data frame to an Arrow object.
+
+### Solution
+
+```{r, use_schema}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# create arrow Table containing data and schema
+share_data_arrow <- Table$create(share_data, schema = share_schema)
+
+share_data_arrow
+```
+```{r, test_use_schema, opts.label = "test"}
+test_that("use_schema works as expected", {
+ expect_s3_class(share_data_arrow, "Table")
+ expect_equal(share_data_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+
+## Specify data types when reading in files
+
+### Problem
+
+You want to manually specify Arrow data types when reading in files.
+
+### Solution
+
+```{r, use_schema_dataset}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# write dataset to disk
+write_dataset(share_data, path = "shares")
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# read the dataset in, using the schema
+share_dataset_arrow <- open_dataset("shares", schema = share_schema)
+
+share_dataset_arrow
+```
+```{r, test_use_schema_dataset, opts.label = "test"}
+test_that("use_schema_dataset works as expected", {
+ expect_s3_class(share_dataset_arrow, "Dataset")
+ expect_equal(share_dataset_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+```{r, include=FALSE}
+unlink("shares", recursive = TRUE)
+```
+
+### Discussion
+
+When native R data types are converted to Arrow data types, there is a default
+mapping between R type and Arrow types, as shown in the table below.
+
+#### R data type to Arrow data type mapping
+
+| R type | Arrow type |
+|--------------------------|------------|
+| logical | boolean |
+| integer | int32 |
+| double ("numeric") | float64^7^ |
+| character | utf8^1^ |
+| factor | dictionary |
+| raw | uint8 |
+| Date | date32 |
+| POSIXct | timestamp |
+| POSIXlt | struct |
+| data.frame | struct |
+| list^2^ | list |
+| bit64::integer64 | int64 |
+| difftime | time32 |
+| vctrs::vctrs_unspecified | null |
+
+^1^: If the character vector exceeds 2GB of strings, it will be converted to a
+`large_utf8` Arrow type
+
+^2^: Only lists where all elements are the same type are able to be translated
+to Arrow list type (which is a "list of" some type).
+
+The data types created via default mapping from R to Arrow are not the only
ones
+which exist, and alternative Arrow data types may compatible with each R data
+type. The compatible data types are shown in the table below.
+
+#### Arrow data type to R data type mapping
+
+| Arrow type | R type |
+|-------------------|------------------------------|
+| boolean | logical |
+| int8 | integer |
+| int16 | integer |
+| int32 | integer |
+| int64 | integer^3^ |
+| uint8 | integer |
+| uint16 | integer |
+| uint32 | integer^3^ |
+| uint64 | integer^3^ |
+| float16 | - |
+| float32 | double |
+| float64 ^7^ | double |
+| utf8 | character |
+| binary | arrow_binary ^5^ |
+| fixed_size_binary | arrow_fixed_size_binary ^5^ |
+| date32 | Date |
+| date64 | POSIXct |
+| time32 | hms::difftime |
+| time64 | hms::difftime |
+| timestamp | POSIXct |
+| duration | - |
+| decimal | double |
+| dictionary | factor^4^ |
+| list | arrow_list ^6^ |
+| fixed_size_list | arrow_fixed_size_list ^6^ |
+| struct | data.frame |
+| null | vctrs::vctrs_unspecified |
+| map | - |
+| union | - |
+| large_utf8 | character |
+| large_binary | arrow_large_binary ^5^ |
+| large_list | arrow_large_list ^6^ |
+
+^3^: These integer types may contain values that exceed the range of R's
+`integer` type (32-bit signed integer). When they do, `uint32` and `uint64`
are
+converted to `double` ("numeric") and `int64` is converted to
+`bit64::integer64`. This conversion can be disabled (so that `int64` always
+yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast =
FALSE)`.
+
+^4^: Due to the limitation of R `factor`s, Arrow `dictionary` values are
coerced
+to string when translated to R if they are not already strings.
+
+^5^: `arrow*_binary` classes are implemented as lists of raw vectors.
+
+^6^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of`
+with a `ptype` attribute set to what an empty Array of the value type converts
to.
+
+^7^: `float64` and `double` are the same concept and data type in Arrow C++;
+however, only `float64()` is used in arrow as the function `double()` already
exists in base R
+
+
+## Combine and harmonize schemas
Review comment:
Good thinking
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]