This is an automated email from the ASF dual-hosted git repository.
thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git
The following commit(s) were added to refs/heads/main by this push:
new 449534d [R] - Schemas recipes (#67)
449534d is described below
commit 449534dae3ac2cffa97f20de07b6b86739d37be0
Author: Nic <[email protected]>
AuthorDate: Fri Oct 1 17:27:35 2021 +0000
[R] - Schemas recipes (#67)
* Add the creating schemas recipe
* Add in content on combining schemas, and specifying schemas when reading in files
* Delete unnecessary files, and stop showing test chunks
* Rephrase the bit about converting from R to Arrow
* Remove extraneous word
* Also mention reading in data
* Extra clarity
* missing word
* Add appendices
* Add section on casting, remove "problem" headings, update dataset, move tables to appendix, show example of incompatible data types
* Link between incompatible data types and appendix table
* Add content on combining schemas
* Rephrase
* Add context
* Reorder items in table
* Add recipe for schemas where match or don't match
* Rephrase
* Update code which causes an error to not run
* Relegate unify_schemas to discussion
* Fix rebase
* Remove appendix and link to vignette instead
* Remove examples of everything that could go wrong, as not relevant
* Fix failing test
---
r/content/_bookdown.yml | 12 +-
r/content/reading_and_writing_data.Rmd | 11 ++
r/content/specify_data_types_and_schemas.Rmd | 208 +++++++++++++++++++++
r/content/unpublished/cute_datasets.Rmd | 10 +
.../unpublished/specify_data_types_and_schemas.Rmd | 10 -
5 files changed, 236 insertions(+), 15 deletions(-)
diff --git a/r/content/_bookdown.yml b/r/content/_bookdown.yml
index 280474a..51b9ef2 100644
--- a/r/content/_bookdown.yml
+++ b/r/content/_bookdown.yml
@@ -4,8 +4,10 @@ new_session: FALSE
clean: ["_book/*"]
output_dir: _book
edit: https://github.com/apache/arrow-cookbook/edit/main/r/content/%s
-rmd_files: ["index.Rmd", "reading_and_writing_data.Rmd", "creating_arrow_objects.Rmd", "manipulating_data.Rmd"]
-
-# This is the full list
-# rmd_files: ["index.Rmd", "configure_arrow.Rmd", "work_with_data_in_different_formats.Rmd",
-# "work_with_compressed_or_partitioned_data.Rmd", "create_arrow_objects_from_r.Rmd", "specify_data_types_and_schemas.Rmd", "manipulate_data.Rmd", "work_with_arrow_in_both_python_and_r.Rmd"]
+rmd_files: [
+ "index.Rmd",
+ "reading_and_writing_data.Rmd",
+ "creating_arrow_objects.Rmd",
+ "specify_data_types_and_schemas.Rmd",
+ "manipulating_data.Rmd"
+]
diff --git a/r/content/reading_and_writing_data.Rmd b/r/content/reading_and_writing_data.Rmd
index 53671ac..47274ad 100644
--- a/r/content/reading_and_writing_data.Rmd
+++ b/r/content/reading_and_writing_data.Rmd
@@ -4,6 +4,7 @@ This chapter contains recipes related to reading and writing data using Apache
Arrow. When reading files into R using Apache Arrow, you can choose to read in
your file as either a data frame or as an Arrow Table object.
+
There are a number of circumstances in which you may want to read in the data as an Arrow Table:
* your dataset is large and if you load it into memory, it may lead to performance issues
* you want faster performance from your `dplyr` queries
@@ -348,3 +349,13 @@ test_that("open_dataset chunk works as expected", {
unlink("airquality_partitioned", recursive = TRUE)
```
+```{r, include = FALSE}
+# cleanup
+unlink("my_table.arrow")
+unlink("my_table.arrows")
+unlink("cars.csv")
+unlink("my_table.feather")
+unlink("my_table.parquet")
+unlink("dist_time.parquet")
+unlink("airquality_partitioned", recursive = TRUE)
+```
diff --git a/r/content/specify_data_types_and_schemas.Rmd b/r/content/specify_data_types_and_schemas.Rmd
new file mode 100644
index 0000000..7e0b7d4
--- /dev/null
+++ b/r/content/specify_data_types_and_schemas.Rmd
@@ -0,0 +1,208 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically infers the most
+appropriate data type when reading in data or converting R objects to Arrow
+objects. However, you might want to manually tell Arrow which data types to
+use, for example, to ensure interoperability with databases and data warehouse
+systems. This chapter includes recipes for:
+
+* changing the data types of existing Arrow objects
+* defining data types during the process of creating Arrow objects
+
+A table showing the default mappings between R and Arrow data types can be found
+in [R data type to Arrow data type mappings](https://arrow.apache.org/docs/r/articles/arrow.html#r-to-arrow).
+
+A table containing Arrow data types and their R equivalents can be found in
+[Arrow data type to R data type mapping](https://arrow.apache.org/docs/r/articles/arrow.html#arrow-to-r).
+
+## Update the data type of an existing Arrow Array
+
+You want to change the data type of an existing Arrow Array.
+
+### Solution
+
+```{r, cast_array}
+# Create an Array to cast
+integer_arr <- Array$create(1:5)
+
+# Cast to an unsigned int8 type
+uint_arr <- integer_arr$cast(target_type = uint8())
+
+uint_arr
+```
+
+```{r, test_cast_array, opts.label = "test"}
+test_that("cast_array works as expected", {
+ expect_equal(
+ uint_arr$type,
+ uint8()
+ )
+})
+```
+
+### Discussion
+
+There are some data types which are not compatible with each other. Errors will
+occur if you try to cast between incompatible data types.
+
+```{r, incompat, eval = FALSE}
+int_arr <- Array$create(1:5)
+int_arr$cast(target_type = binary())
+```
+
+```{r}
+## Error: NotImplemented: Unsupported cast from int32 to binary using function cast_binary
+```
+
+```{r, test_incompat, opts.label = "test"}
+test_that("test_incompat works as expected", {
+ expect_error(
+ int_arr$cast(target_type = binary())
+ )
+})
+```
+
+## Update the data type of a field in an existing Arrow Table
+
+You want to change the type of one or more fields in an existing Arrow Table.
+
+### Solution
+
+```{r, cast_table}
+# Set up a tibble to use in this example
+oscars <- tibble::tibble(
+ actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
+ num_awards = c(4, 3, 3)
+)
+
+# Convert tibble to an Arrow table
+oscars_arrow <- Table$create(oscars)
+
+# The default mapping from numeric column "num_awards" is to a double
+oscars_arrow
+
+# Set up schema with "num_awards" as integer
+oscars_schema <- schema(actor = string(), num_awards = int16())
+
+# Cast to an int16
+oscars_arrow_int <- oscars_arrow$cast(target_schema = oscars_schema)
+
+oscars_arrow_int
+```
+
+```{r, test_cast_table, opts.label = "test"}
+test_that("cast_table works as expected", {
+ expect_equal(
+ oscars_arrow_int$schema,
+ schema(actor = string(), num_awards = int16())
+ )
+})
+```
+
+### Discussion {#no-compat-type}
+
+There are some Arrow data types which do not have any R equivalent. Attempting
+to cast to these data types or using a schema which contains them will result in
+an error.
+
+```{r, float_16_conversion, error=TRUE, eval=FALSE}
+# Set up a tibble to use in this example
+oscars <- tibble::tibble(
+ actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
+ num_awards = c(4, 3, 3)
+)
+
+# Convert tibble to an Arrow table
+oscars_arrow <- Table$create(oscars)
+
+# Set up schema with "num_awards" as float16 which doesn't have an R equivalent
+oscars_schema_invalid <- schema(actor = string(), num_awards = float16())
+
+# The default mapping from numeric column "num_awards" is to a double
+oscars_arrow$cast(target_schema = oscars_schema_invalid)
+```
+
+```{r}
+## Error: NotImplemented: Unsupported cast from double to halffloat using function cast_half_float
+```
+
+```{r, test_float_16_conversion, opts.label = "test"}
+test_that("float_16_conversion works as expected", {
+
+ oscars_schema_invalid <- schema(actor = string(), num_awards = float16())
+
+ expect_error(
+ oscars_arrow$cast(target_schema = oscars_schema_invalid),
+ "NotImplemented: Unsupported cast from double to halffloat using function cast_half_float"
+ )
+})
+```
+
+## Specify data types when creating an Arrow table from an R object
+
+You want to manually specify Arrow data types when converting an object from a
+data frame to an Arrow object.
+
+### Solution
+
+```{r, use_schema}
+# Set up a tibble to use in this example
+oscars <- tibble::tibble(
+ actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
+ num_awards = c(4, 3, 3)
+)
+
+# Set up schema with "num_awards" as integer
+oscars_schema <- schema(actor = string(), num_awards = int16())
+
+# Create an Arrow Table containing the data and schema
+oscars_data_arrow <- Table$create(oscars, schema = oscars_schema)
+
+oscars_data_arrow
+```
+```{r, test_use_schema, opts.label = "test"}
+test_that("use_schema works as expected", {
+ expect_s3_class(oscars_data_arrow, "Table")
+ expect_equal(
+ oscars_data_arrow$schema,
+ oscars_schema
+ )
+})
+```
+
+## Specify data types when reading in files
+
+You want to manually specify Arrow data types when reading in files.
+
+### Solution
+
+```{r, use_schema_dataset}
+# Set up a tibble to use in this example
+oscars <- tibble::tibble(
+ actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
+ num_awards = c(4, 3, 3)
+)
+
+# write dataset to disk
+write_dataset(oscars, path = "oscars_data")
+
+# Set up schema with "num_awards" as integer
+oscars_schema <- schema(actor = string(), num_awards = int16())
+
+# read the dataset in, using the schema instead of inferring the type automatically
+oscars_dataset_arrow <- open_dataset("oscars_data", schema = oscars_schema)
+
+oscars_dataset_arrow
+```
+```{r, test_use_schema_dataset, opts.label = "test"}
+test_that("use_schema_dataset works as expected", {
+ expect_s3_class(oscars_dataset_arrow, "Dataset")
+ expect_equal(oscars_dataset_arrow$schema,
+ oscars_schema
+ )
+})
+```
+```{r, include=FALSE}
+unlink("oscars_data", recursive = TRUE)
+```
+
diff --git a/r/content/unpublished/cute_datasets.Rmd b/r/content/unpublished/cute_datasets.Rmd
new file mode 100644
index 0000000..2d4fc21
--- /dev/null
+++ b/r/content/unpublished/cute_datasets.Rmd
@@ -0,0 +1,10 @@
+oscars <- tibble::tibble(
+ actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
+ num_awards = c(4, 3, 3)
+)
+
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
\ No newline at end of file
diff --git a/r/content/unpublished/specify_data_types_and_schemas.Rmd b/r/content/unpublished/specify_data_types_and_schemas.Rmd
deleted file mode 100644
index 82ca577..0000000
--- a/r/content/unpublished/specify_data_types_and_schemas.Rmd
+++ /dev/null
@@ -1,10 +0,0 @@
-# Specify data types and schemas
-(intro - why this is important, i.e. Exercise fine control over column types for seamless interoperability with databases and data warehouse systems)
-
-## Data types
-
-## Create a schema
-
-## Read a schema
-
-## Combine and harmonize schemas