This is an automated email from the ASF dual-hosted git repository.
thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git
The following commit(s) were added to refs/heads/main by this push:
new 449534d [R] - Schemas recipes (#67)
449534d is described below
commit 449534dae3ac2cffa97f20de07b6b86739d37be0
Author: Nic <[email protected]>
AuthorDate: Fri Oct 1 17:27:35 2021 +0000
[R] - Schemas recipes (#67)
* Add the creating schemas recipe
* Add in content on combining schemas, and specifying schemas when reading in files
* Delete unnecessary files, and stop showing test chunks
* Rephrase the bit about converting from R to Arrow
* Remove extraneous word
* Also mention reading in data
* Extra clarity
* missing word
* Add appendices
* Add section on casting, remove "problem" headings, update dataset, move tables to appendix, show example of incompatible data types
* Link between incompatible data types and appendix table
* Add content on combining schemas
* Rephrase
* Add context
* Reorder items in table
* Add recipe for schemas where match or don't match
* Rephrase
* Update code which causes an error to not run
* Relegate unify_schemas to discussion
* Fix rebase
* Remove appendix and link to vignette instead
* Remove examples of everything that could go wrong, as not relevant
* Fix failing test
---
r/content/_bookdown.yml | 12 +-
r/content/reading_and_writing_data.Rmd | 11 ++
r/content/specify_data_types_and_schemas.Rmd | 208 +++++++++++++++++++++
r/content/unpublished/cute_datasets.Rmd | 10 +
.../unpublished/specify_data_types_and_schemas.Rmd | 10 -
5 files changed, 236 insertions(+), 15 deletions(-)
diff --git a/r/content/_bookdown.yml b/r/content/_bookdown.yml
index 280474a..51b9ef2 100644
--- a/r/content/_bookdown.yml
+++ b/r/content/_bookdown.yml
@@ -4,8 +4,10 @@ new_session: FALSE
clean: ["_book/*"]
output_dir: _book
edit: https://github.com/apache/arrow-cookbook/edit/main/r/content/%s
-rmd_files: ["index.Rmd", "reading_and_writing_data.Rmd", "creating_arrow_objects.Rmd", "manipulating_data.Rmd"]
-
-# This is the full list
-# rmd_files: ["index.Rmd", "configure_arrow.Rmd", "work_with_data_in_different_formats.Rmd",
-# "work_with_compressed_or_partitioned_data.Rmd", "create_arrow_objects_from_r.Rmd", "specify_data_types_and_schemas.Rmd", "manipulate_data.Rmd", "work_with_arrow_in_both_python_and_r.Rmd"]
+rmd_files: [
+ "index.Rmd",
+ "reading_and_writing_data.Rmd",
+ "creating_arrow_objects.Rmd",
+ "specify_data_types_and_schemas.Rmd",
+ "manipulating_data.Rmd"
+]
diff --git a/r/content/reading_and_writing_data.Rmd b/r/content/reading_and_writing_data.Rmd
index 53671ac..47274ad 100644
--- a/r/content/reading_and_writing_data.Rmd
+++ b/r/content/reading_and_writing_data.Rmd
@@ -4,6 +4,7 @@ This chapter contains recipes related to reading and writing data using Apache
Arrow. When reading files into R using Apache Arrow, you can choose to read in
your file as either a data frame or as an Arrow Table object.
+
There are a number of circumstances in which you may want to read in the data as an Arrow Table:
* your dataset is large and if you load it into memory, it may lead to performance issues
* you want faster performance from your `dplyr` queries
@@ -348,3 +349,13 @@ test_that("open_dataset chunk works as expected", {
unlink("airquality_partitioned", recursive = TRUE)
```
+```{r, include = FALSE}
+# cleanup
+unlink("my_table.arrow")
+unlink("my_table.arrows")
+unlink("cars.csv")
+unlink("my_table.feather")
+unlink("my_table.parquet")
+unlink("dist_time.parquet")
+unlink("airquality_partitioned", recursive = TRUE)
+```
diff --git a/r/content/specify_data_types_and_schemas.Rmd b/r/content/specify_data_types_and_schemas.Rmd
new file mode 100644
index 0000000..7e0b7d4
--- /dev/null
+++ b/r/content/specify_data_types_and_schemas.Rmd
@@ -0,0 +1,208 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically infers the most
+appropriate data type when reading in data or converting R objects to Arrow
+objects. However, you might want to manually tell Arrow which data types to
+use, for example, to ensure interoperability with databases and data warehouse
+systems. This chapter includes recipes for:
+
+* changing the data types of existing Arrow objects
+* defining data types during the process of creating Arrow objects
+
+A table showing the default mappings between R and Arrow data types can be found
+in [R data type to Arrow data type mappings](https://arrow.apache.org/docs/r/articles/arrow.html#r-to-arrow).
+
+A table containing Arrow data types and their R equivalents can be found in
+[Arrow data type to R data type mapping](https://arrow.apache.org/docs/r/articles/arrow.html#arrow-to-r).
+
+## Update the data type of an existing Arrow Array
+
+You want to change the data type of an existing Arrow Array.
+
+### Solution
+
+```{r, cast_array}
+# Create an Array to cast
+integer_arr <- Array$create(1:5)
+
+# Cast to an unsigned int8 type
+uint_arr <- integer_arr$cast(target_type = uint8())
+
+uint_arr
+```
+
+```{r, test_cast_array, opts.label = "test"}
+test_that("cast_array works as expected", {
+ expect_equal(
+ uint_arr$type,
+ uint8()
+ )
+})
+```
+
+### Discussion
+
+There are some data types which are not compatible with each other. Errors will
+occur if you try to cast between incompatible data types.
+
+```{r, incompat, eval = FALSE}
+int_arr <- Array$create(1:5)
+int_arr$cast(target_type = binary())
+```
+
+```{r}
+## Error: NotImplemented: Unsupported cast from int32 to binary using function cast_binary
+```
+
+```{r, test_incompat, opts.label = "test"}
+test_that("test_incompat works as expected", {
+ expect_error(
+ int_arr$cast(target_type = binary())
+ )
+})
+```
+
+## Update the data type of a field in an existing Arrow Table
+
+You want to change the type of one or more fields in an existing Arrow Table.
+
+### Solution
+
+```{r, cast_table}
+# Set up a tibble to use in this example
+oscars <- tibble::tibble(
+ actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
+ num_awards = c(4, 3, 3)
+)
+
+# Convert tibble to an Arrow table
+oscars_arrow <- Table$create(oscars)
+
+# The default mapping from numeric column "num_awards" is to a double
+oscars_arrow
+
+# Set up schema with "num_awards" as integer
+oscars_schema <- schema(actor = string(), num_awards = int16())
+
+# Cast to an int16
+oscars_arrow_int <- oscars_arrow$cast(target_schema = oscars_schema)
+
+oscars_arrow_int
+```
+
+```{r, test_cast_table, opts.label = "test"}
+test_that("cast_table works as expected", {
+ expect_equal(
+ oscars_arrow_int$schema,
+ schema(actor = string(), num_awards = int16())
+ )
+})
+```
+
+### Discussion {#no-compat-type}
+
+There are some Arrow data types which do not have any R equivalent. Attempting
+to cast to these data types or using a schema which contains them will result in
+an error.
+
+```{r, float_16_conversion, error=TRUE, eval=FALSE}
+# Set up a tibble to use in this example
+oscars <- tibble::tibble(
+ actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
+ num_awards = c(4, 3, 3)
+)
+
+# Convert tibble to an Arrow table
+oscars_arrow <- Table$create(oscars)
+
+# Set up schema with "num_awards" as float16 which doesn't have an R equivalent
+oscars_schema_invalid <- schema(actor = string(), num_awards = float16())
+
+# The default mapping from numeric column "num_awards" is to a double
+oscars_arrow$cast(target_schema = oscars_schema_invalid)
+```
+
+```{r}
+## Error: NotImplemented: Unsupported cast from double to halffloat using function cast_half_float
+```
+
+```{r, test_float_16_conversion, opts.label = "test"}
+test_that("float_16_conversion works as expected", {
+
+ oscars_schema_invalid <- schema(actor = string(), num_awards = float16())
+
+ expect_error(
+ oscars_arrow$cast(target_schema = oscars_schema_invalid),
+ "NotImplemented: Unsupported cast from double to halffloat using function cast_half_float"
+ )
+})
+```
+
+## Specify data types when creating an Arrow table from an R object
+
+You want to manually specify Arrow data types when converting an object from a
+data frame to an Arrow object.
+
+### Solution
+
+```{r, use_schema}
+# Set up a tibble to use in this example
+oscars <- tibble::tibble(
+ actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
+ num_awards = c(4, 3, 3)
+)
+
+# Set up schema with "num_awards" as integer
+oscars_schema <- schema(actor = string(), num_awards = int16())
+
+# Create an Arrow Table containing the data and schema
+oscars_data_arrow <- Table$create(oscars, schema = oscars_schema)
+
+oscars_data_arrow
+```
+```{r, test_use_schema, opts.label = "test"}
+test_that("use_schema works as expected", {
+ expect_s3_class(oscars_data_arrow, "Table")
+ expect_equal(
+ oscars_data_arrow$schema,
+ oscars_schema
+ )
+})
+```
+
+## Specify data types when reading in files
+
+You want to manually specify Arrow data types when reading in files.
+
+### Solution
+
+```{r, use_schema_dataset}
+# Set up a tibble to use in this example
+oscars <- tibble::tibble(
+ actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
+ num_awards = c(4, 3, 3)
+)
+
+# write dataset to disk
+write_dataset(oscars, path = "oscars_data")
+
+# Set up schema with "num_awards" as integer
+oscars_schema <- schema(actor = string(), num_awards = int16())
+
+# read the dataset in, using the schema instead of inferring the type automatically
+oscars_dataset_arrow <- open_dataset("oscars_data", schema = oscars_schema)
+
+oscars_dataset_arrow
+```
+```{r, test_use_schema_dataset, opts.label = "test"}
+test_that("use_schema_dataset works as expected", {
+ expect_s3_class(oscars_dataset_arrow, "Dataset")
+ expect_equal(oscars_dataset_arrow$schema,
+ oscars_schema
+ )
+})
+```
+```{r, include=FALSE}
+unlink("oscars_data", recursive = TRUE)
+```
+
diff --git a/r/content/unpublished/cute_datasets.Rmd b/r/content/unpublished/cute_datasets.Rmd
new file mode 100644
index 0000000..2d4fc21
--- /dev/null
+++ b/r/content/unpublished/cute_datasets.Rmd
@@ -0,0 +1,10 @@
+oscars <- tibble::tibble(
+ actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"),
+ num_awards = c(4, 3, 3)
+)
+
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
\ No newline at end of file
diff --git a/r/content/unpublished/specify_data_types_and_schemas.Rmd b/r/content/unpublished/specify_data_types_and_schemas.Rmd
deleted file mode 100644
index 82ca577..0000000
--- a/r/content/unpublished/specify_data_types_and_schemas.Rmd
+++ /dev/null
@@ -1,10 +0,0 @@
-# Specify data types and schemas
-(intro - why this is important, i.e. Exercise fine control over column types for seamless interoperability with databases and data warehouse systems)
-
-## Data types
-
-## Create a schema
-
-## Read a schema
-
-## Combine and harmonize schemas