westonpace commented on a change in pull request #67:
URL: https://github.com/apache/arrow-cookbook/pull/67#discussion_r703745422
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure interoperability with databases and data warehouse systems.
+
+## Specify data types when creating an Arrow table from an R object
+
+### Problem
+
+You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object.
+
+### Solution
+
+```{r, use_schema}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# create arrow Table containing data and schema
+share_data_arrow <- Table$create(share_data, schema = share_schema)
+
+share_data_arrow
+```
+```{r, test_use_schema, opts.label = "test"}
+test_that("use_schema works as expected", {
+ expect_s3_class(share_data_arrow, "Table")
+ expect_equal(share_data_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+
+## Specify data types when reading in files
+
+### Problem
+
+You want to manually specify Arrow data types when reading in files.
+
+### Solution
+
+```{r, use_schema_dataset}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# write dataset to disk
+write_dataset(share_data, path = "shares")
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# read the dataset in, using the schema
+share_dataset_arrow <- open_dataset("shares", schema = share_schema)
+
+share_dataset_arrow
+```
+```{r, test_use_schema_dataset, opts.label = "test"}
+test_that("use_schema_dataset works as expected", {
+ expect_s3_class(share_dataset_arrow, "Dataset")
+ expect_equal(share_dataset_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+```{r, include=FALSE}
+unlink("shares", recursive = TRUE)
+```
+
+### Discussion
+
+When native R data types are converted to Arrow data types, there is a default
+mapping between R type and Arrow types, as shown in the table below.
+
+#### R data type to Arrow data type mapping
+
+| R type | Arrow type |
+|--------------------------|------------|
+| logical | boolean |
+| integer | int32 |
+| double ("numeric") | float64^7^ |
+| character | utf8^1^ |
+| factor | dictionary |
+| raw | uint8 |
+| Date | date32 |
+| POSIXct | timestamp |
+| POSIXlt | struct |
+| data.frame | struct |
+| list^2^ | list |
+| bit64::integer64 | int64 |
+| difftime | time32 |
+| vctrs::vctrs_unspecified | null |
+
+^1^: If the character vector exceeds 2GB of strings, it will be converted to a
+`large_utf8` Arrow type
+
+^2^: Only lists where all elements are the same type are able to be translated
+to Arrow list type (which is a "list of" some type).
Review comment:
       It's a little odd that you call out `^1^` and `^2^` here, but the table
above also has `^7^`, which you don't describe until further down. Maybe change
the `7` to `3` and describe it here, or move all the footnotes to the lower
section.
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure interoperability with databases and data warehouse systems.
+
+## Specify data types when creating an Arrow table from an R object
+
+### Problem
+
+You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object.
+
+### Solution
+
+```{r, use_schema}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# create arrow Table containing data and schema
+share_data_arrow <- Table$create(share_data, schema = share_schema)
+
+share_data_arrow
+```
+```{r, test_use_schema, opts.label = "test"}
+test_that("use_schema works as expected", {
+ expect_s3_class(share_data_arrow, "Table")
+ expect_equal(share_data_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+
+## Specify data types when reading in files
+
+### Problem
+
+You want to manually specify Arrow data types when reading in files.
+
+### Solution
+
+```{r, use_schema_dataset}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# write dataset to disk
+write_dataset(share_data, path = "shares")
+
+# define field names and types
+share_schema <- schema(
Review comment:
       Nit: I know you described this above, but I have to wonder whether someone
looking at just this example might think that a schema is always required when
reading or writing data. Maybe update the comment below to something like...
```
# read the dataset in, using the schema instead of inferring the type
```
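   Or, if it's easier, a short contrast snippet right after the recipe would make
the point that the schema is optional. This is just an untested sketch, assuming
the "shares" dataset written above still exists on disk:
   ```r
   # untested sketch: read the same "shares" dataset back without a schema,
   # letting open_dataset() infer the column types instead
   library(arrow)

   share_dataset_inferred <- open_dataset("shares")
   share_dataset_inferred$schema
   ```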
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure interoperability with databases and data warehouse systems.
+
+## Specify data types when creating an Arrow table from an R object
+
+### Problem
Review comment:
This `Problem`/`Solution` formatting is inconsistent with the other
cookbook chapters.
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure interoperability with databases and data warehouse systems.
Review comment:
       Nit: This sentence doesn't quite work for me. Even if you choose the data
type, Arrow will still need to convert from a native R type to an Arrow type.
Maybe something that uses the word "inference" or "picking the best data type".
For example:
```
Data in Arrow can be represented by a number of different data types. When
importing data from R the default behavior will pick the Arrow data type that
is the safest match for the incoming R type. However, you might want...
```
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure interoperability with databases and data warehouse systems.
+
+## Specify data types when creating an Arrow table from an R object
+
+### Problem
+
+You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object.
+
+### Solution
+
+```{r, use_schema}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# create arrow Table containing data and schema
+share_data_arrow <- Table$create(share_data, schema = share_schema)
+
+share_data_arrow
+```
+```{r, test_use_schema, opts.label = "test"}
+test_that("use_schema works as expected", {
+ expect_s3_class(share_data_arrow, "Table")
+ expect_equal(share_data_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+
+## Specify data types when reading in files
+
+### Problem
+
+You want to manually specify Arrow data types when reading in files.
+
+### Solution
+
+```{r, use_schema_dataset}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# write dataset to disk
+write_dataset(share_data, path = "shares")
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# read the dataset in, using the schema
+share_dataset_arrow <- open_dataset("shares", schema = share_schema)
+
+share_dataset_arrow
+```
+```{r, test_use_schema_dataset, opts.label = "test"}
+test_that("use_schema_dataset works as expected", {
+ expect_s3_class(share_dataset_arrow, "Dataset")
+ expect_equal(share_dataset_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+```{r, include=FALSE}
+unlink("shares", recursive = TRUE)
+```
+
+### Discussion
+
+When native R data types are converted to Arrow data types, there is a default
+mapping between R type and Arrow types, as shown in the table below.
+
+#### R data type to Arrow data type mapping
+
+| R type | Arrow type |
+|--------------------------|------------|
+| logical | boolean |
+| integer | int32 |
+| double ("numeric") | float64^7^ |
+| character | utf8^1^ |
+| factor | dictionary |
+| raw | uint8 |
+| Date | date32 |
+| POSIXct | timestamp |
+| POSIXlt | struct |
+| data.frame | struct |
+| list^2^ | list |
+| bit64::integer64 | int64 |
+| difftime | time32 |
+| vctrs::vctrs_unspecified | null |
+
+^1^: If the character vector exceeds 2GB of strings, it will be converted to a
+`large_utf8` Arrow type
+
+^2^: Only lists where all elements are the same type are able to be translated
+to Arrow list type (which is a "list of" some type).
+
+The data types created via default mapping from R to Arrow are not the only ones
+which exist, and alternative Arrow data types may compatible with each R data
+type. The compatible data types are shown in the table below.
+
+#### Arrow data type to R data type mapping
+
+| Arrow type | R type |
+|-------------------|------------------------------|
+| boolean | logical |
+| int8 | integer |
+| int16 | integer |
+| int32 | integer |
+| int64 | integer^3^ |
+| uint8 | integer |
+| uint16 | integer |
+| uint32 | integer^3^ |
+| uint64 | integer^3^ |
+| float16 | - |
+| float32 | double |
+| float64 ^7^ | double |
+| utf8 | character |
Review comment:
Nit: Would it perhaps be better to group `utf8` and `large_utf8`
together in this table?
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure interoperability with databases and data warehouse systems.
+
+## Specify data types when creating an Arrow table from an R object
+
+### Problem
+
+You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object.
+
+### Solution
+
+```{r, use_schema}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# create arrow Table containing data and schema
+share_data_arrow <- Table$create(share_data, schema = share_schema)
+
+share_data_arrow
+```
+```{r, test_use_schema, opts.label = "test"}
+test_that("use_schema works as expected", {
+ expect_s3_class(share_data_arrow, "Table")
+ expect_equal(share_data_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+
+## Specify data types when reading in files
+
+### Problem
+
+You want to manually specify Arrow data types when reading in files.
+
+### Solution
+
+```{r, use_schema_dataset}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# write dataset to disk
+write_dataset(share_data, path = "shares")
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# read the dataset in, using the schema
+share_dataset_arrow <- open_dataset("shares", schema = share_schema)
+
+share_dataset_arrow
+```
+```{r, test_use_schema_dataset, opts.label = "test"}
+test_that("use_schema_dataset works as expected", {
+ expect_s3_class(share_dataset_arrow, "Dataset")
+ expect_equal(share_dataset_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+```{r, include=FALSE}
+unlink("shares", recursive = TRUE)
+```
+
+### Discussion
+
+When native R data types are converted to Arrow data types, there is a default
+mapping between R type and Arrow types, as shown in the table below.
+
+#### R data type to Arrow data type mapping
+
+| R type | Arrow type |
+|--------------------------|------------|
+| logical | boolean |
+| integer | int32 |
+| double ("numeric") | float64^7^ |
+| character | utf8^1^ |
+| factor | dictionary |
+| raw | uint8 |
+| Date | date32 |
+| POSIXct | timestamp |
+| POSIXlt | struct |
+| data.frame | struct |
+| list^2^ | list |
+| bit64::integer64 | int64 |
+| difftime | time32 |
+| vctrs::vctrs_unspecified | null |
+
+^1^: If the character vector exceeds 2GB of strings, it will be converted to a
+`large_utf8` Arrow type
+
+^2^: Only lists where all elements are the same type are able to be translated
+to Arrow list type (which is a "list of" some type).
+
+The data types created via default mapping from R to Arrow are not the only ones
+which exist, and alternative Arrow data types may compatible with each R data
+type. The compatible data types are shown in the table below.
+
+#### Arrow data type to R data type mapping
+
+| Arrow type | R type |
+|-------------------|------------------------------|
+| boolean | logical |
+| int8 | integer |
+| int16 | integer |
+| int32 | integer |
+| int64 | integer^3^ |
+| uint8 | integer |
+| uint16 | integer |
+| uint32 | integer^3^ |
+| uint64 | integer^3^ |
+| float16 | - |
+| float32 | double |
+| float64 ^7^ | double |
+| utf8 | character |
+| binary | arrow_binary ^5^ |
+| fixed_size_binary | arrow_fixed_size_binary ^5^ |
+| date32 | Date |
+| date64 | POSIXct |
+| time32 | hms::difftime |
+| time64 | hms::difftime |
+| timestamp | POSIXct |
+| duration | - |
+| decimal | double |
+| dictionary | factor^4^ |
+| list | arrow_list ^6^ |
+| fixed_size_list | arrow_fixed_size_list ^6^ |
+| struct | data.frame |
+| null | vctrs::vctrs_unspecified |
+| map | - |
+| union | - |
+| large_utf8 | character |
+| large_binary | arrow_large_binary ^5^ |
+| large_list | arrow_large_list ^6^ |
+
+^3^: These integer types may contain values that exceed the range of R's
+`integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are
+converted to `double` ("numeric") and `int64` is converted to
+`bit64::integer64`. This conversion can be disabled (so that `int64` always
+yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
+
+^4^: Due to the limitation of R `factor`s, Arrow `dictionary` values are coerced
+to string when translated to R if they are not already strings.
+
+^5^: `arrow*_binary` classes are implemented as lists of raw vectors.
+
+^6^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of`
+with a `ptype` attribute set to what an empty Array of the value type converts to.
+
+^7^: `float64` and `double` are the same concept and data type in Arrow C++;
+however, only `float64()` is used in arrow as the function `double()` already exists in base R
+
+
+## Combine and harmonize schemas
+
+### Problem
+
+You have a dataset split across multiple sources for which you have separate
+schemas that you want to combine.
+
+### Solution
+
+You can use `unify_schemas()` to combine multiple schemas into a single schemas.
+
+```{r, combine_schemas}
+# create first schema to combine
+country_code_schema <- schema(country = utf8(), code = utf8())
+
+# create second schema to combine
+country_phone_schema <- schema(country = utf8(), phone_prefix = int8())
+
+# combine schemas
+combined_schemas <- unify_schemas(country_code_schema, country_phone_schema)
Review comment:
This begs the question, "What happens if the types are different but the
names are the same?"
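       For example, a small untested probe along these lines would surface whatever
the current behavior is for a name/type clash — I'm genuinely not sure whether it
unifies to a common type or errors, so the recipe should say which:
   ```r
   # untested sketch: two schemas that agree on a field name but not its type
   library(arrow)

   schema_a <- schema(country = utf8(), code = utf8())
   schema_b <- schema(country = utf8(), code = int32())

   # wrapped in tryCatch since this may either pick a common type or fail outright
   tryCatch(
     unify_schemas(schema_a, schema_b),
     error = function(e) conditionMessage(e)
   )
   ```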
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure interoperability with databases and data warehouse systems.
+
+## Specify data types when creating an Arrow table from an R object
+
+### Problem
+
+You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object.
+
+### Solution
+
+```{r, use_schema}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# create arrow Table containing data and schema
+share_data_arrow <- Table$create(share_data, schema = share_schema)
+
+share_data_arrow
+```
+```{r, test_use_schema, opts.label = "test"}
+test_that("use_schema works as expected", {
+ expect_s3_class(share_data_arrow, "Table")
+ expect_equal(share_data_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+
+## Specify data types when reading in files
+
+### Problem
+
+You want to manually specify Arrow data types when reading in files.
+
+### Solution
+
+```{r, use_schema_dataset}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# write dataset to disk
+write_dataset(share_data, path = "shares")
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# read the dataset in, using the schema
+share_dataset_arrow <- open_dataset("shares", schema = share_schema)
+
+share_dataset_arrow
+```
+```{r, test_use_schema_dataset, opts.label = "test"}
+test_that("use_schema_dataset works as expected", {
+ expect_s3_class(share_dataset_arrow, "Dataset")
+ expect_equal(share_dataset_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+```{r, include=FALSE}
+unlink("shares", recursive = TRUE)
+```
+
+### Discussion
+
+When native R data types are converted to Arrow data types, there is a default
+mapping between R type and Arrow types, as shown in the table below.
+
+#### R data type to Arrow data type mapping
+
+| R type | Arrow type |
+|--------------------------|------------|
+| logical | boolean |
+| integer | int32 |
+| double ("numeric") | float64^7^ |
+| character | utf8^1^ |
+| factor | dictionary |
+| raw | uint8 |
+| Date | date32 |
+| POSIXct | timestamp |
+| POSIXlt | struct |
+| data.frame | struct |
+| list^2^ | list |
+| bit64::integer64 | int64 |
+| difftime | time32 |
+| vctrs::vctrs_unspecified | null |
+
+^1^: If the character vector exceeds 2GB of strings, it will be converted to a
+`large_utf8` Arrow type
+
+^2^: Only lists where all elements are the same type are able to be translated
+to Arrow list type (which is a "list of" some type).
+
+The data types created via default mapping from R to Arrow are not the only ones
+which exist, and alternative Arrow data types may compatible with each R data
+type. The compatible data types are shown in the table below.
+
+#### Arrow data type to R data type mapping
+
+| Arrow type | R type |
+|-------------------|------------------------------|
+| boolean | logical |
+| int8 | integer |
+| int16 | integer |
+| int32 | integer |
+| int64 | integer^3^ |
+| uint8 | integer |
+| uint16 | integer |
+| uint32 | integer^3^ |
+| uint64 | integer^3^ |
+| float16 | - |
Review comment:
What does `-` mean? Does this mean the Arrow data type has no
corresponding R type? What happens in that case? Is it a runtime error? Can
you expand on this below?
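       If it helps, even a tiny untested probe like this would show readers what
actually happens for a type with no R equivalent, whatever that turns out to be:
   ```r
   # untested sketch: try to build a float16 array from R and convert it back,
   # catching any error since this path may simply be unsupported
   library(arrow)

   tryCatch(
     as.vector(Array$create(c(1, 2, 3), type = float16())),
     error = function(e) conditionMessage(e)
   )
   ```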
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure interoperability with databases and data warehouse systems.
Review comment:
       Although even that example falls short as a description of the entire
chapter, because we should probably point out that this inference happens both
when importing from R and when importing from files (or from any non-Arrow
source, really).
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure interoperability with databases and data warehouse systems.
+
+## Specify data types when creating an Arrow table from an R object
+
+### Problem
+
+You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object.
+
+### Solution
+
+```{r, use_schema}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# create arrow Table containing data and schema
+share_data_arrow <- Table$create(share_data, schema = share_schema)
+
+share_data_arrow
+```
+```{r, test_use_schema, opts.label = "test"}
+test_that("use_schema works as expected", {
+ expect_s3_class(share_data_arrow, "Table")
+ expect_equal(share_data_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+
+## Specify data types when reading in files
+
+### Problem
+
+You want to manually specify Arrow data types when reading in files.
+
+### Solution
+
+```{r, use_schema_dataset}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# write dataset to disk
+write_dataset(share_data, path = "shares")
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# read the dataset in, using the schema
+share_dataset_arrow <- open_dataset("shares", schema = share_schema)
+
+share_dataset_arrow
+```
+```{r, test_use_schema_dataset, opts.label = "test"}
+test_that("use_schema_dataset works as expected", {
+ expect_s3_class(share_dataset_arrow, "Dataset")
+ expect_equal(share_dataset_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+```{r, include=FALSE}
+unlink("shares", recursive = TRUE)
+```
+
+### Discussion
+
+When native R data types are converted to Arrow data types, there is a default
+mapping between R type and Arrow types, as shown in the table below.
+
+#### R data type to Arrow data type mapping
+
+| R type | Arrow type |
+|--------------------------|------------|
+| logical | boolean |
+| integer | int32 |
+| double ("numeric") | float64^7^ |
+| character | utf8^1^ |
+| factor | dictionary |
+| raw | uint8 |
+| Date | date32 |
+| POSIXct | timestamp |
+| POSIXlt | struct |
+| data.frame | struct |
+| list^2^ | list |
+| bit64::integer64 | int64 |
+| difftime | time32 |
+| vctrs::vctrs_unspecified | null |
+
+^1^: If the character vector exceeds 2GB of strings, it will be converted to a
+`large_utf8` Arrow type
+
+^2^: Only lists where all elements are the same type are able to be translated
+to Arrow list type (which is a "list of" some type).
+
+The data types created via default mapping from R to Arrow are not the only ones
+which exist, and alternative Arrow data types may compatible with each R data
+type. The compatible data types are shown in the table below.
+
+#### Arrow data type to R data type mapping
+
+| Arrow type | R type |
+|-------------------|------------------------------|
+| boolean | logical |
+| int8 | integer |
+| int16 | integer |
+| int32 | integer |
+| int64 | integer^3^ |
+| uint8 | integer |
+| uint16 | integer |
+| uint32 | integer^3^ |
+| uint64 | integer^3^ |
+| float16 | - |
+| float32 | double |
+| float64 ^7^ | double |
+| utf8 | character |
+| binary | arrow_binary ^5^ |
+| fixed_size_binary | arrow_fixed_size_binary ^5^ |
+| date32 | Date |
+| date64 | POSIXct |
+| time32 | hms::difftime |
+| time64 | hms::difftime |
+| timestamp | POSIXct |
+| duration | - |
+| decimal | double |
Review comment:
       If I were a naive user looking at this table, I'd probably be wondering,
"What is decimal and why does it also correspond to double?" I'm not sure that
has to be answered here, though.
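       If you did want to touch on it, a two-line untested sketch of the round trip
would probably do — per the table above, decimal values come back to R as doubles
since R has no exact decimal type:
   ```r
   # untested sketch: cast doubles to a decimal type, then convert back to R;
   # per the mapping table, the decimal values are returned as plain doubles
   library(arrow)

   prices <- Array$create(c(19.99, 4.50))$cast(decimal(precision = 10, scale = 2))
   prices$type
   as.vector(prices)
   ```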
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure interoperability with databases and data warehouse systems.
+
+## Specify data types when creating an Arrow table from an R object
+
+### Problem
+
+You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object.
+
+### Solution
+
+```{r, use_schema}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# create arrow Table containing data and schema
+share_data_arrow <- Table$create(share_data, schema = share_schema)
+
+share_data_arrow
+```
+```{r, test_use_schema, opts.label = "test"}
+test_that("use_schema works as expected", {
+ expect_s3_class(share_data_arrow, "Table")
+ expect_equal(share_data_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+
+## Specify data types when reading in files
+
+### Problem
+
+You want to manually specify Arrow data types when reading in files.
+
+### Solution
+
+```{r, use_schema_dataset}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# write dataset to disk
+write_dataset(share_data, path = "shares")
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# read the dataset in, using the schema
+share_dataset_arrow <- open_dataset("shares", schema = share_schema)
+
+share_dataset_arrow
+```
+```{r, test_use_schema_dataset, opts.label = "test"}
+test_that("use_schema_dataset works as expected", {
+ expect_s3_class(share_dataset_arrow, "Dataset")
+ expect_equal(share_dataset_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+```{r, include=FALSE}
+unlink("shares", recursive = TRUE)
+```
+
+### Discussion
+
+When native R data types are converted to Arrow data types, there is a default
+mapping between R type and Arrow types, as shown in the table below.
+
+#### R data type to Arrow data type mapping
+
+| R type | Arrow type |
+|--------------------------|------------|
+| logical | boolean |
+| integer | int32 |
+| double ("numeric") | float64^7^ |
+| character | utf8^1^ |
+| factor | dictionary |
+| raw | uint8 |
+| Date | date32 |
+| POSIXct | timestamp |
+| POSIXlt | struct |
+| data.frame | struct |
+| list^2^ | list |
+| bit64::integer64 | int64 |
+| difftime | time32 |
+| vctrs::vctrs_unspecified | null |
+
+^1^: If the character vector exceeds 2GB of strings, it will be converted to a
+`large_utf8` Arrow type
+
+^2^: Only lists where all elements are the same type are able to be translated
+to Arrow list type (which is a "list of" some type).
+
+The data types created via default mapping from R to Arrow are not the only ones
+which exist, and alternative Arrow data types may compatible with each R data
+type. The compatible data types are shown in the table below.
+
+#### Arrow data type to R data type mapping
+
+| Arrow type | R type |
+|-------------------|------------------------------|
+| boolean | logical |
+| int8 | integer |
+| int16 | integer |
+| int32 | integer |
+| int64 | integer^3^ |
+| uint8 | integer |
+| uint16 | integer |
+| uint32 | integer^3^ |
+| uint64 | integer^3^ |
+| float16 | - |
+| float32 | double |
+| float64 ^7^ | double |
+| utf8 | character |
+| binary | arrow_binary ^5^ |
+| fixed_size_binary | arrow_fixed_size_binary ^5^ |
+| date32 | Date |
+| date64 | POSIXct |
+| time32 | hms::difftime |
+| time64 | hms::difftime |
+| timestamp | POSIXct |
+| duration | - |
+| decimal | double |
+| dictionary | factor^4^ |
+| list | arrow_list ^6^ |
+| fixed_size_list | arrow_fixed_size_list ^6^ |
+| struct | data.frame |
+| null | vctrs::vctrs_unspecified |
+| map | - |
+| union | - |
+| large_utf8 | character |
+| large_binary | arrow_large_binary ^5^ |
+| large_list | arrow_large_list ^6^ |
+
+^3^: These integer types may contain values that exceed the range of R's
+`integer` type (32-bit signed integer). When they do, `uint32` and `uint64` are
+converted to `double` ("numeric") and `int64` is converted to
+`bit64::integer64`. This conversion can be disabled (so that `int64` always
+yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
+
+^4^: Due to the limitation of R `factor`s, Arrow `dictionary` values are coerced
+to string when translated to R if they are not already strings.
+
+^5^: `arrow*_binary` classes are implemented as lists of raw vectors.
+
+^6^: `arrow*_list` classes are implemented as subclasses of `vctrs_list_of`
+with a `ptype` attribute set to what an empty Array of the value type converts to.
+
+^7^: `float64` and `double` are the same concept and data type in Arrow C++;
+however, only `float64()` is used in arrow as the function `double()` already exists in base R
+
+
+## Combine and harmonize schemas
Review comment:
```suggestion
## Combine and unify schemas
```
   This line is prose, so "harmonize" as an alias for "unify" could work, but I
fear a more technically minded reader might read more into it than they should
and expect some formal definition of harmonizing schemas.
##########
File path: r/content/specify_data_types_and_schemas.Rmd
##########
@@ -0,0 +1,205 @@
+# Defining Data Types
+
+As discussed in previous chapters, Arrow automatically handles the conversion of objects from native R data types to Arrow data types.
+However, you might want to manually define data types, for example, to ensure interoperability with databases and data warehouse systems.
+
+## Specify data types when creating an Arrow table from an R object
+
+### Problem
+
+You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object.
+
+### Solution
+
+```{r, use_schema}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# create arrow Table containing data and schema
+share_data_arrow <- Table$create(share_data, schema = share_schema)
+
+share_data_arrow
+```
+```{r, test_use_schema, opts.label = "test"}
+test_that("use_schema works as expected", {
+ expect_s3_class(share_data_arrow, "Table")
+ expect_equal(share_data_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+
+## Specify data types when reading in files
+
+### Problem
+
+You want to manually specify Arrow data types when reading in files.
+
+### Solution
+
+```{r, use_schema_dataset}
+# create a data frame
+share_data <- tibble::tibble(
+ company = c("AMZN", "GOOG", "BKNG", "TSLA"),
+ price = c(3463.12, 2884.38, 2300.46, 732.39),
+ date = rep(as.Date("2021-09-02"), 4)
+)
+
+# write dataset to disk
+write_dataset(share_data, path = "shares")
+
+# define field names and types
+share_schema <- schema(
+ company = utf8(),
+ price = float32(),
+ date = date64()
+)
+
+# read the dataset in, using the schema
+share_dataset_arrow <- open_dataset("shares", schema = share_schema)
+
+share_dataset_arrow
+```
+```{r, test_use_schema_dataset, opts.label = "test"}
+test_that("use_schema_dataset works as expected", {
+ expect_s3_class(share_dataset_arrow, "Dataset")
+ expect_equal(share_dataset_arrow$schema,
+ schema(company = utf8(), price = float32(), date = date64())
+ )
+})
+```
+```{r, include=FALSE}
+unlink("shares", recursive = TRUE)
+```
+
+### Discussion
+
+When native R data types are converted to Arrow data types, there is a default
+mapping between R type and Arrow types, as shown in the table below.
+
+#### R data type to Arrow data type mapping
+
+| R type | Arrow type |
+|--------------------------|------------|
+| logical | boolean |
+| integer | int32 |
+| double ("numeric") | float64^7^ |
+| character | utf8^1^ |
+| factor | dictionary |
+| raw | uint8 |
+| Date | date32 |
+| POSIXct | timestamp |
+| POSIXlt | struct |
+| data.frame | struct |
+| list^2^ | list |
+| bit64::integer64 | int64 |
+| difftime | time32 |
+| vctrs::vctrs_unspecified | null |
+
+^1^: If the character vector exceeds 2GB of strings, it will be converted to a
+`large_utf8` Arrow type
+
+^2^: Only lists where all elements are the same type are able to be translated
+to Arrow list type (which is a "list of" some type).
+
+The data types created via default mapping from R to Arrow are not the only ones
+which exist, and alternative Arrow data types may compatible with each R data
+type. The compatible data types are shown in the table below.
Review comment:
This sentence doesn't read quite right. "and alternative Arrow data
types may *be* compatible..." perhaps?