thisisnic commented on a change in pull request #67: URL: https://github.com/apache/arrow-cookbook/pull/67#discussion_r713154891
########## File path: r/content/specify_data_types_and_schemas.Rmd ########## @@ -0,0 +1,279 @@ +# Defining Data Types + +As discussed in previous chapters, Arrow automatically infers the most +appropriate data type when reading in data or converting R objects to Arrow +objects. However, you might want to manually tell Arrow which data types to +use, for example, to ensure interoperability with databases and data warehouse +systems. This chapter includes recipes for: + +* changing the data types of existing Arrow objects +* defining data types during the process of creating Arrow objects + +A table showing the default mappings between R and Arrow data types can be found +in [R data type to Arrow data type mappings](#r2arrow). + +A table containing Arrow data types, and their R equivalents can be found in +[Arrow data type to R data type mapping](#arrow2r). + +## Update the data type of an existing Arrow Array + +You want to change the data type of an existing Arrow Array. + +### Solution + +```{r, cast_array} +# Create an Array to cast +integer_arr <- Array$create(1:5) + +# Cast to an unsigned int8 type +uint_arr <- integer_arr$cast(target_type = uint8()) + +uint_arr +``` + +```{r, test_cast_array, opts.label = "test"} +test_that("cast_array works as expected", { + expect_equal( + uint_arr$type, + uint8() + ) +}) +``` + +### Discussion + +There are some data types which are not compatible with each other. Errors will +occur if you try to cast between incompatible data types. + +```{r, incompat, eval = FALSE} +int_arr <- Array$create(1:5) +int_arr$cast(target_type = binary()) +``` + +```{r} +## Error: NotImplemented: Unsupported cast from int32 to binary using function cast_binary +``` + +```{r, test_incompat, opts.label = "test"} +test_that("test_incompat works as expected", { + expect_error( + int_arr$cast(target_type = binary()) + ) +}) +``` + +## Update the data type of a field in an existing Arrow Table + +You want to change the type of one or more fields in an existing Arrow Table. + +### Solution + +```{r, cast_table} +# Set up a tibble to use in this example +oscars <- tibble::tibble( + actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), + num_awards = c(4, 3, 3) +) + +# Convert tibble to an Arrow table +oscars_arrow <- Table$create(oscars) + +# The default mapping from numeric column "num_awards" is to a double +oscars_arrow + +# Set up schema with "num_awards" as integer +oscars_schema <- schema(actor = string(), num_awards = int16()) + +# Cast to an int16 +oscars_arrow_int <- oscars_arrow$cast(target_schema = oscars_schema) + +oscars_arrow_int +``` + +```{r, test_cast_table, opts.label = "test"} +test_that("cast_table works as expected", { + expect_equal( + oscars_arrow_int$schema, + schema(actor = string(), num_awards = int16()) + ) +}) +``` + +### Discussion {#no-compat-type} + +There are some Arrow data types which do not have any R equivalent. Attempting +to cast to these data types or using a schema which contains them will result in +an error. + +```{r, float_16_conversion, error=TRUE, eval=FALSE} +oscars <- tibble::tibble( + actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), + num_awards = c(4, 3, 3) +) + +# Set up schema with "num_awards" as float16 which doesn't have an R equivalent +oscars_schema <- schema(actor = string(), num_awards = float16()) + +Table$create(oscars, schema = oscars_schema) + +``` + +```{r} +## Error: Invalid: Cannot convert to Half Float +``` + +```{r, test_float_16_conversion, opts.label = "test"} +test_that("float_16_conversion works as expected", { + expect_error(Table$create(oscars, schema = schema(actor = string(), num_awards = float16()))) +}) +``` + +## Specify data types when creating an Arrow table from an R object + +You want to manually specify Arrow data types when converting an object from a data frame to an Arrow object. + +### Solution + +```{r, use_schema} +# Set up a tibble to use in this example +oscars <- tibble::tibble( + actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), + num_awards = c(4, 3, 3) +) + +# Set up schema with "num_awards" as integer +oscars_schema <- schema(actor = string(), num_awards = int16()) + +# create arrow Table containing data and schema +oscars_data_arrow <- Table$create(oscars, schema = oscars_schema) + +oscars_data_arrow +``` +```{r, test_use_schema, opts.label = "test"} +test_that("use_schema works as expected", { + expect_s3_class(oscars_data_arrow, "Table") + expect_equal( + oscars_data_arrow$schema, + oscars_schema + ) +}) +``` + +## Specify data types when reading in files + +You want to manually specify Arrow data types when reading in files. + +### Solution + +```{r, use_schema_dataset} +# Set up a tibble to use in this example +oscars <- tibble::tibble( + actor = c("Katharine Hepburn", "Meryl Streep", "Jack Nicholson"), + num_awards = c(4, 3, 3) +) + +# write dataset to disk +write_dataset(oscars, path = "oscars_data") + +# Set up schema with "num_awards" as integer +oscars_schema <- schema(actor = string(), num_awards = int16()) + +# read the dataset in, using the schema instead of inferring the type automatically +oscars_dataset_arrow <- open_dataset("oscars_data", schema = oscars_schema) + +oscars_dataset_arrow +``` +```{r, test_use_schema_dataset, opts.label = "test"} +test_that("use_schema_dataset works as expected", { + expect_s3_class(oscars_dataset_arrow, "Dataset") + expect_equal(oscars_dataset_arrow$schema, + oscars_schema + ) +}) +``` +```{r, include=FALSE} +unlink("oscars_data", recursive = TRUE) +``` + +## Combine and unify schemas + +You have a dataset split across multiple sources for which you have separate +schemas that you want to combine. Review comment: I've updated it now to relegate `unify_schemas()` to the discussion section rather than the main recipe; how does it look now? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org