jonkeane commented on a change in pull request #9748: URL: https://github.com/apache/arrow/pull/9748#discussion_r596864728
########## File path: r/R/dataset-write.R ########## @@ -54,6 +54,40 @@ #' - `null_fallback`: character to be used in place of missing values (`NA` or #' `NULL`) when using Hive-style partitioning. See [hive_partition()]. #' @return The input `dataset`, invisibly +#' @examples +#' # We can partition by one more variables, say, cyl and gear in mtcars dataset Review comment: This is a personal preference, so you don't have to do it, but I think it's nice to have the comments just above each line of code explaining exactly what that code is doing in examples. So here start with `# We can write datasets partitioned by the values in a column (here: "cyl")` then have the first `write_dataset()` line. Then have another line `# We can also partition by the values in multiple columns.` and then have the second `write_dataset()` ########## File path: r/R/dataset-write.R ########## @@ -54,6 +54,40 @@ #' - `null_fallback`: character to be used in place of missing values (`NA` or #' `NULL`) when using Hive-style partitioning. See [hive_partition()]. #' @return The input `dataset`, invisibly +#' @examples +#' # We can partition by one more variables, say, cyl and gear in mtcars dataset +#' one_part_dir <- tempfile() +#' two_part_dir <- tempfile() +#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl") +#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", "gear")) Review comment: ```suggestion #' write_dataset(mtcars, one_part_dir, partitioning = "cyl") #' write_dataset(mtcars, two_part_dir, partitioning = c("cyl", "gear")) ``` We don't need to (and shouldn't) specify the format argument here, we changed that so the default is parquet, so we don't need to specify it here. ########## File path: r/R/dataset-write.R ########## @@ -54,6 +54,40 @@ #' - `null_fallback`: character to be used in place of missing values (`NA` or #' `NULL`) when using Hive-style partitioning. See [hive_partition()]. #' @return The input `dataset`, invisibly +#' @examples +#' # We can partition by one more variables, say, cyl and gear in mtcars dataset +#' one_part_dir <- tempfile() +#' two_part_dir <- tempfile() +#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl") +#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", "gear")) +#' +#' # See the contents of the directory we wrote to in order to demonstrate +#' # what partitioning does +#' list( +#' cyl_partioning = list.files(one_part_dir, pattern = "parquet", +#' recursive = TRUE), +#' cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet", +#' recursive = TRUE) +#' ) +#' +#' # We can do the same combining both arrow and dplyr +#' if(requireNamespace("dplyr", quietly = TRUE)) { +#' two_part_dir_2 <- tempfile() +#' two_part_dir_3 <- tempfile() +#' d <- mtcars %>% group_by(cyl, gear) +#' d %>% write_dataset(two_part_dir_2, "parquet") +#' +#' # Passing the additional hive_style option to see the difference in the +#' # output Review comment: ```suggestion #' # We can also turn off the Hive-style directory naming where the column name is included with the value for each directory with `hive_style = FALSE`. ``` ########## File path: r/R/dataset-write.R ########## @@ -54,6 +54,40 @@ #' - `null_fallback`: character to be used in place of missing values (`NA` or #' `NULL`) when using Hive-style partitioning. See [hive_partition()]. #' @return The input `dataset`, invisibly +#' @examples +#' # We can partition by one more variables, say, cyl and gear in mtcars dataset +#' one_part_dir <- tempfile() +#' two_part_dir <- tempfile() +#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl") +#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", "gear")) +#' +#' # See the contents of the directory we wrote to in order to demonstrate +#' # what partitioning does +#' list( +#' cyl_partioning = list.files(one_part_dir, pattern = "parquet", +#' recursive = TRUE), +#' cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet", +#' recursive = TRUE) +#' ) +#' +#' # We can do the same combining both arrow and dplyr +#' if(requireNamespace("dplyr", quietly = TRUE)) { +#' two_part_dir_2 <- tempfile() +#' two_part_dir_3 <- tempfile() +#' d <- mtcars %>% group_by(cyl, gear) +#' d %>% write_dataset(two_part_dir_2, "parquet") Review comment: ```suggestion #' d %>% write_dataset(two_part_dir_2) ``` ########## File path: r/R/dataset-write.R ########## @@ -54,6 +54,40 @@ #' - `null_fallback`: character to be used in place of missing values (`NA` or #' `NULL`) when using Hive-style partitioning. See [hive_partition()]. #' @return The input `dataset`, invisibly +#' @examples +#' # We can partition by one more variables, say, cyl and gear in mtcars dataset +#' one_part_dir <- tempfile() +#' two_part_dir <- tempfile() +#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl") +#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", "gear")) +#' +#' # See the contents of the directory we wrote to in order to demonstrate +#' # what partitioning does Review comment: I think it would be nice to describe what people should be looking for in these lists and how the folder structure is different between the two calls ########## File path: r/R/dataset-write.R ########## @@ -54,6 +54,40 @@ #' - `null_fallback`: character to be used in place of missing values (`NA` or #' `NULL`) when using Hive-style partitioning. See [hive_partition()]. #' @return The input `dataset`, invisibly +#' @examples +#' # We can partition by one more variables, say, cyl and gear in mtcars dataset +#' one_part_dir <- tempfile() +#' two_part_dir <- tempfile() +#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl") +#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", "gear")) +#' +#' # See the contents of the directory we wrote to in order to demonstrate +#' # what partitioning does +#' list( +#' cyl_partioning = list.files(one_part_dir, pattern = "parquet", +#' recursive = TRUE), +#' cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet", +#' recursive = TRUE) +#' ) +#' +#' # We can do the same combining both arrow and dplyr Review comment: We should say explicitly that this is the same as the second call above. This has been a point of confusion for a few people and are hoping this example helps clear that up. ########## File path: r/R/dataset-write.R ########## @@ -54,6 +54,40 @@ #' - `null_fallback`: character to be used in place of missing values (`NA` or #' `NULL`) when using Hive-style partitioning. See [hive_partition()]. #' @return The input `dataset`, invisibly +#' @examples +#' # We can partition by one more variables, say, cyl and gear in mtcars dataset +#' one_part_dir <- tempfile() +#' two_part_dir <- tempfile() +#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl") +#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", "gear")) +#' +#' # See the contents of the directory we wrote to in order to demonstrate +#' # what partitioning does +#' list( +#' cyl_partioning = list.files(one_part_dir, pattern = "parquet", +#' recursive = TRUE), +#' cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet", +#' recursive = TRUE) +#' ) +#' +#' # We can do the same combining both arrow and dplyr +#' if(requireNamespace("dplyr", quietly = TRUE)) { +#' two_part_dir_2 <- tempfile() +#' two_part_dir_3 <- tempfile() Review comment: I would move this tempfile creation down below to where it's needed so it doesn't confuse this block. ########## File path: r/R/dataset-write.R ########## @@ -54,6 +54,40 @@ #' - `null_fallback`: character to be used in place of missing values (`NA` or #' `NULL`) when using Hive-style partitioning. See [hive_partition()]. #' @return The input `dataset`, invisibly +#' @examples +#' # We can partition by one more variables, say, cyl and gear in mtcars dataset +#' one_part_dir <- tempfile() +#' two_part_dir <- tempfile() +#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl") +#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", "gear")) +#' +#' # See the contents of the directory we wrote to in order to demonstrate +#' # what partitioning does +#' list( +#' cyl_partioning = list.files(one_part_dir, pattern = "parquet", +#' recursive = TRUE), +#' cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet", +#' recursive = TRUE) +#' ) Review comment: ```suggestion #' list.files(one_part_dir, recursive = TRUE) #' list.files(two_part_dir, recursive = TRUE) ``` I think this can be even simpler and just be calls to `list.files()` that will print out the directory + file names which is what we want. ########## File path: r/R/dataset-write.R ########## @@ -54,6 +54,40 @@ #' - `null_fallback`: character to be used in place of missing values (`NA` or #' `NULL`) when using Hive-style partitioning. See [hive_partition()]. #' @return The input `dataset`, invisibly +#' @examples +#' # We can partition by one more variables, say, cyl and gear in mtcars dataset +#' one_part_dir <- tempfile() +#' two_part_dir <- tempfile() +#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl") +#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", "gear")) +#' +#' # See the contents of the directory we wrote to in order to demonstrate +#' # what partitioning does +#' list( +#' cyl_partioning = list.files(one_part_dir, pattern = "parquet", +#' recursive = TRUE), +#' cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet", +#' recursive = TRUE) +#' ) +#' +#' # We can do the same combining both arrow and dplyr +#' if(requireNamespace("dplyr", quietly = TRUE)) { +#' two_part_dir_2 <- tempfile() +#' two_part_dir_3 <- tempfile() +#' d <- mtcars %>% group_by(cyl, gear) +#' d %>% write_dataset(two_part_dir_2, "parquet") +#' +#' # Passing the additional hive_style option to see the difference in the +#' # output +#' d %>% write_dataset(two_part_dir_3, "parquet", hive_style = FALSE) +#' +#' list( +#' hive_true = list.files(two_part_dir_2, pattern = "parquet", +#' recursive = TRUE), +#' hive_false = list.files(two_part_dir_3, pattern = "parquet", +#' recursive = TRUE) +#' ) Review comment: ```suggestion #' list.files(two_part_dir_2, recursive = TRUE) #' list.files(two_part_dir_3, recursive = TRUE) ``` More simplification, and it would be good to add a comment about what the person is looking for (that two_part_dir_2 is the same as two_part_dir above, and that two_part_dir_3 doesn't have `cyl=` etc. ########## File path: r/R/dataset-write.R ########## @@ -54,6 +54,40 @@ #' - `null_fallback`: character to be used in place of missing values (`NA` or #' `NULL`) when using Hive-style partitioning. See [hive_partition()]. #' @return The input `dataset`, invisibly +#' @examples +#' # We can partition by one more variables, say, cyl and gear in mtcars dataset +#' one_part_dir <- tempfile() +#' two_part_dir <- tempfile() +#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl") +#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", "gear")) +#' +#' # See the contents of the directory we wrote to in order to demonstrate +#' # what partitioning does +#' list( +#' cyl_partioning = list.files(one_part_dir, pattern = "parquet", +#' recursive = TRUE), +#' cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet", +#' recursive = TRUE) +#' ) +#' +#' # We can do the same combining both arrow and dplyr +#' if(requireNamespace("dplyr", quietly = TRUE)) { +#' two_part_dir_2 <- tempfile() +#' two_part_dir_3 <- tempfile() +#' d <- mtcars %>% group_by(cyl, gear) +#' d %>% write_dataset(two_part_dir_2, "parquet") +#' +#' # Passing the additional hive_style option to see the difference in the +#' # output +#' d %>% write_dataset(two_part_dir_3, "parquet", hive_style = FALSE) Review comment: ```suggestion #' d %>% write_dataset(two_part_dir_3, hive_style = FALSE) ``` ########## File path: r/R/dataset-write.R ########## @@ -54,6 +54,40 @@ #' - `null_fallback`: character to be used in place of missing values (`NA` or #' `NULL`) when using Hive-style partitioning. See [hive_partition()]. #' @return The input `dataset`, invisibly +#' @examples +#' # We can partition by one more variables, say, cyl and gear in mtcars dataset +#' one_part_dir <- tempfile() +#' two_part_dir <- tempfile() +#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl") +#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", "gear")) +#' +#' # See the contents of the directory we wrote to in order to demonstrate +#' # what partitioning does +#' list( +#' cyl_partioning = list.files(one_part_dir, pattern = "parquet", +#' recursive = TRUE), +#' cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet", +#' recursive = TRUE) +#' ) +#' +#' # We can do the same combining both arrow and dplyr +#' if(requireNamespace("dplyr", quietly = TRUE)) { +#' two_part_dir_2 <- tempfile() +#' two_part_dir_3 <- tempfile() +#' d <- mtcars %>% group_by(cyl, gear) +#' d %>% write_dataset(two_part_dir_2, "parquet") +#' +#' # Passing the additional hive_style option to see the difference in the +#' # output Review comment: This might be outside of the scope of this example (since it concentrates on writing) but it might be nice to add in an example of how one would need to call `read_dataset()` when one has specified `hive_style=FALSE` and how that differs from reading a dataset that does have Hive-style directories. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org