[GitHub] [arrow] jonkeane commented on a change in pull request #9748: ARROW-11729 adding examples again

GitBox Thu, 18 Mar 2021 06:35:05 -0700


jonkeane commented on a change in pull request #9748:
URL: https://github.com/apache/arrow/pull/9748#discussion_r596864728




##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,40 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We can partition by one more variables, say, cyl and gear in mtcars 
dataset

Review comment:
       This is a personal preference, so you don't have to do it, but I think 
it's nice to have the comments just above each line of code explaining exactly 
what that code is doing in examples. So here start with `# We can write 
datasets partitioned by the values in a column (here: "cyl")` then have the 
first `write_dataset()` line. Then have another line `# We can also partition 
by the values in multiple columns.` and then have the second `write_dataset()`

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,40 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We can partition by one more variables, say, cyl and gear in mtcars 
dataset
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl")
+#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", 
"gear"))

Review comment:
       ```suggestion
   #' write_dataset(mtcars, one_part_dir, partitioning = "cyl")
   #' write_dataset(mtcars, two_part_dir, partitioning = c("cyl", "gear"))
   ```
   
   We don't need to (and shouldn't) specify the format argument here, we 
changed that so the default is parquet, so we don't need to specify it here.
   
   

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,40 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We can partition by one more variables, say, cyl and gear in mtcars 
dataset
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl")
+#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", 
"gear"))
+#'
+#' # See the contents of the directory we wrote to in order to demonstrate
+#' # what partitioning does
+#' list(
+#'  cyl_partioning = list.files(one_part_dir, pattern = "parquet",
+#'   recursive = TRUE),
+#'  cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet",
+#'   recursive = TRUE)
+#' )
+#'
+#' # We can do the same combining both arrow and dplyr
+#' if(requireNamespace("dplyr", quietly = TRUE)) {
+#'  two_part_dir_2 <- tempfile()
+#'  two_part_dir_3 <- tempfile()
+#'  d <- mtcars %>% group_by(cyl, gear)
+#'  d %>% write_dataset(two_part_dir_2, "parquet")
+#'
+#'  # Passing the additional hive_style option to see the difference in the
+#'  # output

Review comment:
       ```suggestion
   #'  # We can also turn off the Hive-style directory naming where the column 
name is included with the value for each directory with `hive_style = FALSE`.
   ```

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,40 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We can partition by one more variables, say, cyl and gear in mtcars 
dataset
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl")
+#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", 
"gear"))
+#'
+#' # See the contents of the directory we wrote to in order to demonstrate
+#' # what partitioning does
+#' list(
+#'  cyl_partioning = list.files(one_part_dir, pattern = "parquet",
+#'   recursive = TRUE),
+#'  cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet",
+#'   recursive = TRUE)
+#' )
+#'
+#' # We can do the same combining both arrow and dplyr
+#' if(requireNamespace("dplyr", quietly = TRUE)) {
+#'  two_part_dir_2 <- tempfile()
+#'  two_part_dir_3 <- tempfile()
+#'  d <- mtcars %>% group_by(cyl, gear)
+#'  d %>% write_dataset(two_part_dir_2, "parquet")

Review comment:
       ```suggestion
   #'  d %>% write_dataset(two_part_dir_2)
   ```

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,40 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We can partition by one more variables, say, cyl and gear in mtcars 
dataset
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl")
+#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", 
"gear"))
+#'
+#' # See the contents of the directory we wrote to in order to demonstrate
+#' # what partitioning does

Review comment:
       I think it would be nice to describe what people should be looking for 
in these lists and how the folder structure is different between the two calls

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,40 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We can partition by one more variables, say, cyl and gear in mtcars 
dataset
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl")
+#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", 
"gear"))
+#'
+#' # See the contents of the directory we wrote to in order to demonstrate
+#' # what partitioning does
+#' list(
+#'  cyl_partioning = list.files(one_part_dir, pattern = "parquet",
+#'   recursive = TRUE),
+#'  cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet",
+#'   recursive = TRUE)
+#' )
+#'
+#' # We can do the same combining both arrow and dplyr

Review comment:
       We should say explicitly that this is the same as the second call above. 
This has been a point of confusion for a few people and are hoping this example 
helps clear that up.

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,40 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We can partition by one more variables, say, cyl and gear in mtcars 
dataset
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl")
+#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", 
"gear"))
+#'
+#' # See the contents of the directory we wrote to in order to demonstrate
+#' # what partitioning does
+#' list(
+#'  cyl_partioning = list.files(one_part_dir, pattern = "parquet",
+#'   recursive = TRUE),
+#'  cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet",
+#'   recursive = TRUE)
+#' )
+#'
+#' # We can do the same combining both arrow and dplyr
+#' if(requireNamespace("dplyr", quietly = TRUE)) {
+#'  two_part_dir_2 <- tempfile()
+#'  two_part_dir_3 <- tempfile()

Review comment:
       I would move this tempfile creation down below to where it's needed so 
it doesn't confuse this block.

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,40 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We can partition by one more variables, say, cyl and gear in mtcars 
dataset
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl")
+#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", 
"gear"))
+#'
+#' # See the contents of the directory we wrote to in order to demonstrate
+#' # what partitioning does
+#' list(
+#'  cyl_partioning = list.files(one_part_dir, pattern = "parquet",
+#'   recursive = TRUE),
+#'  cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet",
+#'   recursive = TRUE)
+#' )

Review comment:
       ```suggestion
   #'  list.files(one_part_dir, recursive = TRUE)
   #'  list.files(two_part_dir, recursive = TRUE)
   ```
   
   I think this can be even simpler and just be calls to `list.files()` that 
will print out the directory + file names which is what we want.

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,40 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We can partition by one more variables, say, cyl and gear in mtcars 
dataset
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl")
+#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", 
"gear"))
+#'
+#' # See the contents of the directory we wrote to in order to demonstrate
+#' # what partitioning does
+#' list(
+#'  cyl_partioning = list.files(one_part_dir, pattern = "parquet",
+#'   recursive = TRUE),
+#'  cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet",
+#'   recursive = TRUE)
+#' )
+#'
+#' # We can do the same combining both arrow and dplyr
+#' if(requireNamespace("dplyr", quietly = TRUE)) {
+#'  two_part_dir_2 <- tempfile()
+#'  two_part_dir_3 <- tempfile()
+#'  d <- mtcars %>% group_by(cyl, gear)
+#'  d %>% write_dataset(two_part_dir_2, "parquet")
+#'
+#'  # Passing the additional hive_style option to see the difference in the
+#'  # output
+#'  d %>% write_dataset(two_part_dir_3, "parquet", hive_style = FALSE)
+#'
+#'  list(
+#'   hive_true = list.files(two_part_dir_2, pattern = "parquet",
+#'    recursive = TRUE),
+#'   hive_false = list.files(two_part_dir_3, pattern = "parquet",
+#'    recursive = TRUE)
+#'  )

Review comment:
       ```suggestion
   #' list.files(two_part_dir_2, recursive = TRUE)
   #' list.files(two_part_dir_3, recursive = TRUE)
   ```
   
   More simplification, and it would be good to add a comment about what the 
person is looking for (that two_part_dir_2 is the same as two_part_dir above, 
and that two_part_dir_3 doesn't have `cyl=` etc.

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,40 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We can partition by one more variables, say, cyl and gear in mtcars 
dataset
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl")
+#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", 
"gear"))
+#'
+#' # See the contents of the directory we wrote to in order to demonstrate
+#' # what partitioning does
+#' list(
+#'  cyl_partioning = list.files(one_part_dir, pattern = "parquet",
+#'   recursive = TRUE),
+#'  cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet",
+#'   recursive = TRUE)
+#' )
+#'
+#' # We can do the same combining both arrow and dplyr
+#' if(requireNamespace("dplyr", quietly = TRUE)) {
+#'  two_part_dir_2 <- tempfile()
+#'  two_part_dir_3 <- tempfile()
+#'  d <- mtcars %>% group_by(cyl, gear)
+#'  d %>% write_dataset(two_part_dir_2, "parquet")
+#'
+#'  # Passing the additional hive_style option to see the difference in the
+#'  # output
+#'  d %>% write_dataset(two_part_dir_3, "parquet", hive_style = FALSE)

Review comment:
       ```suggestion
   #'  d %>% write_dataset(two_part_dir_3, hive_style = FALSE)
   ```

##########
File path: r/R/dataset-write.R
##########
@@ -54,6 +54,40 @@
 #' - `null_fallback`: character to be used in place of missing values (`NA` or
 #' `NULL`) when using Hive-style partitioning. See [hive_partition()].
 #' @return The input `dataset`, invisibly
+#' @examples
+#' # We can partition by one more variables, say, cyl and gear in mtcars 
dataset
+#' one_part_dir <- tempfile()
+#' two_part_dir <- tempfile()
+#' write_dataset(mtcars, one_part_dir, "parquet", partitioning = "cyl")
+#' write_dataset(mtcars, two_part_dir, "parquet", partitioning = c("cyl", 
"gear"))
+#'
+#' # See the contents of the directory we wrote to in order to demonstrate
+#' # what partitioning does
+#' list(
+#'  cyl_partioning = list.files(one_part_dir, pattern = "parquet",
+#'   recursive = TRUE),
+#'  cyl_gear_partitioning = list.files(two_part_dir, pattern = "parquet",
+#'   recursive = TRUE)
+#' )
+#'
+#' # We can do the same combining both arrow and dplyr
+#' if(requireNamespace("dplyr", quietly = TRUE)) {
+#'  two_part_dir_2 <- tempfile()
+#'  two_part_dir_3 <- tempfile()
+#'  d <- mtcars %>% group_by(cyl, gear)
+#'  d %>% write_dataset(two_part_dir_2, "parquet")
+#'
+#'  # Passing the additional hive_style option to see the difference in the
+#'  # output

Review comment:
       This might be outside of the scope of this example (since it 
concentrates on writing) but it might be nice to add in an example of how one 
would need to call `read_dataset()` when one has specified `hive_style=FALSE` 
and how that differs from reading a dataset that does have Hive-style 
directories.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jonkeane commented on a change in pull request #9748: ARROW-11729 adding examples again

Reply via email to