[GitHub] [arrow] nealrichardson commented on a change in pull request #9561: ARROW-11649: [R] Add support for null_fallback to R

GitBox Wed, 24 Feb 2021 13:16:35 -0800


nealrichardson commented on a change in pull request #9561:
URL: https://github.com/apache/arrow/pull/9561#discussion_r582308004




##########
File path: r/R/dataset-partition.R
##########
@@ -72,19 +77,22 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' Because fields are named in the path segments, order of fields passed to
 #' `hive_partition()` does not matter.
 #' @param ... named list of [data types][data-type], passed to [schema()]
+#' @param null_fallback character to be used in place of `NA` and `NULL` values

Review comment:
       ```suggestion
   #' @param null_fallback character to be used in place of missing values 
(`NA` or `NULL`)
   ```

##########
File path: r/R/dataset-partition.R
##########
@@ -25,12 +25,17 @@
 #' `DirectoryPartitioning` describes how to interpret raw path segments, in
 #' order. For example, `schema(year = int16(), month = int8())` would define
 #' partitions for file paths like "2019/01/file.parquet",
-#' "2019/02/file.parquet", etc.
+#' "2019/02/file.parquet", etc. In this scheme `NULL` values will be skipped.
+#' In the previous example, if the month was `NULL`, the files would be placed
+#' in 2019/file.parquet. An error will be raised if an outer directory is
+#' `NULL` and an inner directory is not.
 #'
 #' `HivePartitioning` is for Hive-style partitioning, which embeds field
 #' names and values in path segments, such as
 #' "/year=2019/month=2/data.parquet". Because fields are named in the path
-#' segments, order does not matter.
+#' segments, order does not matter. This partitioning scheme allows `NULL`
+#' values. They will be replaced by a configurable `null_fallback` which
+#' defaults to `__HIVE_DEFAULT_PARTITION__`.

Review comment:
       ```suggestion
   #' defaults to the string `"__HIVE_DEFAULT_PARTITION__"`.
   ```
   this?

##########
File path: r/R/dataset-partition.R
##########
@@ -25,12 +25,17 @@
 #' `DirectoryPartitioning` describes how to interpret raw path segments, in
 #' order. For example, `schema(year = int16(), month = int8())` would define
 #' partitions for file paths like "2019/01/file.parquet",
-#' "2019/02/file.parquet", etc.
+#' "2019/02/file.parquet", etc. In this scheme `NULL` values will be skipped.
+#' In the previous example, if the month was `NULL`, the files would be placed

Review comment:
       While partitioning specification is for both reading and writing, these 
doc additions seem implicitly focused on writing datasets. Should this be made 
explicit? Is there a converse statement to be made about how nulls are 
interpreted on read?

##########
File path: r/R/dataset-write.R
##########
@@ -41,6 +41,9 @@
 #' will yield `"part-0.feather", ...`.
 #' @param hive_style logical: write partition segments as Hive-style
 #' (`key1=value1/key2=value2/file.ext`) or as just bare values. Default is 
`TRUE`.
+#' @param hive_null_fallback If writing partition segments as Hive-style

Review comment:
       I would either (1) call this `null_fallback` so it matches the name of 
the arg elsewhere or (2) omit it from the signature and pull it out of `...` in 
the function below.
   
   Is there null fallback behavior for non-hive?

##########
File path: r/R/dataset-partition.R
##########
@@ -72,19 +77,22 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' Because fields are named in the path segments, order of fields passed to
 #' `hive_partition()` does not matter.
 #' @param ... named list of [data types][data-type], passed to [schema()]
+#' @param null_fallback character to be used in place of `NA` and `NULL` values
+#' in columns that are being partitioned by. (default:
+#' `"__HIVE_DEFAULT_PARTITION__"`)
 #' @return A [HivePartitioning][Partitioning], or a `HivePartitioningFactory` 
if
 #' calling `hive_partition()` with no arguments.
 #' @examples
 #' \donttest{
 #' hive_partition(year = int16(), month = int8())
 #' }
 #' @export
-hive_partition <- function(...) {
+hive_partition <- function(..., null_fallback = "__HIVE_DEFAULT_PARTITION__") {

Review comment:
       Do other systems let you set an env var to govern this, or is it always 
in code?

##########
File path: r/R/dataset-partition.R
##########
@@ -72,19 +77,22 @@ HivePartitioning$create <- dataset___HivePartitioning
 #' Because fields are named in the path segments, order of fields passed to
 #' `hive_partition()` does not matter.
 #' @param ... named list of [data types][data-type], passed to [schema()]
+#' @param null_fallback character to be used in place of `NA` and `NULL` values
+#' in columns that are being partitioned by. (default:
+#' `"__HIVE_DEFAULT_PARTITION__"`)

Review comment:
       Should it be explained somewhere where this sentinel string comes from? 
I.e. we didn't make this up, we're following the convention that XYZ follow.

##########
File path: r/R/dataset-write.R
##########
@@ -41,6 +41,9 @@
 #' will yield `"part-0.feather", ...`.
 #' @param hive_style logical: write partition segments as Hive-style
 #' (`key1=value1/key2=value2/file.ext`) or as just bare values. Default is 
`TRUE`.
+#' @param hive_null_fallback If writing partition segments as Hive-style
+#' directories then this value will be used in place of any null value in a
+#' patition column.

Review comment:
       ```suggestion
   #' partition column.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] nealrichardson commented on a change in pull request #9561: ARROW-11649: [R] Add support for null_fallback to R

Reply via email to