[GitHub] [arrow] AlenkaF commented on a change in pull request #12133: ARROW-10485: [R] Accept partitioning in open_dataset when file paths are hive-style

GitBox Thu, 13 Jan 2022 01:02:26 -0800


AlenkaF commented on a change in pull request #12133:
URL: https://github.com/apache/arrow/pull/12133#discussion_r783756457




##########
File path: r/R/dataset-factory.R
##########
@@ -60,16 +61,63 @@ DatasetFactory$create <- function(x,
     return(FileSystemDatasetFactory$create(path_and_fs$fs, NULL, path_and_fs$path, format))
   }
 
-  if (!is.null(partitioning)) {
-    if (inherits(partitioning, "Schema")) {
-      partitioning <- DirectoryPartitioning$create(partitioning)
-    } else if (is.character(partitioning)) {
-      # These are the column/field names, and we should autodetect their types
-      partitioning <- DirectoryPartitioningFactory$create(partitioning)
+  # Handle partitioning arg in cases where it is "character" or "Schema"
+  if (!is.null(partitioning) && !inherits(partitioning, c("Partitioning", "PartitioningFactory"))) {
+    if (!is_false(hive_style)) {
+      # Default is NA, which means check to see if the paths could be hive_style
+      hive_factory <- HivePartitioningFactory$create()
+      paths <- path_and_fs$fs$ls(
+        path_and_fs$path,
+        allow_not_found = FALSE,
+        recursive = TRUE
+      )
+      hive_schema <- hive_factory$Inspect(paths)
+      # This is length-0 if there are no hive segments
+      if (is.na(hive_style)) {
+        hive_style <- length(hive_schema) > 0
+      }
+    }
+
+    if (hive_style) {
+      if (is.character(partitioning)) {
+        # These are not needed, the user probably provided them because they
+        # thought they needed to. Just make sure they aren't invalid.

Review comment:
I think it is great to make this check and help the user at the start,
before it gets complicated to search for the cause of an error. The same goes
for line 93 onward.

@jorisvandenbossche I thought this would be interesting for you. If I
understand correctly, in the R package the default for `hive_style` will be
NA, and `open_dataset` will then check whether the paths are hive-style by
calling `Inspect` on a `PartitioningFactory`. If the user provides a list
(vector) of partitioning names and the dataset is Hive-partitioned, an error
will be raised if the schemas do not match, or the list will be ignored if
they do. (An idea for
[ARROW-15310](https://issues.apache.org/jira/browse/ARROW-15310).)
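
The behaviour described above can be sketched without any Arrow dependency. The snippet below is a rough, language-neutral illustration in plain Python, not the code from this PR and not an Arrow API: `hive_keys` and `resolve_partition_names` are hypothetical helpers, and comparing the user-supplied names to the detected keys as ordered lists is an assumption about what "schemas match" means.

```python
from typing import List, Optional, Sequence


def hive_keys(path: str) -> List[str]:
    """Return the keys of `key=value` directory segments in a path, in order."""
    # Drop the last segment, which is assumed to be the file name.
    segments = path.strip("/").split("/")[:-1]
    return [seg.split("=", 1)[0] for seg in segments if "=" in seg]


def resolve_partition_names(
    paths: Sequence[str], names: Optional[Sequence[str]] = None
) -> Optional[List[str]]:
    """Hypothetical helper: decide which partition names to use.

    - No hive segments found: fall back to the user-supplied names.
    - Hive segments found and names supplied: the names are redundant,
      so only validate them (assumption: same keys in the same order).
    - Hive segments found and no names supplied: use the detected keys.
    """
    detected: List[str] = []
    for path in paths:
        for key in hive_keys(path):
            if key not in detected:
                detected.append(key)

    if not detected:
        return list(names) if names is not None else None

    if names is not None and list(names) != detected:
        raise ValueError(
            f"partitioning names {list(names)} do not match "
            f"hive-style keys {detected} found in the paths"
        )
    return detected
```

For example, `resolve_partition_names(["year=2021/month=1/f.parquet"], ["year", "month"])` accepts the redundant names, while passing `["a", "b"]` for the same paths raises an error up front rather than failing later with a less obvious cause.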



