JerePlum99 commented on issue #46681: URL: https://github.com/apache/arrow/issues/46681#issuecomment-2949754381
@thisisnic Is there any way in the R package to show the partitions actually being accessed by a query? I'm trying to build a local reprex that wouldn't require cloud storage or extremely large data writes, but I can't seem to find a way to see the partitions being accessed through `arrow::show_exec_plan()`. Is there any functionality similar to Python's `get_fragments()` that would show this?

Typically I'm working with very large datasets in S3, and I've noticed that while a simple `filter(col == "x")` properly leverages partitions, `filter(col %in% c("x", "y"))` does not, which significantly diminishes the value of partitioning when you have more than 3 or 4 levels.

Here is the current code I have:

```r
library(arrow)
library(dplyr)

# Create sample partitioned dataset
set.seed(123)
n_rows <- 50000
n_partitions <- 5
sample_data <- data.frame(
  partition_col = sample(paste0("group_", LETTERS[1:n_partitions]), n_rows, replace = TRUE),
  value1 = rnorm(n_rows),
  value2 = sample(letters, n_rows, replace = TRUE),
  id = 1:n_rows
)

# Write as partitioned dataset
data_dir <- "./arrow_partition_test"
if (dir.exists(data_dir)) unlink(data_dir, recursive = TRUE)
dir.create(data_dir)
arrow::write_dataset(
  sample_data,
  data_dir,
  partitioning = "partition_col",
  format = "parquet"
)

# Verify partitioned structure was created
print("Partitioned dataset structure:")
print(list.files(data_dir, recursive = TRUE))

# Open dataset
dataset <- arrow::open_dataset(data_dir)
print(paste("Dataset contains", length(dataset$files), "partition files"))

# Test 1: Single equality filter (should access 1 partition)
cat("\n=== Single Equality Filter ===\n")
query1 <- dataset |>
  dplyr::filter(partition_col == "group_A")
print("Filter:")
print(query1)
print("Execution plan:")
dplyr::explain(query1)
result1 <- query1 |> dplyr::collect()
print(paste("Rows returned:", nrow(result1)))

# Test 2: OR conditions (should access 3 partitions)
cat("\n=== OR Conditions Filter ===\n")
query2 <- dataset |>
  dplyr::filter(
    partition_col == "group_A" |
      partition_col == "group_B" |
      partition_col == "group_C"
  )
print("Filter:")
print(query2)
print("Execution plan:")
dplyr::explain(query2)
result2 <- query2 |> dplyr::collect()
print(paste("Rows returned:", nrow(result2)))

# Test 3: %in% operator (should access 3 partitions)
cat("\n=== %in% Operator Filter ===\n")
partitions_filter <- c("group_A", "group_B", "group_C")
query3 <- dataset |>
  dplyr::filter(partition_col %in% partitions_filter)
print("Filter:")
print(query3)
print("Execution plan:")
dplyr::explain(query3)
result3 <- query3 |> dplyr::collect()
print(paste("Rows returned:", nrow(result3)))

# Results comparison
cat("\n=== ISSUE SUMMARY ===\n")
print("✅ All approaches return functionally equivalent results:")
print(paste("  - Single equality:", nrow(result1), "rows"))
print(paste("  - OR conditions:", nrow(result2), "rows"))
print(paste("  - %in% operator:", nrow(result3), "rows"))
print("❌ BUT: No way to verify partition-level efficiency")
print("  - All execution plans show generic 'SourceNode{}'")
print("  - Cannot see which partition files are actually accessed")
print("  - Cannot verify if %in% leverages partition pruning optimally")

cat("\n=== ENHANCEMENT REQUEST ===\n")
print("Need R equivalent of Python's dataset.get_fragments(filter=...) to:")
print("1. Show which partition files will be scanned for each filter type")
print("2. Verify partition pruning optimization")
print("3. Enable performance debugging for large partitioned datasets")

# Cleanup
unlink(data_dir, recursive = TRUE)
sessionInfo()
```