JerePlum99 commented on issue #46681: URL: https://github.com/apache/arrow/issues/46681#issuecomment-2949754381
@thisisnic Is there any way in the R package to show the partitions actually being accessed by a query? I'm trying to build a local reprex that wouldn't require cloud storage or extremely large data writes, but I can't seem to find a way to see the partitions being accessed through `arrow::show_exec_plan()`. Is there any functionality similar to Python's `get_fragments()` that would show this?

Typically I'm working with very large datasets in S3, and I've noticed that while a simple `filter(col == "x")` properly leverages partitions, `filter(col %in% c("x", "y"))` does not, which significantly diminishes the value of partitioning when you have more than 3 or 4 levels.

Here is the current code I have:

```r
library(arrow)
library(dplyr)

# Create sample partitioned dataset
set.seed(123)
n_rows <- 50000
n_partitions <- 5
sample_data <- data.frame(
  partition_col = sample(paste0("group_", LETTERS[1:n_partitions]), n_rows, replace = TRUE),
  value1 = rnorm(n_rows),
  value2 = sample(letters, n_rows, replace = TRUE),
  id = 1:n_rows
)

# Write as partitioned dataset
data_dir <- "./arrow_partition_test"
if (dir.exists(data_dir)) unlink(data_dir, recursive = TRUE)
dir.create(data_dir)
arrow::write_dataset(
  sample_data,
  data_dir,
  partitioning = "partition_col",
  format = "parquet"
)

# Verify partitioned structure was created
print("Partitioned dataset structure:")
print(list.files(data_dir, recursive = TRUE))

# Open dataset
dataset <- arrow::open_dataset(data_dir)
print(paste("Dataset contains", length(dataset$files), "partition files"))

# Test 1: Single equality filter (should access 1 partition)
cat("\n=== Single Equality Filter ===\n")
query1 <- dataset |>
  dplyr::filter(partition_col == "group_A")
print("Filter:")
print(query1)
print("Execution plan:")
dplyr::explain(query1)
result1 <- query1 |> dplyr::collect()
print(paste("Rows returned:", nrow(result1)))

# Test 2: OR conditions (should access 3 partitions)
cat("\n=== OR Conditions Filter ===\n")
query2 <- dataset |>
  dplyr::filter(
    partition_col == "group_A" |
      partition_col == "group_B" |
      partition_col == "group_C"
  )
print("Filter:")
print(query2)
print("Execution plan:")
dplyr::explain(query2)
result2 <- query2 |> dplyr::collect()
print(paste("Rows returned:", nrow(result2)))

# Test 3: %in% operator (should access 3 partitions)
cat("\n=== %in% Operator Filter ===\n")
partitions_filter <- c("group_A", "group_B", "group_C")
query3 <- dataset |>
  dplyr::filter(partition_col %in% partitions_filter)
print("Filter:")
print(query3)
print("Execution plan:")
dplyr::explain(query3)
result3 <- query3 |> dplyr::collect()
print(paste("Rows returned:", nrow(result3)))

# Results comparison
cat("\n=== ISSUE SUMMARY ===\n")
print("✅ All approaches return functionally equivalent results:")
print(paste("  - Single equality:", nrow(result1), "rows"))
print(paste("  - OR conditions:", nrow(result2), "rows"))
print(paste("  - %in% operator:", nrow(result3), "rows"))
print("❌ BUT: No way to verify partition-level efficiency")
print("  - All execution plans show generic 'SourceNode{}'")
print("  - Cannot see which partition files are actually accessed")
print("  - Cannot verify if %in% leverages partition pruning optimally")

cat("\n=== ENHANCEMENT REQUEST ===\n")
print("Need R equivalent of Python's dataset.get_fragments(filter=...) to:")
print("1. Show which partition files will be scanned for each filter type")
print("2. Verify partition pruning optimization")
print("3. Enable performance debugging for large partitioned datasets")

# Cleanup
unlink(data_dir, recursive = TRUE)
sessionInfo()
```