marberts opened a new pull request, #49343:
URL: https://github.com/apache/arrow/pull/49343

   ### Rationale for this change
   
   `write_dataset(df)` does not necessarily preserve the row ordering of `df` 
when writing across partitions. The Arrow C++ library was recently updated 
(since 21.0.0) so that row ordering can be preserved when writing across 
partitions. This is useful for cases where row ordering is assumed to be 
unchanged within each partition.
   
   ``` r
   df <- tibble::tibble(x = 1:1.5e6, g = rep(1:15, each = 1e5))
   
   df |>
     dplyr::group_by(g) |>
     arrow::write_dataset("test1", preserve_order = FALSE)
   
   df |>
     dplyr::group_by(g) |>
     arrow::write_dataset("test2", preserve_order = TRUE)
   
   test1 <- arrow::open_dataset("test1") |>
     dplyr::collect()
   
   test2 <- arrow::open_dataset("test2") |>
     dplyr::collect()
   
   # Current behavior.
   all.equal(test1 |> sort_by(~ g), df)
   #> [1] "Component \"x\": Mean relative difference: 0.0475804"
   
   # Preserve order.
   all.equal(test2 |> sort_by(~ g), df)
   #> [1] TRUE
   ```
   
   <sup>Created on 2026-02-20 with [reprex 
v2.1.1](https://reprex.tidyverse.org)</sup>
   
   ### What changes are included in this PR?
   
   Added an argument `preserve_order` to `write_dataset()`. When it is `TRUE`, 
`FileSystemDatasetWriteOptions.preserve_order` is set to true in the call to 
`ExecPlan_Write()`.
   
   ### Are these changes tested?
   
   Partially. The change is small, so I haven't written unit tests. I can 
revisit this if necessary.
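   
   If a unit test is wanted, it could look roughly like the sketch below. This 
is only an illustration, not code from this PR: it assumes `testthat` and 
`dplyr` are available, reuses the data from the reprex above, and skips itself 
when the installed arrow build does not have the `preserve_order` argument. The 
temporary path `dst` is a name made up for the example.
   
   ``` r
   library(testthat)
   
   test_that("write_dataset(preserve_order = TRUE) keeps row order per partition", {
     skip_if_not_installed("arrow")
     skip_if_not_installed("dplyr")
     # Assumption: the argument only exists in builds that include this PR.
     skip_if_not("preserve_order" %in% names(formals(arrow::write_dataset)))
   
     # Same shape as the reprex: 15 partitions of 1e5 rows each.
     df <- data.frame(x = 1:1.5e6, g = rep(1:15, each = 1e5))
     dst <- tempfile()
   
     df |>
       dplyr::group_by(g) |>
       arrow::write_dataset(dst, preserve_order = TRUE)
   
     out <- arrow::open_dataset(dst) |>
       dplyr::collect()
   
     # Compare the sequence of `x` within each partition, not the global order.
     expect_equal(split(out$x, out$g), split(df$x, df$g))
   })
   ```
   
   Comparing per-partition vectors avoids depending on the order in which 
partitions are read back, which is not what the new argument controls.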
   
   ### Are there any user-facing changes?
   
   Yes, there is a new argument in `write_dataset()`. The default keeps the 
current behavior and the argument appears after all existing arguments, so the 
change is backwards compatible.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

