[
https://issues.apache.org/jira/browse/ARROW-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376697#comment-17376697
]
Nic Crane commented on ARROW-13169:
-----------------------------------
Further to Weston's comment above, I've looked into this a little. The
relevant bit of R code, shrunk down to a minimal example, is:
{code:r}
dir <- "./10M_records"
n_row <- 1e7
df <- data.frame(foo = runif(n_row))
df$let <- sort(sample(letters, n_row, replace = TRUE))
write_dataset(df, dir, partitioning = "let")
# this should be 26, corresponding to the number of letters (but is not)
length(list.files(dir))
#> [1] 3
{code}
Prior to commit c697a41ab9c (the changes in this PR:
https://github.com/apache/arrow/pull/9768), the above code returns the correct
value (26), whereas after that commit it returns the incorrect value.
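Counting directories only shows that partitions are missing, not where the rows went. A rough sketch of a stricter check follows. Note that check_partitions is a hypothetical helper (not part of arrow): it compares the per-key row counts in the data frame with the counts read back from the written dataset. It assumes hive-style partition directories, which open_dataset() detects by default, and that dplyr is installed so that collect() works on the dataset.
{code:r}
library(arrow)
library(dplyr)

# Hypothetical helper, not part of arrow: write `df` partitioned by `key`,
# read it back, and return the keys whose row counts differ.
check_partitions <- function(df, dir, key) {
  write_dataset(df, dir, partitioning = key)

  # Reads the whole dataset back into memory; fine for a test-sized example.
  written <- open_dataset(dir) %>% collect()

  # Row counts per key value before and after the round trip.
  counts <- merge(
    as.data.frame(table(df[[key]]), responseName = "expected"),
    as.data.frame(table(written[[key]]), responseName = "written"),
    by = "Var1", all = TRUE
  )

  # A key is a mismatch if it is missing on either side or the counts differ.
  mismatch <- is.na(counts$expected) | is.na(counts$written) |
    counts$expected != counts$written
  counts[mismatch, ]
}

n_row <- 1e7
df <- data.frame(foo = runif(n_row))
df$let <- sort(sample(letters, n_row, replace = TRUE))

# Zero rows returned means every partition holds the rows it should.
check_partitions(df, tempfile("sorted_"), "let")
{code}
If the helper returns no rows, each letter's partition contains exactly the number of rows that letter had in df. It does not compare the values in foo, so it is only a first-pass check.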
> [R] [C++] sorted partition keys can cause issues
> ------------------------------------------------
>
> Key: ARROW-13169
> URL: https://issues.apache.org/jira/browse/ARROW-13169
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Reporter: Mauricio 'Pachá' Vargas Sepúlveda
> Priority: Blocker
> Fix For: 5.0.0
>
> Attachments: screenshot-1.png
>
>
> _This is a regression after 4.0.1, so it is not a live bug in a released
> version of arrow._
> When a partition key happens to be sorted and the dataset is large (>= 1e7
> rows), the partitions are not written faithfully.
> If the partition key isn't sorted, or the dataset is smaller than 1e7 rows,
> the partitions appear to be correct (though we should check that the values
> in other rows do still match when we test this).
> {code:r}
> library(arrow)
> dir <- "./1M_records"
> n_row <- 1e6
> df <- data.frame(foo = runif(n_row))
> df$let <- sort(sample(letters, n_row, replace = TRUE))
> write_dataset(df, dir, partitioning = "let")
> # this should be 26, corresponding to the number of letters (and is)
> length(list.files(dir))
> #> [1] 26
> dir <- "./10M_records_not_sorted"
> n_row <- 1e7
> df <- data.frame(foo = runif(n_row))
> df$let <- sample(letters, n_row, replace = TRUE)
> write_dataset(df, dir, partitioning = "let")
> # this should be 26, corresponding to the number of letters (and is!)
> length(list.files(dir))
> #> [1] 26
> dir <- "./10M_records"
> n_row <- 1e7
> df <- data.frame(foo = runif(n_row))
> df$let <- sort(sample(letters, n_row, replace = TRUE))
> write_dataset(df, dir, partitioning = "let")
> # this should be 26, corresponding to the number of letters (but is not)
> length(list.files(dir))
> #> [1] 3
> # the letters that were retained:
> list.files(dir)
> #> [1] "let=a" "let=b" "let=c"
> # Oddly(?) all of the rows are here, they have just been reshuffled into one
> # of the letters retained
> nrow(open_dataset(dir))
> #> [1] 10000000
> {code}
> h1. Original report for context:
> A bit of context: the data for this example contains all the world exports
> in 1995. It contains 212 countries, but when saving it as Parquet, only 66
> countries are actually recorded. The verification I included checks whether
> the USA (one of the best in the reporter quality index) is present in the
> data.
> {code:r}
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #> timestamp
> library(dplyr)
> #>
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #>
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #>
> #> intersect, setdiff, setequal, union
> url <- "https://ams3.digitaloceanspaces.com/uncomtrade/baci_hs92_1995.rds"
> rds <- "baci_hs92_1995.rds"
> if (!file.exists(rds)) try(download.file(url, rds))
> d <- readRDS("baci_hs92_1995.rds")
> rds_has_usa <- any(grepl("usa", unique(d$reporter_iso)))
> rds_has_usa
> #> [1] TRUE
> dir <- "parquet/baci_hs92"
> d %>%
>   group_by(year, reporter_iso) %>%
>   write_dataset(dir, hive_style = FALSE)
> parquet_has_usa <- any(grepl("usa", list.files(paste0(dir, "/1995"))))
> parquet_has_usa
> #> [1] FALSE
> {code}
> _Created on 2021-06-24 by the reprex package (https://reprex.tidyverse.org)
> (v2.0.0)_