xtimbeau commented on issue #40593:
URL: https://github.com/apache/arrow/issues/40593#issuecomment-2002372814
Hi @amoeba. I also tried to reproduce it, with limited success. While playing
around with datasets and arrow 15, I think I found two other bugs:
1. Trying to write a parquet file to a non-existent path (on a Dropbox); I know
this is an unforgivable error, but crashing R where an error message would be
nice is a bit hard on us.
2. Producing a dataset with a hierarchy of folders works well, unless you
name the subfolder VAR=value and the parquet data file also contains a field
VAR; that crashes R too. An error message would be nice (or even renaming one
of the two VARs to something else).
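Until arrow reports these two cases as errors, a caller-side guard could look roughly like the sketch below. Both helpers (`safe_write_parquet` and `partition_collisions`) are hypothetical names of mine, not part of arrow; the only arrow call used is `write_parquet()`.

```r
# Hypothetical helper (not part of arrow): stop with a readable error when the
# target directory does not exist, instead of handing the path to arrow.
safe_write_parquet <- function(df, path) {
  target_dir <- dirname(path)
  if (!dir.exists(target_dir)) {
    stop("Directory does not exist: ", target_dir, call. = FALSE)
  }
  arrow::write_parquet(df, path)
}

# Hypothetical helper: return the partition keys (the VAR in "VAR=value"
# folder names under dataset_dir) that are also column names in the data.
partition_collisions <- function(dataset_dir, column_names) {
  dirs <- list.dirs(dataset_dir, full.names = FALSE, recursive = TRUE)
  parts <- unlist(strsplit(dirs, "/", fixed = TRUE))
  keys <- sub("=.*$", "", parts[grepl("=", parts, fixed = TRUE)])
  intersect(unique(keys), column_names)
}
```

Calling `partition_collisions()` with the dataset root and `names()` of one data file before `open_dataset()` would flag case 2 up front; it only inspects folder names and does not read any parquet file.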
I also sent you a link to my faulty dataset by PM. The crash is probably
related to differing dictionaries or factor handling, since changing the type
of CCONLC makes the crashes go away:
```
library(tidyverse)
library(glue) # glue() is not attached by library(tidyverse)
library(arrow)
# link is in your mail box
test <- glue("{link}/SOURCEFF=2022")
alt <- glue("{link}/alt")
dir.create(alt)
open_dataset(test) |> select(CCONLC) |> collect() # works fine
open_dataset(test) |> to_duckdb() |> select(CCONLC) |> collect() # crashes ungracefully
deps <- arrow::open_dataset(test) |> distinct(CCODEP) |> collect() |> pull()
walk(deps, ~{
  unlink(str_c(alt, "/CCODEP=", .x), recursive = TRUE)
  dir.create(str_c(alt, "/CCODEP=", .x))
  read_parquet(str_c(test, "/CCODEP=", .x, "/part-0.parquet")) |>
    mutate(CCONLC = as.character(CCONLC)) |>
    write_parquet(str_c(alt, "/CCODEP=", .x, "/part-0.parquet"))
})
open_dataset(alt) |> select(CCONLC) |> collect() # still works
open_dataset(alt) |> to_duckdb() |> select(CCONLC) |> collect() # works too!
```
You can also revert alt$CCONLC to a factor, parquet file by parquet file, and
reproduce the crash when collecting it with duckdb.
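A minimal sketch of that per-file conversion, assuming the `alt` and `deps` objects from the snippet above. The helper name `revert_to_factor` and the output directory `alt_factor` are hypothetical; writing to a separate directory rather than overwriting in place is my choice, not something tested in this report.

```r
library(arrow)

# Hypothetical helper (name and arguments are illustrative): read one parquet
# file, turn CCONLC back into a factor, and write the result to a new path.
revert_to_factor <- function(in_file, out_file) {
  df <- read_parquet(in_file)
  df$CCONLC <- as.factor(df$CCONLC)
  dir.create(dirname(out_file), recursive = TRUE, showWarnings = FALSE)
  write_parquet(df, out_file)
  invisible(out_file)
}

# Usage sketch; `alt_factor` is a hypothetical output directory:
# for (dep in deps) {
#   revert_to_factor(
#     file.path(alt, paste0("CCODEP=", dep), "part-0.parquet"),
#     file.path(alt_factor, paste0("CCODEP=", dep), "part-0.parquet")
#   )
# }
```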
Hope this clarifies things.