xtimbeau commented on issue #40593:
URL: https://github.com/apache/arrow/issues/40593#issuecomment-2002372814

   Hi @amoeba. I also tried to reproduce it, with limited success. While 
experimenting with datasets and arrow 15, I think I found two other bugs:
   1. Writing a parquet file to a non-existent path (on a Dropbox folder). I 
know this is an unforgivable error on my part, but crashing R where an error 
message would do is a bit hard on us.
   2. Producing a dataset with a hierarchy of folders works well, unless a 
subfolder is named VAR=value and the parquet data file also contains a field 
VAR; that crashes R too. An error message would be nice (or even renaming one 
of the two VARs to something else).
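   
   Until both cases raise proper errors, a guard on the R side can catch them 
before arrow is called. This is only a minimal sketch in base R: the helper 
names (`check_target_dir`, `partition_keys`, `check_partition_clash`) are 
hypothetical and not part of arrow, and it assumes partition folders follow 
the VAR=value naming described above.
   
   ```r
   # Hypothetical helper: stop with a normal R error if the target folder is missing.
   check_target_dir <- function(path) {
     dir <- dirname(path)
     if (!dir.exists(dir)) {
       stop("Target directory does not exist: ", dir, call. = FALSE)
     }
     invisible(TRUE)
   }
   
   # Hypothetical helper: extract the VAR part of every VAR=value folder in a path.
   partition_keys <- function(path) {
     parts <- strsplit(path, "[/\\\\]")[[1]]
     keyed <- parts[grepl("=", parts, fixed = TRUE)]
     sub("=.*$", "", keyed)
   }
   
   # Hypothetical helper: stop if a partition key is also a column of the data.
   check_partition_clash <- function(path, columns) {
     clash <- intersect(partition_keys(path), columns)
     if (length(clash) > 0) {
       stop("Partition key also present as a column: ",
            paste(clash, collapse = ", "), call. = FALSE)
     }
     invisible(TRUE)
   }
   ```
   
   Calling `check_target_dir(path)` and `check_partition_clash(path, names(df))` 
just before `write_parquet(df, path)` turns both situations into ordinary R 
errors on the caller's side.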
   
   I then sent you a link to my faulty dataset by PM. The crash appears to be 
related to differing dictionaries or factor handling, since changing the type 
of CCONLC resolves it:
   
   ```
   library(tidyverse)
   library(glue)   # glue() is not attached by library(tidyverse)
   library(arrow)  # to_duckdb() also needs the duckdb package installed
   # link is in your mail box
   test <- glue("{link}/SOURCEFF=2022")
   alt <- glue("{link}/alt")
   dir.create(alt)
   
   open_dataset(test) |> select(CCONLC) |> collect() # works fine
   open_dataset(test) |> to_duckdb() |> select(CCONLC) |> collect() # crashes ungracefully
   
   # rewrite each CCODEP partition into alt, with CCONLC converted to character
   deps <- open_dataset(test) |> distinct(CCODEP) |> collect() |> pull()
   walk(deps, ~{
     unlink(str_c(alt, "/CCODEP=", .x), recursive = TRUE)
     dir.create(str_c(alt, "/CCODEP=", .x))
     read_parquet(str_c(test, "/CCODEP=", .x, "/part-0.parquet")) |> 
       mutate(CCONLC = as.character(CCONLC)) |> 
       write_parquet(str_c(alt, "/CCODEP=", .x, "/part-0.parquet"))
   })
   
   open_dataset(alt) |> select(CCONLC) |> collect() # still works
   open_dataset(alt) |> to_duckdb() |> select(CCONLC) |> collect() # works too!
   ```
   You can also revert CCONLC in alt to a factor, parquet file by parquet 
file, and reproduce the crash when collecting it with duckdb.
   
   Hope this clarifies.

