prayaggordy opened a new issue, #43303:
URL: https://github.com/apache/arrow/issues/43303

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I'm running `arrow` 16.1.0 with R 4.4.1 on a MacBook Pro M3, macOS 14.5, 
though this also occurred on Linux.
   
   I partitioned my dataset by a string column whose values can be parsed as 
integers. When I read the dataset back into R with `arrow::open_dataset`, the 
partition column became an integer, even though I saved it as a string.
   
   For example:
   
   ```{r}
   mtcars |>
     dplyr::mutate(cyl_ch = stringr::str_pad(cyl, 2, pad = "0"),
                   gear_ch = stringr::str_pad(gear, 2, pad = "0")) |>
     dplyr::group_by(cyl_ch) |>
     arrow::write_dataset("output/partition_cyl_ch")
   ```
   
   The resulting directory structure is:
   
   ```
   output
     partition_cyl_ch
       cyl_ch=04
         part-0.parquet
       cyl_ch=06
         part-0.parquet
       cyl_ch=08
         part-0.parquet
   ```
   
   If I run `arrow::open_dataset("output/partition_cyl_ch")`, the `cyl_ch` 
column is now an `int32` (with values 4, 6, and 8 instead of "04", "06", and 
"08"), while the `gear_ch` column remains a `string` as intended.
   
   I want the `cyl_ch` column to remain a `string` as well. There are no error 
messages, just an unexpected result.
   
   It looks like the partition column itself (in this case, `cyl_ch`) isn't 
saved in the resulting Parquet files but is instead inferred from the folder 
names; perhaps this is where the string is cast to an integer. For instance, if 
I read one file directly with 
`arrow::read_parquet("output/partition_cyl_ch/cyl_ch=04/part-0.parquet")`, the 
`cyl_ch` column does not appear. There was a similar issue in the [duckdb 
GitHub repository](https://github.com/duckdb/duckdb/pull/11676), but I can't 
find anything in the arrow repo.
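   
   If the type really is inferred from the folder names, one possible 
workaround is to declare the partition column's type when opening the dataset. 
The sketch below is an assumption about how this might be done, not a confirmed 
fix: it passes `arrow::hive_partition()` with an explicit `string` type to the 
`partitioning` argument of `arrow::open_dataset`. The temporary path and the 
`sprintf` padding are illustrative stand-ins for the example above.
   
   ```r
   # Sketch: declare the partition column's type instead of letting arrow infer it.
   path <- file.path(tempdir(), "partition_cyl_ch")

   df <- mtcars
   df$cyl_ch <- sprintf("%02d", as.integer(df$cyl))  # zero-padded strings
   arrow::write_dataset(df, path, partitioning = "cyl_ch")

   # Default behaviour: the type of cyl_ch is inferred from the directory names.
   ds_inferred <- arrow::open_dataset(path)

   # Assumed workaround: state explicitly that cyl_ch is a string.
   ds <- arrow::open_dataset(
     path,
     partitioning = arrow::hive_partition(cyl_ch = arrow::string())
   )

   # Inspect the schema to check which type cyl_ch ended up with.
   ds$schema
   ```
   
   Comparing `ds_inferred$schema` with `ds$schema` should show whether the 
explicit partitioning changes the type of `cyl_ch`.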
   
   ### Component(s)
   
   R

