Andy Teucher created ARROW-16783:
------------------------------------
Summary: [R] write_dataset fails with an uninformative message
when there are duplicated column names
Key: ARROW-16783
URL: https://issues.apache.org/jira/browse/ARROW-16783
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 8.0.0
Reporter: Andy Teucher
{{write_dataset()}} fails when the object being written has duplicated column
names. This is probably reasonable behaviour, but the error message is
misleading:
{code:r}
library(arrow, warn.conflicts = FALSE)
df <- data.frame(
  id = c("a", "b", "c"),
  x = 1:3,
  x = 4:6,
  check.names = FALSE
)
write_dataset(df, "df.parquet")
#> Error: 'dataset' must be a Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame, not "data.frame"
{code}
[{{write_dataset()}} calls {{as_adq()}} inside a {{tryCatch()}}
statement|https://github.com/apache/arrow/blob/0d5cf1882228624271062e6c19583c8b0c361a20/r/R/dataset-write.R#L146-L160],
so any error from {{as_adq()}} is swallowed and the error emitted is about the
class of the object.
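A minimal sketch of this error-swallowing pattern (the function names below are illustrative stand-ins, not arrow's actual internals):
{code:r}
# Hypothetical stand-in for as_adq(): raises an informative error.
as_adq_sketch <- function(x) {
  stop("Duplicated field names")
}

write_dataset_sketch <- function(dataset) {
  tryCatch(
    as_adq_sketch(dataset),
    error = function(e) {
      # The informative condition `e` is discarded here, so the caller
      # only ever sees the generic message about the object's class:
      stop(
        "'dataset' must be a Dataset, RecordBatch, Table, ",
        "arrow_dplyr_query, or data.frame, not \"", class(dataset)[1], "\"",
        call. = FALSE
      )
    }
  )
}

write_dataset_sketch(data.frame(x = 1))
{code}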
The real error comes from here:
{code:r}
arrow:::as_adq(df)
#> Error in `arrow_dplyr_query()`:
#> ! Duplicated field names
#> ✖ The following field names were found more than once in the data: "x"
{code}
I'm not sure what your preferred fix is here... two options that come to mind
are:
1. Explicitly check for compatible classes before calling {{as_adq()}} instead
of using {{tryCatch()}}
OR
2. Check for duplicate column names before the {{tryCatch()}} block
My thought is that option 1 is better, as option 2 means that checking for
duplicates would happen twice (once inside {{write_dataset()}} and once again
inside {{as_adq()}}).
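One possible shape for option 1, as a sketch only (the accepted class names come from the current error message; the surrounding logic is hypothetical):
{code:r}
# Hypothetical sketch of option 1: validate the class up front, then let
# any error raised by as_adq() (e.g. "Duplicated field names") propagate.
supported_classes <- c(
  "Dataset", "RecordBatch", "Table", "arrow_dplyr_query", "data.frame"
)
if (!inherits(dataset, supported_classes)) {
  stop(
    "'dataset' must be a Dataset, RecordBatch, Table, ",
    "arrow_dplyr_query, or data.frame, not \"", class(dataset)[1], "\"",
    call. = FALSE
  )
}
dataset <- as_adq(dataset)  # errors here would now surface unmodified
{code}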
I'm happy to work on a fix if you like!
--
This message was sent by Atlassian Jira
(v8.20.7#820007)