[ https://issues.apache.org/jira/browse/ARROW-13887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428789#comment-17428789 ]
Dragoș Moldovan-Grünfeld commented on ARROW-13887:
--------------------------------------------------
This error is caught only because there is a type mismatch between the column
type and the first value read (the header). Note that for the first column,
*company*, where there is no mismatch between the header value (a string) and
the column type (utf8()), we do not get an error message from C++.
This implies we cannot rely on capturing the C++ error message and offering a
more informative message in R, as the error is not always triggered (for
example, with a CSV where all the columns are strings / characters).
The solution might be to somehow assess whether the CSV file has headers or
not.
{code:r}
library(arrow)

# Both columns are strings, so the header row also parses as valid utf8 data
share_data2 <- tibble::tibble(
  company = c("AMZN", "GOOG", "BKNG", "TSLA"),
  another_string = c("AMZN", "GOOG", "BKNG", "TSLA")
)
readr::write_csv(share_data2, file = "share_data2.csv")

share_schema2 <- schema(
  company = utf8(),
  another_string = utf8()
)

# No error is raised; the header is read in as the first row of data
read_csv_arrow("share_data2.csv", schema = share_schema2)
{code}
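One possible way to assess whether the file has headers is sketched below. It
is only a rough illustration, not an existing arrow function: the helper
{{first_line_matches_names()}} is hypothetical, it uses base R only, and it
assumes a simple delimiter with no embedded delimiters inside quoted fields.
It reads the first line of the file and checks whether its fields match the
expected column names (which, with arrow attached, could presumably be taken
from {{names(share_schema2)}}).
{code:r}
# Hypothetical helper (not part of arrow): guess whether the first line of a
# CSV is a header row by comparing its fields with the expected column names.
first_line_matches_names <- function(file, col_names, delim = ",") {
  first_line <- readLines(file, n = 1L, warn = FALSE)
  if (length(first_line) == 0L) {
    # Empty file: nothing that could be a header
    return(FALSE)
  }
  # Naive split: does not handle delimiters inside quoted fields
  fields <- strsplit(first_line, delim, fixed = TRUE)[[1]]
  # Strip surrounding whitespace and optional double quotes
  fields <- gsub('^"|"$', "", trimws(fields))
  identical(fields, col_names)
}

# Small self-contained example mirroring share_data2.csv
csv_file <- tempfile(fileext = ".csv")
writeLines(c("company,another_string", "AMZN,AMZN", "GOOG,GOOG"), csv_file)

first_line_matches_names(csv_file, c("company", "another_string"))
{code}
If the check returns TRUE, the R code could suggest {{skip = 1}} to the user
(or warn) before the file is handed to C++, which would also cover the
all-string case where no conversion error is raised.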
> [R] Capture error produced when reading in CSV file with headers and using a
> schema, and add suggestion
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-13887
> URL: https://issues.apache.org/jira/browse/ARROW-13887
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Nicola Crane
> Assignee: Dragoș Moldovan-Grünfeld
> Priority: Major
> Labels: good-first-issue
> Fix For: 6.0.0
>
>
> When reading in a CSV with headers, and also using a schema, we get an error
> as the code tries to read in the header as a line of data.
> {code:r}
> share_data <- tibble::tibble(
> company = c("AMZN", "GOOG", "BKNG", "TSLA"),
> price = c(3463.12, 2884.38, 2300.46, 732.39)
> )
> readr::write_csv(share_data, file = "share_data.csv")
> share_schema <- schema(
> company = utf8(),
> price = float64()
> )
> read_csv_arrow("share_data.csv", schema = share_schema)
> {code}
> {code:java}
> Error: Invalid: In CSV column #1: CSV conversion error to double: invalid
> value 'price'
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:492 decoder_.Decode(data,
> size, quoted, &value)
> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:84 status
> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:496
> parser.VisitColumn(col_index, visit) {code}
> The correct thing here would have been for the user to supply the argument
> {{skip=1}} to {{read_csv_arrow()}} but this is not immediately obvious from
> the error message returned from C++. We should capture the error and instead
> supply our own error message using {{rlang::abort}} which informs the user of
> the error and then suggests what they can do to prevent it.
>
> For similar examples (and their associated PRs) see ARROW-11766 and
> ARROW-12791.