paleolimbot opened a new issue, #33960: URL: https://github.com/apache/arrow/issues/33960
### Describe the bug, including details regarding any error messages, version, and platform.

Discovered in https://github.com/duckdb/duckdb/issues/5895, which ended with 32-bit integers (our calculated output schema) being interpreted as 64-bit integers (the actual resulting schema).

In 11.0.0 you can now get a RecordBatchReader from an Arrow dplyr query without starting to evaluate the query, so `as_record_batch_reader(query)$schema` is safe. We might be able to replace our current schema inference with that, except we'll need to be careful about performance, because that process involves building up the entire query from scratch, including any nested components. I don't know how often we request that schema, but maybe we can request it less as part of the fix.

An example where our calculated output schema is incorrect:

``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)

small <- tibble(a = c(letters, letters)) |> as_arrow_table()
query <- small |> count(a)

query
#> Table (query)
#> a: string
#> n: int32
#>
#> See $.data for the source Arrow object

collect(query, as_data_frame = FALSE)
#> Table
#> 26 rows x 2 columns
#> $a <string>
#> $n <int64>

as_record_batch_reader(query)$schema
#> Schema
#> a: string
#> n: int64
```

<sup>Created on 2023-01-31 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

### Component(s)

R
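A minimal sketch of the replacement the issue suggests (the helper name `query_output_schema()` is hypothetical, not part of the arrow package): derive a query's output schema from the RecordBatchReader rather than from the package's own R-side type inference, since the reader reports the schema the engine will actually produce.

``` r
# Hypothetical helper (not part of the arrow package): compute a query's
# output schema from its RecordBatchReader instead of R-side inference.
# Assumes arrow >= 11.0.0, where as_record_batch_reader() builds the query
# plan without starting to evaluate it, so reading $schema is cheap-ish
# but still rebuilds the plan, including any nested components.
query_output_schema <- function(query) {
  as_record_batch_reader(query)$schema
}
```

For the reprex above, this would report `n: int64` (the type the engine actually produces for `count()`) rather than the `n: int32` that printing the query shows; the performance caveat is that each call rebuilds the whole query from scratch, which is why the issue suggests requesting the schema less often.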
