paleolimbot opened a new issue, #33960: URL: https://github.com/apache/arrow/issues/33960
### Describe the bug, including details regarding any error messages, version, and platform.

Discovered in https://github.com/duckdb/duckdb/issues/5895, which ended with 32-bit integers (our calculated output schema) being interpreted as 64-bit integers (the actual resulting schema).

In 11.0.0 you can now get a RecordBatchReader from an Arrow dplyr query without starting to evaluate the query, so `as_record_batch_reader(query)$schema` is safe. We might be able to replace our current schema inference with that, except we'll need to be careful about performance, because that process involves building up the entire query from scratch, including any nested components. I don't know how often we request that schema, but maybe we can request it less as part of the fix.

An example where our calculated output schema is incorrect:

``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)

small <- tibble(a = c(letters, letters)) |> as_arrow_table()
query <- small |> count(a)

query
#> Table (query)
#> a: string
#> n: int32
#>
#> See $.data for the source Arrow object

collect(query, as_data_frame = FALSE)
#> Table
#> 26 rows x 2 columns
#> $a <string>
#> $n <int64>

as_record_batch_reader(query)$schema
#> Schema
#> a: string
#> n: int64
```

<sup>Created on 2023-01-31 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

### Component(s)

R
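A minimal sketch of the replacement the issue suggests (the helper name `query_output_schema()` is hypothetical, not part of the arrow package): derive a query's output schema from the RecordBatchReader rather than from the package's own R-side type inference, since the reader reports the schema the engine will actually produce.

``` r
# Hypothetical helper (not part of the arrow package): compute a query's
# output schema from its RecordBatchReader instead of R-side inference.
# Assumes arrow >= 11.0.0, where as_record_batch_reader() builds the query
# plan without starting to evaluate it, so reading $schema is cheap-ish
# but still rebuilds the plan, including any nested components.
query_output_schema <- function(query) {
  as_record_batch_reader(query)$schema
}
```

For the reprex above, this would report `n: int64` (the type the engine actually produces for `count()`) rather than the `n: int32` that printing the query shows; the performance caveat is that each call rebuilds the whole query from scratch, which is why the issue suggests requesting the schema less often.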
