kuriwaki commented on issue #39912:
URL: https://github.com/apache/arrow/issues/39912#issuecomment-2419605427
I'm getting the same error message when trying to count rows or run `distinct()`
on a 1B+ row parquet dataset. Adding `to_duckdb()` fixed it. I can't share
the dataset either, but here is the metadata requested.
```
> ds$schema
Schema
cvr_id: double
precinct: string
pres: string
pid: string
column: double
item: string
choice: string
choice_id: double
office_type: string
dist: string
party: string
incumbent: double
measure: double
place: string
topic: string
unexp_term: double
num_votes: double
state: string
county: string
See $metadata for additional Schema metadata
>
> n_rows = vapply(ds$files, function(f) ParquetFileReader$create(f)$num_rows, 0, USE.NAMES = FALSE)
> n_rowgrps = vapply(ds$files, function(f) ParquetFileReader$create(f)$num_row_groups, 0, USE.NAMES = FALSE)
> summary(n_rows); sum(n_rows)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2588 177040 500624 2773980 1776367 140958860
[1] 1137331911
> summary(n_rowgrps); sum(n_rowgrps)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 6.00 16.00 85.16 54.75 4302.00
[1] 34916
> packageVersion("arrow")
[1] ‘17.0.0.1’
```
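For anyone hitting the same thing, here is a minimal sketch of the `to_duckdb()` workaround I described above. The dataset path and the column passed to `distinct()` are placeholders for illustration, not the real data.
```
library(arrow)
library(dplyr)
# the duckdb and dbplyr packages must also be installed for to_duckdb()

# hypothetical path standing in for the real dataset directory
ds <- open_dataset("path/to/cvr_parquet")

# distinct() through the Arrow engine errored on this dataset;
# routing the same dplyr query through DuckDB completed:
ds |>
  to_duckdb() |>
  distinct(precinct) |>   # example column taken from the schema above
  collect()
```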