kuriwaki commented on issue #39912:
URL: https://github.com/apache/arrow/issues/39912#issuecomment-2419605427

   I'm getting the same error message when trying to count or run `distinct()` on a 1B+ row Parquet dataset. Adding `to_duckdb()` fixed it. I can't share the dataset either, but here is the metadata requested; a sketch of the `to_duckdb()` workaround follows the metadata below.
   
   ```
   > ds$schema
   Schema
   cvr_id: double
   precinct: string
   pres: string
   pid: string
   column: double
   item: string
   choice: string
   choice_id: double
   office_type: string
   dist: string
   party: string
   incumbent: double
   measure: double
   place: string
   topic: string
   unexp_term: double
   num_votes: double
   state: string
   county: string
   
   See $metadata for additional Schema metadata
   > 
   > n_rows = vapply(ds$files, function(f) { ParquetFileReader$create(f)$num_rows }, 0, USE.NAMES = FALSE)
   > n_rowgrps = vapply(ds$files, function(f) { ParquetFileReader$create(f)$num_row_groups }, 0, USE.NAMES = FALSE)
   > summary(n_rows); sum(n_rows)
        Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
        2588    177040    500624   2773980   1776367 140958860 
   [1] 1137331911
   > summary(n_rowgrps); sum(n_rowgrps)
      Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1.00    6.00   16.00   85.16   54.75 4302.00 
   [1] 34916
   > packageVersion("arrow")
   [1] ‘17.0.0.1’
   ```
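   
   For reference, the workaround looked roughly like this. This is a minimal sketch only: the dataset path is hypothetical, and the `distinct()`/`count()` query just stands in for the kind of call that errored through the arrow-native path.
   
   ```r
   library(arrow)
   library(dplyr)
   library(duckdb)
   
   # Hypothetical path; the real dataset can't be shared.
   ds <- open_dataset("cvr_parquet/")
   
   # Routing the query through duckdb avoided the error that the
   # arrow-native count()/distinct() hit on this 1B+ row dataset.
   ds |>
     to_duckdb() |>
     distinct(state, county) |>
     count() |>
     collect()
   ```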
   