chenyiwrites opened a new issue, #39912:
URL: https://github.com/apache/arrow/issues/39912

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hi everyone,
   
   I am working with a large dataset of over 1 billion observations and 41 
variables, stored in 3040 parquet files. I read the data with `open_dataset()` 
and then applied functions from `dplyr`:
   
   ```r
   individual_positions %>% 
     group_by(user_id) %>% 
     summarize(n_positions = n()) %>% 
     count(n_positions, sort = TRUE) %>% 
     collect()
   ```
   
   `individual_positions` is my dataset, which consists of the different job 
positions a user held throughout her career. I was trying to understand the 
distribution of the number of job positions each user has ever held, and I got 
the following error message:
   
   ```
   Error in `compute.arrow_dplyr_query()`:
   ! Invalid: Negative buffer resize: -2147483584
   Backtrace:
    1. ... %>% collect()
    3. arrow:::collect.arrow_dplyr_query(.)
    4. arrow:::compute.arrow_dplyr_query(x)
   ```
   
   I searched for what "negative buffer resize" means, but found nothing 
useful. Can anyone please help me interpret the error and suggest a solution? I 
know it's possible to process the dataset in `SAS`, but I'm an R lover and I 
would really like to stick with it. Thanks a lot!
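   
   One possible reading of the number in the message, offered as an assumption 
rather than a confirmed diagnosis: if the requested buffer size was computed in 
a signed 32-bit integer, a request slightly above 2^31 bytes would wrap around 
to a negative value. The sketch below only checks that arithmetic in base R; 
the variable names are illustrative and nothing here calls into `arrow`.
   
   ```r
   # Value reported in the error message.
   reported <- -2147483584

   # A signed 32-bit integer has 2^32 distinct values and a maximum of 2^31 - 1.
   int32_span <- 2^32
   int32_max  <- 2^31 - 1

   # ASSUMPTION: the true size wrapped around exactly once in a signed
   # 32-bit integer, so adding 2^32 recovers the size that was requested.
   requested <- reported + int32_span

   format(requested, scientific = FALSE)  # "2147483712"
   requested - 2^31                       # 64 bytes past the 2^31 boundary
   requested > int32_max                  # TRUE: does not fit in 32 bits
   ```
   
   Under that assumption, the failing resize asked for just over 2 GiB in a 
single buffer, which would point to a 32-bit size limit rather than to a 
problem with the query itself.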
   
   ### Component(s)
   
   R

