Liamtoha commented on issue #35268:
URL: https://github.com/apache/arrow/issues/35268#issuecomment-1528160830

   > ### Describe the enhancement requested
   > Hi everyone,
   > 
   > I'm representing a group of researchers that is working with observational 
health data. We have an [ecosystem of packages mostly in 
R](https://github.com/OHDSI/) and have been exploring using `arrow` as a 
backend for working with our data instead of `sqlite`. We've been impressed by 
the speed improvements and were almost ready to make the switch, but we've hit a 
roadblock.
   > 
   > The current sorting in arrow (using `dplyr::arrange`) is taking too much 
memory. Looking at it further, I see this 
[operation](https://arrow.apache.org/docs/dev/cpp/streaming_execution.html#order-by-sink)
 is a `pipeline breaker` and seems to accumulate everything in memory before 
sorting with a single thread.
   > 
   > I also see it mentioned in many places that the plan is to improve this and 
add spillover mechanisms to the sort and other `pipeline breakers`.
   > 
   > I did a small comparison between `arrow`, our current solution, `duckdb` 
and `dplyr`. I measured time and max memory with `gnu time`:
   > 
   > | | | Small | Medium |
   > |---|---|---|---|
   > | memory after `dplyr::compute()` | | 1.1 GB | 5.1 GB |
   > | arrow (arrange and then write_dataset) | memory | 3.1 GB | 14.1 GB |
   > | | time | 1 minute 12 sec | 8 minutes 46 sec |
   > | dplyr (collect and then arrange) | memory | 3.6 GB | 15.9 GB |
   > | | time | 11 seconds | 1 minute |
   > | duckdb (from parquet files) | memory | 4.3 GB | 19.3 GB |
   > | | time | 4 seconds | 21 seconds |
   > | Our current solution (uses sqlite) | memory | 240 MB | 260 MB |
   > | | time | 2 minutes 30 seconds | 13 min 22 sec |
   > 
   > As you can see, our current solution is slow but will never run out of 
memory.
   > 
   > It would be very nice if spillover were added to the sort in arrow so we 
could specify a memory limit, ensuring we don't run out of memory and can sort 
larger-than-memory data. I hope you would consider this feature for the 
near future (even for arrow `13.0.0`).
   > 
   > I just wanted to open this issue to make you aware that this is a blocker 
for us at the moment. We don't have the C++ knowledge to contribute to a solution 
for this, but would be glad to help if changes to the R bindings are needed, 
and of course with testing.
   > 
   > ### Component(s)
   > C++
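
For readers unfamiliar with the "spillover" the quoted issue asks for: one common way to sort under a memory limit is an external merge sort, which sorts bounded chunks in memory, writes each sorted run to disk, and then streams a k-way merge over the runs. The block below is only a minimal Python sketch of that general technique. It is not Arrow's implementation or API, and the names `external_sort`, `max_rows_in_memory`, `_spill_run` and `_read_run` are hypothetical. It also assumes a row-count limit as a stand-in for a byte limit.

```python
import heapq
import os
import pickle
import tempfile
from itertools import islice


def _spill_run(sorted_rows, tmp_dir):
    """Write one already-sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for row in sorted_rows:
            pickle.dump(row, f)
    return path


def _read_run(path):
    """Stream rows back from a spilled run, one row at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def external_sort(rows, max_rows_in_memory, key=None):
    """Yield rows in sorted order (hypothetical helper, illustration only).

    Run phase: at most `max_rows_in_memory` input rows are held at once.
    Merge phase: one row per spilled run is held at once.
    """
    if max_rows_in_memory < 1:
        raise ValueError("max_rows_in_memory must be at least 1")
    rows = iter(rows)
    with tempfile.TemporaryDirectory() as tmp_dir:
        run_paths = []
        while True:
            # Take the next bounded chunk, sort it in memory, spill it to disk.
            chunk = list(islice(rows, max_rows_in_memory))
            if not chunk:
                break
            chunk.sort(key=key)
            run_paths.append(_spill_run(chunk, tmp_dir))
        # Lazily merge all sorted runs; heapq.merge keeps one row per run.
        yield from heapq.merge(*(_read_run(p) for p in run_paths), key=key)
```

For example, `list(external_sort([5, 3, 9, 1, 7, 2, 8], max_rows_in_memory=3))` sorts seven values while never holding more than three input rows during the run phase. A real engine would track bytes rather than rows and would typically spill in a columnar format instead of pickled rows.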
   
   Commit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
