JoaoAparicio commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1502630284
I have some thoughts. One solution to "my dataset is larger than memory" is
partitioning. If your dataset is partitioned in such a way that each partition
fits in memory, you can iterate it with
```julia
using Arrow

stream = Arrow.Stream(path)
for tbl in stream
    # each `tbl` is one partition, fully decompressed in memory
    ...
end
```
You can do this right now without requiring any additional features from
this package.
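For instance, here is a minimal sketch of that pattern: a running aggregate computed one partition at a time, so only a single partition is ever materialized. The file path and the column name `x` are illustrative, not from the original comment.

```julia
using Arrow, Tables

# Hypothetical example: sum a column `x` across all partitions
# of a partitioned Arrow file without loading the whole dataset.
total = 0.0
for tbl in Arrow.Stream("partitioned.arrow")  # path is an assumption
    total += sum(Tables.getcolumn(tbl, :x))   # process, then discard
end
```

Each iteration decompresses only one record batch, so peak memory stays bounded by the largest partition.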
In contrast, what is discussed in #340 (namely: don't decompress if you
don't have to) is a different approach, and it doesn't exist yet.
Currently I have some commits that add multi-threaded decompression at the
buffer level, and I will be trying to upstream what I have so far. The
difficulty is that these commits touch a lot of code, so this won't happen
overnight; a couple of weeks, I imagine. On top of that, it should be
straightforward to implement what is discussed in #340.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]