JoaoAparicio commented on PR #422:
URL: https://github.com/apache/arrow-julia/pull/422#issuecomment-1502630284
I have some thoughts. One solution to "my dataset is larger than memory" is
partitioning. If your dataset is partitioned in such a way that each partition
fits in memory, you can iterate it with
```julia
using Arrow

stream = Arrow.Stream(path)
for tbl in stream
    # each `tbl` is one partition, fully decompressed in memory
    ...
end
```
You can do this right now without requiring any additional features from
this package.
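For instance, here is a minimal sketch of that pattern: a running aggregate computed one partition at a time, so only a single partition is ever materialized. The file path and the column name `x` are illustrative, not from the original comment.

```julia
using Arrow, Tables

# Hypothetical example: sum a column `x` across all partitions
# of a partitioned Arrow file without loading the whole dataset.
total = 0.0
for tbl in Arrow.Stream("partitioned.arrow")  # path is an assumption
    total += sum(Tables.getcolumn(tbl, :x))   # process, then discard
end
```

Each iteration decompresses only one record batch, so peak memory stays bounded by the largest partition.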
In contrast, what is discussed in #340 (namely: don't decompress if you
don't have to) is a different approach, and it doesn't exist yet.
Currently I have some commits that add multi-threaded decompression at the
buffer level, and I will be trying to upstream what I have so far. The
difficulty is that these commits touch a lot of code, so this won't happen
overnight; a couple of weeks, I imagine. On top of that, it should be
straightforward to implement what is discussed in #340.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]