[GitHub] [arrow-julia] JoaoAparicio opened a new pull request, #412: Add kwarg to filter columns

via GitHub Sun, 02 Apr 2023 17:14:58 -0700


JoaoAparicio opened a new pull request, #412:
URL: https://github.com/apache/arrow-julia/pull/412


   Currently there is no option to load only a subset of the columns.
This matters, e.g., when decompression is the bottleneck.
   
   For example, create a compressed Arrow file:
   
   ```julia
   using Arrow
   p = tempname();
   N = 1000000
   tbl = (
       a=rand(N),
       b=rand(N),
       c=rand(N),
       d=rand(N),
       e=rand(N),
       f=[rand(rand(0:100)) for _ in 1:N],
   );
   Arrow.write(p, tbl; compress=:zstd);
   ```
   
   Column `f` is by far the largest: it has an expected 50*N elements, versus
N for each of the others. Sometimes we only care about some of the other
columns, but currently we must decompress all columns regardless:
   ```julia
   using BenchmarkTools
   @btime tbl = Arrow.Table(p);  # 359.205 ms (530 allocations: 794.23 MiB)
   ```
   With this commit we can load only some of the columns:
   ```julia
   @btime tbl = Arrow.Table(p; filtercolumns=["a"]);  # 6.146 ms (231 allocations: 14.33 MiB)
   ```
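
   As a rough sanity check of the "50*N elements" estimate for column `f`, the
sketch below counts the stored values per column in memory. It does not use
Arrow at all; the smaller `N` and the names `n_a` and `n_f` are illustrative
assumptions, not part of this PR.

   ```julia
   # Smaller N than in the example above, only to keep the check quick.
   N = 10_000
   tbl = (
       a=rand(N),
       f=[rand(rand(0:100)) for _ in 1:N],  # each row holds 0 to 100 values
   );

   # Total number of Float64 values held by each column.
   n_a = length(tbl.a)
   n_f = sum(length, tbl.f)

   println("a: ", n_a, " values")
   println("f: ", n_f, " values (about ", round(n_f / n_a; digits=1), "x column a)")
   ```

   Since `rand(0:100)` averages 50, the ratio should come out close to 50,
which is consistent with most of the decompression work going into `f`.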


