Moelf opened a new issue, #417:
URL: https://github.com/apache/arrow-julia/issues/417

   This is an example to demonstrate what "early returning" means, following a
discussion on Slack with Alexander Plavin.
   
   ## tl;dr
   The idea is that you filter on N columns, but maybe 20% of those columns are
enough to fail 80% of the rows **early**. We need an ergonomic interface that
delays allocation (or worse, decompression, when the Feather file is
compressed) for as long as possible.
   
   
   ## Setup
   ```julia
   
   julia> using Arrow, BenchmarkTools
   
   julia> function gendata()
              x = [rand(rand(0:10)) for _ = 1:10^5]
              y = [randn(rand(0:10)) for _ = 1:10^5]
              (;x, y)
          end
   
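   julia> foreach(1:10) do _
              Arrow.append("./out.feather", gendata())
          end
   
   julia> tbl = Arrow.Table("./out.feather");  # assumed: `tbl` is used below but was not defined in the original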
   ```
   
   ## Benchmark
   ```julia
   julia> function kernel1(xs, ys)
              s1 = maximum(ys; init=0.0)
              s1 < 5 && return false
   
              maximum(xs; init=0.0) < 0.7 && return false
              return true
          end
   kernel1 (generic function with 1 method)
   
   julia> @benchmark map(kernel1, tbl.x, tbl.y)
   BenchmarkTools.Trial: 19 samples with 1 evaluation.
    Range (min … max):  264.955 ms … 271.889 ms  ┊ GC (min … max): 3.83% … 4.93%
    Time  (median):     267.430 ms               ┊ GC (median):    3.82%
    Time  (mean ± σ):   267.704 ms ±   2.096 ms  ┊ GC (mean ± σ):  4.17% ± 0.47%
   
     ▁   ▁ ██    ▁▁     ▁ ▁     █▁  ▁  ▁     ▁         ▁       ▁ ▁
     █▁▁▁█▁██▁▁▁▁██▁▁▁▁▁█▁█▁▁▁▁▁██▁▁█▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁█ ▁
     265 ms           Histogram: frequency by time          272 ms <
   
    Memory estimate: 192.34 MiB, allocs estimate: 2000004.
   
   julia> function kernel2(xss, yss)
              map(eachindex(xss)) do i
                  ys = yss[i]
                  s1 = maximum(ys; init=0.0)
                  s1 < 5 && return false
   
                  xs = xss[i]
                  maximum(xs; init=0.0) < 0.7 && return false
                  return true  # as in kernel1; without this the block never yields true
              end
          end
   kernel2 (generic function with 1 method)
   
   julia> @benchmark kernel2(tbl.x, tbl.y)
   BenchmarkTools.Trial: 34 samples with 1 evaluation.
    Range (min … max):  149.177 ms … 156.043 ms  ┊ GC (min … max): 3.57% … 3.49%
    Time  (median):     150.438 ms               ┊ GC (median):    3.57%
    Time  (mean ± σ):   151.088 ms ±   1.503 ms  ┊ GC (mean ± σ):  3.92% ± 0.69%
   
             ██▂    ▂
     ▅▁▅▁▁▁▅▅███▅█▅▁█▅▁▁▁▁▁▅▁▁▁▁▁▁▅▁▅▅▁▅▅▅▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
     149 ms           Histogram: frequency by time          156 ms <
   
    Memory estimate: 96.62 MiB, allocs estimate: 1000002.
   ```
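   
   One possible shape for such an interface, as a rough sketch only: the
   helper `earlyfilter` and the wrapper `CountingColumn` below are hypothetical
   names, not part of Arrow.jl. Predicates are applied column by column, and a
   row stops at the first failing predicate, so later columns are never indexed
   for that row. The counting wrapper exists only to make the skipped accesses
   visible on small in-memory data.
   
   ```julia
   # Hypothetical wrapper that counts how often a column is indexed.
   struct CountingColumn{T}
       data::Vector{T}
       hits::Base.RefValue{Int}
   end
   CountingColumn(data::Vector) = CountingColumn(data, Ref(0))
   Base.getindex(c::CountingColumn, i) = (c.hits[] += 1; c.data[i])
   Base.length(c::CountingColumn) = length(c.data)
   
   # Hypothetical helper: `cols` and `preds` are tuples of equal length.
   # `all` short-circuits, so `col[i]` is only evaluated for columns that
   # come before (or at) the first failing predicate.
   function earlyfilter(cols, preds)
       n = length(first(cols))
       return map(1:n) do i
           all(pred(col[i]) for (col, pred) in zip(cols, preds))
       end
   end
   
   # Same cuts as kernel1/kernel2, on tiny hand-written data.
   ycol = CountingColumn([[0.1], [6.0], [7.0], Float64[]])
   xcol = CountingColumn([[0.9], [0.1], [0.8], [0.9]])
   
   mask = earlyfilter(
       (ycol, xcol),
       (ys -> maximum(ys; init=0.0) >= 5, xs -> maximum(xs; init=0.0) >= 0.7),
   )
   
   # Only rows 2 and 3 pass the `y` cut, so `x` is indexed twice, not four times.
   println(mask)        # Bool[0, 0, 1, 0]
   println(ycol.hits[]) # 4
   println(xcol.hits[]) # 2
   ```
   
   Putting the cheapest or most selective column first is what makes this pay
   off; whether the same pattern composes well with a real `Arrow.Table` is
   exactly the open question of this issue.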

