Thanks, maybe I'll give it a try and include it manually in the repo!

> improve performance and usability on complex apply/map
It will definitely help, but I'm already creating a single loop for each formula, no matter how many tensors are involved. E.g.:

```nim
let df = ... # some DF w/ cols A, B, C, D
df.mutate(f{"Foo" ~ `A` * `B` - `C` / `D`})
```

will already be rewritten to:

```nim
var
  col0_47816020 = toTensor(df["A"], float)
  col1_47816021 = toTensor(df["B"], float)
  col2_47816022 = toTensor(df["C"], float)
  col3_47816023 = toTensor(df["D"], float)
  res_47816024 = newTensor[float](df.len)
for idx in 0 ..< df.len:
  []=(res_47816024, idx,
      col0_47816020[idx] * col1_47816021[idx] - col2_47816022[idx] / col3_47816023[idx])
result = toColumn res_47816024
```

which is indeed a little slower than a manual `map_inline`, but still pretty fast. Compare the first plot from here: [https://github.com/Vindaar/ggplotnim/tree/arraymancerBackend/benchmarks/pandas_compare](https://github.com/Vindaar/ggplotnim/tree/arraymancerBackend/benchmarks/pandas_compare)

Not sure where the variations `map_inline` sees are coming from, though. Effects of OpenMP?

**Small aside about the types**

The data types are determined to be floats from the usage of `*`, `/` etc. They could be overridden by giving type hints:

```nim
f{int -> float: ...}
#  ^      ^--- type of the resulting tensor
#  `--- type of the involved tensors
```

> AFAIK it should would allow combining complex transformations and do them in a single pass instead of allocating many intermediate dataframes so performance can be an order of magnitude faster on zip/map/filter chains.

While this is certainly exciting to think about, I think it'd be pretty hard (for me in the near future, anyway) to achieve while:

1. keeping it simple to extend the library by adding new procs
2. still allowing usage of the procs in the normal way, i.e. returning a new DF (without having differently named procs for in-place / non-in-place variants).

But this is just me speculating from the not-all-that-simple code of zero-functional. I guess having a custom operator like it does would allow us to replace the user-given proc names though.
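To make the fusion idea from the quote concrete, here's a minimal sketch (in Python purely for illustration; the data and steps are made up, not anything from the library) of why a fused single pass beats a chain of separate map/filter steps: each unfused step allocates a full intermediate collection, analogous to an intermediate data frame per operation.

```python
data = list(range(10))

# Unfused: every step materializes an intermediate list,
# like a chain of ops each allocating a fresh data frame.
step1 = [x * 2 for x in data]                # map
step2 = [x + 1 for x in step1]               # map
unfused = [x for x in step2 if x % 3 == 0]   # filter

# Fused: the whole chain runs in a single loop with no
# intermediate allocations -- the rewrite discussed above.
fused = []
for x in data:
    y = x * 2 + 1
    if y % 3 == 0:
        fused.append(y)

assert fused == unfused
```

The output is identical either way; the difference is purely in how many intermediate buffers get allocated and traversed.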
If you have a better idea of how to do efficient chaining that seems reasonable to implement, I'm all ears.

**What I'm working on**

Right now I'm rather worrying about having decent performance for `group_by` and `inner_join`. I've been looking at [https://h2oai.github.io/db-benchmark/](https://h2oai.github.io/db-benchmark/) since yesterday. It's a rather brutal reality check, hehe. Comparing my current code on the first of the 0.5 GB `group_by` examples to pandas and data.table was eye opening.

In my current implementation of `summarize` for grouped data frames I actually return the sub data frames for each group and apply a simple reduce operation based on the user's formula. Well, what a surprise: that's slow. I haven't dug deep into data.table or pandas yet, but as far as I can tell they essentially special-case `group_by` + another operation and handle these by aggregating over all groups in a single pass. So I've implemented the same, and even for a single key with a single sum I'm 2x slower than running the code with pandas on my machine. To be fair, performing the operations on the sub groups individually is a nice 100x slower than pandas.

Still, the biggest performance penalty I have to accept is in order to allow grouping by columns with multiple data types: I need some way to check which subgroup a row belongs to. Since I can't create a tuple at runtime in order to just use the normal comparison operators, I decided to calculate a hash for each row and compare that. That works well, but gives me that 2x speed penalty. For the time being I'm happy with that, unless I get a better idea / someone can point me to something that works in a typed language and doesn't involve a huge amount of boilerplate code.

So I'm currently working on an implementation that allows using user-defined formulas for aggregation without having to call a closure for each row.
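The single-pass, hash-per-row aggregation described above can be sketched roughly as follows (again a Python illustration under my own assumptions, not the actual implementation; `grouped_sum`, `keys`, and `values` are hypothetical names, and the sketch ignores hash collisions, which a real implementation has to handle):

```python
def grouped_sum(keys, values):
    """Single pass over all rows: hash each row's key values and
    accumulate into a table keyed by that hash, instead of building
    a sub data frame per group and reducing each one separately."""
    sums = {}
    for key_row, v in zip(keys, values):
        # Stands in for the per-row hash over (possibly mixed-type) key columns.
        h = hash(key_row)
        if h in sums:
            sums[h][1] += v          # existing group: just accumulate
        else:
            sums[h] = [key_row, v]   # new group: remember its key and start the sum
    # Re-key the result by the original group keys for readability.
    return {k: s for k, s in sums.values()}

# Grouping by a single key column and summing a value column:
keys = [("a",), ("b",), ("a",), ("b",), ("a",)]
values = [1, 2, 3, 4, 5]
result = grouped_sum(keys, values)   # {("a",): 9, ("b",): 6}
```

The cost per row is one hash plus one table lookup, which is where the remaining constant-factor overhead versus pandas would come from in this scheme.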