There are a few interesting topics in the sync document [1] that I think would make for a good sync-up. For personal reasons I can no longer host such meetings on Thursdays for a while, but I would like to propose one for next Wednesday March 20, 2022 at 15:00 UTC. Please see the document for more details and to offer comments.

I also want to remind the community that anyone should feel free to organize meetups on days / timezones that work well for them and publicize them in the document, in the slack channel and on the mailing list.

Andrew

[1] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#

On Fri, Mar 11, 2022 at 4:56 PM Bob Tinsman <bobti...@gmail.com> wrote:

> I just missed the call, but I watched the recording (thank you to Andrew
> for posting [1]). Really interesting!
> I'm diving into Arrow because I have some previous experience with
> in-memory query engines. I'm following discussions around improving
> performance and adding features so I can determine how best to contribute.
>
> In particular, I was interested in some of the background for the JIT
> implementation [2] and the row format [3] but I guess I'm missing context.
> I saw the comment in #1708 [4] that "many pipeline-breaking operators are
> inherently row-based".
> My questions:
> - By "pipeline-breaking" I assume you mean "very slow", but can you give me
> details? Does this arise from some particular observation, or other
> reported issues?
> - An example would be nice, like "select a, b, c from blah order by d"
> with table "blah" having 1 million rows and 10 columns takes 5 minutes, or
> even anecdotal evidence like mailing list discussions
> - In general, what tools are you using to analyze datafusion performance?
> - The criterion benchmarks are nice but do you have anything higher-level
> which exercises a broad range of workloads?
> - How much profiling have you done to identify bottlenecks?
>
> To be honest, I was kind of surprised to see using a row format to solve a
> performance problem, but I figured you must have good reasons, and I'm
> still getting my brain around datafusion's query execution model. Thanks
> for any illumination!
>
> [1] https://youtu.be/5NJcqXm6uE0
> [2] https://github.com/apache/arrow-datafusion/pull/1849
> [3] https://github.com/apache/arrow-datafusion/pull/1782
> [4] https://github.com/apache/arrow-datafusion/issues/1708
>
> On Tue, Mar 8, 2022 at 12:25 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> > I am not sure if everyone saw it in the agenda[1], but we plan to have a
> > meeting tomorrow. I'll plan to record it for anyone who can not make this
> > time.
> >
> > 15:00 UTC Wednesday March 9, 2022
> > Meeting Location: (in agenda)
> > Matthew Turner: focused on JIT and row representation, next Wednesday,
> > March 9th,
> > @yijie: JIT overview
> >
> > [1]
> > https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
> >
> > On Thu, Mar 3, 2022 at 12:50 AM Benson Muite <benson_mu...@emailplus.org>
> > wrote:
> >
> > > Interested in learning more about this. Can work through the code and
> > > discuss on 17 March either 4:00 or 16:00 UTC.
> > >
> > > Benson
> > >
> > > On 3/3/22 12:03 AM, Andrew Lamb wrote:
> > > > I noticed that Matthew Turner added a note to the agenda[1] for a walk
> > > > through of the JIT code. I would be interested in this as well -- would
> > > > anyone plan to be on the call and discuss it?
> > > >
> > > > I don't think I have time to prepare that content prior
> > > >
> > > > Andrew
> > > >
> > > > [1]
> > > > https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#