paul-rogers commented on issue #2421: URL: https://github.com/apache/drill/issues/2421#issuecomment-1007611673
Hi James,

One could do something like what you described. However, to have all of Drill work with Arrow would be a huge amount of work. Optimizations made for one format would be sub-optimal for the other. (Example: exchanges.) Furthermore, your use case would benefit from vectors only in the project and grouping operators.

So, I wonder if we might think about the problem operator-by-operator. If you have a compute-heavy phase, might that first transform data to vectors, apply the compute, then send data along in row format? Every fragment does a network exchange: data is read/written anyway. So, perhaps there is something that can be done to transform formats at fragment boundaries (he says, waving hands wildly...)

You'll also get speed only for queries without joins. If you have joins, then the joins are likely to take the vast majority of the runtime, leaving your projection and grouping in the noise. I'm not sure how vectorization can help joins; certainly in Drill today, vectors make the join code atrociously complex.

This is why DBs (and compiler optimizers) are hard: the answers change based on use case...

Thanks,

- Paul

On Wed, Jan 5, 2022 at 9:03 PM James Turton <***@***.***> wrote:

> Okay, @paul-rogers <https://github.com/paul-rogers> I've had a few swigs
> of the kool aid by now and I think I'm ready to forget about in-memory
> column orientation and SIMD in return for the benefits of row orientation.
> For workflows that do involve bulk arithmetic I can imagine good interop
> taking care of that stage:
>
> 1. Do some efficient parsing, filtering, sorting, aggregating in Drill
> 2. Smoothly switch over to Pandas/Numpy (perhaps an Arrow exporter?)
>    or Julia or ...
> 3. Do bulk arithmetic using SIMD
> 4. Store results or smoothly switch back to Drill
>
> I've used this workflow myself where the data interchange format was
> Parquet and the transport medium was the DFS (so perhaps a bit more
> "clunky" than "smooth", with lots of serialisation and IO incurred).
>
> Going further, if the decoupling of Drill from its in-memory format
> mentioned above is a real possibility then can we even imagine something
> like this, entirely in Drill?
>
> alter session set exec.memory_format = 'drill'; -- the default, row-oriented format
>
> create table as select ... -- do some efficient parsing, filtering, sorting, aggregating in Drill
> create table as select ... -- do some efficient parsing, filtering, sorting, aggregating in Drill
>
> alter session set exec.memory_format = 'arrow'; -- switch to Arrow format
>
> create table as select ... do some bulk arithmetic using SIMD
> create table as select ... do some bulk arithmetic using SIMD
>
> To my mind Drill 2.0 would not try to ship support for the latter, Arrow
> format, merely make design decisions which leave that door open for a
> motivated developer...