I finally got time to watch the video. In Arrow talks, I've been selling a similar idea: universal, very efficient UDFs that use Arrow as the interchange format. One killer application Wes and I talked about is an Arrow-enabled PySpark. I suspect the Weld representation is not very different from Arrow, and it's quite possible that the efficient operations they built could be adapted to it. I guess we'll know in January.

Julian, what you describe makes sense to me. I played with code gen some time ago [1], and I'm wondering what vectorized operations we could add besides expression eval.

[1] https://github.com/julienledem/brennus
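
For the Filter+Project example in Julian's mail (“select x + y … where x > 5”), here is a rough sketch of the kind of loop a code generator might emit, including the "x is already known to be non-null" optimization. It is plain Python, not Calcite, Drill, or Arrow code; the function name `filter_project` and the column layout (a value list plus a parallel validity list) are assumptions made up for illustration.

```python
from typing import List, Tuple


def filter_project(
    x: List[int], x_valid: List[bool],
    y: List[int], y_valid: List[bool],
) -> Tuple[List[int], List[bool]]:
    """Evaluate `select x + y ... where x > 5` over two columns.

    Each column is a list of values plus a parallel validity list
    (a hypothetical stand-in for a validity bitmap).
    """
    out_values: List[int] = []
    out_valid: List[bool] = []
    for i in range(len(x)):
        # Filter: a null x never satisfies x > 5, so the row is dropped.
        if not (x_valid[i] and x[i] > 5):
            continue
        # Project: x is known to be non-null here, so only y is checked.
        if y_valid[i]:
            out_values.append(x[i] + y[i])
            out_valid.append(True)
        else:
            out_values.append(0)  # placeholder slot for a null result
            out_valid.append(False)
    return out_values, out_valid
```

With x = [3, 7, 0, 9] (third entry null) and y = [1, 2, 3, 0] (last entry null), this returns ([9, 0], [True, False]): the second row passes the filter and yields 9, and the fourth row passes the filter but projects to null because y is null.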
On Mon, Nov 21, 2016 at 4:34 PM, Hanifi GUNES <hanifigu...@gmail.com> wrote:

> Looks interesting. I see some commonalities. I hope the original work (in progress?) will make references to Arrow so that we will all know the distinguishing points better.
>
> 2016-11-20 8:31 GMT-08:00 Donald E. Foss <donald.f...@gmail.com>:
>
> > Thanks Julian. Sounds worth a listen.
> >
> > Donald E. Foss (mobile-US ET)
> >
> > > On Nov 19, 2016, at 1:48 PM, Julian Hyde <jh...@apache.org> wrote:
> > >
> > > Matei Zaharia just spoke at the AMPlab seminar [1], and showed a couple of slides about Weld. In the video of the day [2], his talk starts at 4:05:00, and he starts talking about Weld at 4:28:30.
> > >
> > > The essence is an intermediate language for row-level expressions, with the ability to do limited iteration, with the goal of making it easier to pass data between UDFs written in different languages. Sounds familiar? I would presume that an implementation of the language would be strongly tied to a memory format. Or maybe it allows multiple possible implementations, one of which would be Arrow in Java.
> > >
> > > The slide listed Pandas as one of the supported front ends, so I wondered if Wes knew something about the project.
> > >
> > > I have been thinking of doing something similar in the Calcite / Drill / Arrow world. In Calcite we have RexNodes as an expression language, and we have a Java code generator that can target data represented as Java arrays, and another variant that can target data represented as Java structs. Drill of course has a code generator that can target data in Arrow. I have been thinking for a while of abstracting the code generators so that the person implementing, say, the Filter+Project for “select x + y … where x > 5” doesn’t have to get their hands dirty with code generation. There are a lot of optimizations to be done, e.g. remembering that you’ve already made sure that x is not null.
> > >
> > > Julian
> > >
> > > [1] https://amplab.cs.berkeley.edu/endofproject/
> > >
> > > [2] https://youtu.be/KAacs9jYPHU
> > >
> > >> On Nov 19, 2016, at 4:31 AM, Donald Foss <donald.f...@gmail.com> wrote:
> > >>
> > >> Did you find that at https://cs.stanford.edu/~matei/? That’s the only thing I can find via Google about it. Do you have more detail or a link to the paper itself? I get the feeling that it is not yet fully complete despite 21 November camera-ready CIDR 2017 deadline.
> > >>
> > >> For those who aren’t familiar with CIDR, it is a conference that occurs every other year. This year’s agenda/program may be found at http://cidrdb.org/cidr2017/program.html. CIDR is not an acronym for network subnet masks (the first thing I thought of, Classless Inter Domain Routing), but Conference on Innovative Data Systems Research, which focuses primarily on systems. I hate to admit this, but I’m unfamiliar with the conference, however that appears that it is because I’ve been out of academia for far too long, and this conference seems to be the presentation of quite a few interesting papers. Just judging by title, a poor, yet humorous judge indeed, I like:
> > >> - “Dependency-Driven Analytics: A Compass for Uncharted Data Oceans” (Donald - Why just data lakes when you can have data oceans?)
> > >> - “My Weak Consistency is Strong” (Donald - Great title, reminds me of Star Wars and the “Force”)
> > >> - “SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning” (Donald - Another brilliant backronym.)
> > >>
> > >> The Weld paper is the last paper to be presented on 10 January 2017 between 2:30 and 4:05 (UTC-8).
> > >>
> > >> On a side note, looking down that page a little, I love the title of the last paper in 2016, Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale <https://cs.stanford.edu/~matei/papers/2016/nips_yggdrasil.pdf>. When I see Yggdrasil, the first thing I think of is a really big tree and Norse mythology. It’s a great name. I’m going to read some of his other papers this weekend.
> > >>
> > >> Donald Foss
> > >> donald.f...@gmail.com
> > >> ------ __o
> > >> ----_`\<,_
> > >> ---(_)/ (_)
> > >>
> > >> The information in this email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this e-mail by anyone else is unauthorized.
> > >>
> > >>> On Nov 18, 2016, at 4:42 PM, Julian Hyde <jh...@apache.org> wrote:
> > >>>
> > >>> Anyone know anything about Matei Zaharia’s Weld project?
> > >>>
> > >>> • S. Palkar, J. Thomas, A. Shanbhag, H. Pirk, M. Schwarzkopf, S. Amarasinghe and M. Zaharia. Weld: A Common Runtime for High Performance Data Analytics, to appear at CIDR 2017.
> > >>>
> > >>> It seems to have similar goals to Arrow.
> > >>>
> > >>> Julian

--
Julien