Sorry for the delay, Julian. My replies inline.

On Fri, Jun 22, 2018 at 11:39 PM Julian Hyde <jh...@apache.org> wrote:

> This is exciting. We have wanted to build an Arrow adapter in Calcite for
> some time and have a prototype (see
> https://issues.apache.org/jira/browse/CALCITE-2173) but I hope that we
> can use Gandiva. I know that Gandiva has Java bindings, but will these
> allow queries to be compiled and executed from a pure Java process?
>

Yes. Dremio is a Java process and uses the Java bindings for Gandiva. You
could take a look at the Maven unit tests for an example.

>
> Can you describe Gandiva’s governance model? Without an open governance
> model, companies that compete with Dremio may be wary about contributing.
>

Jacques has replied on this.

>
> Can you compare and contrast your approach to Hyper[1]? Hyper is also
> concerned with efficient use to the bus, and also uses LLVM, but it has a
> different memory format and places much emphasis on lock-free data
> structures.
>
> I just attended SIGMOD and there were interesting industry papers from
> MemSQL[2][3] and Oracle RAPID[4]. I was impressed with some of the tricks
> MemSQL uses to achieve SIMD parallelism on queries such as “select k4,
> sum(x) from t group by k4” (where k4 has 4 values).
>
> I missed part of the RAPID talk, but I got the impression that they are
> using disk-based algorithms (e.g. hybrid hash join) to handle data spread
> between fast and slow memory.
>
> MemSQL uses TPC-H query 1 as a motivating benchmark and I think this would
> be good target for Gandiva also. It is a table scan with a range filter
> (returning 98% of rows), a low-cardinality aggregate (grouping by two
> fields with 3 values each), and several aggregate functions, the arguments
> of which contain common sub-expressions.
>

Thanks for the references - I'll look into them and get back.

Gandiva doesn't attempt to solve query optimization, efficient disk reads,
or work distribution across threads/VMs. We expect the higher layers (i.e.,
users of Gandiva) to handle this.
The expression builder returns a compiled, immutable "LLVM module", which
can be shared across threads. Once an expression is built, both the inputs
and outputs are Arrow vectors (actually, the input is a row batch). There
is no locking within Gandiva in the evaluation path.

We are also targeting performance evaluation using TPC-H, but we plan to
first address projections and filters before moving to aggregations.

>
> SELECT
>   l_returnflag,
>   l_linestatus,
>   sum(l_quantity),
>   sum(l_extendedprice),
>   sum(l_extendedprice * (1 - l_discount)),
>   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
>   avg(l_quantity),
>   avg(l_extendedprice),
>   avg(l_discount),
>   count(*)
> FROM lineitem
> WHERE l_shipdate <= date '1998-12-01' - interval '90' day
> GROUP BY
>   l_returnflag,
>   l_linestatus
> ORDER BY
>   l_returnflag,
>   l_linestatus;
>
> Julian
>
> [1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>
> [2] http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>
> [3] https://dl.acm.org/citation.cfm?id=3183713.3190658
>
> [4] https://dl.acm.org/citation.cfm?id=3183713.3190655
>
> > On Jun 22, 2018, at 7:22 AM, ravind...@gmail.com wrote:
> >
> > Hi everyone,
> >
> > I'm Ravindra and I'm a developer on the Gandiva project. I do believe
> > that the combination of arrow and llvm for efficient expression evaluation
> > is powerful, and has a broad range of use-cases. We've just started and
> > hope to finesse and add a lot of functionality over the next few months.
> >
> > Welcome your feedback and participation in gandiva !!
> >
> > thanks & regards,
> > ravindra.
> >
> > On 2018/06/21 19:15:20, Jacques Nadeau <jacq...@apache.org> wrote:
> >> Hey Guys,
> >>
> >> Dremio just open sourced a new framework for processing data in Arrow data
> >> structures [1], built on top of the Apache Arrow C++ APIs and leveraging
> >> LLVM (Apache licensed). It also includes Java APIs that leverage the Apache
> >> Arrow Java libraries. I expect the developers who have been working on this
> >> will introduce themselves soon. To read more about it, take a look at our
> >> Ravindra's blog post (he's the lead developer driving this work): [2].
> >> Hopefully people will find this interesting/useful.
> >>
> >> Let us know what you all think!
> >>
> >> thanks,
> >> Jacques
> >>
> >>
> >> [1] https://github.com/dremio/gandiva
> >> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
> >>
> >