In case somebody is interested, here is an isolated test to compare performance : https://gist.github.com/asereda-gs/9f09f3c46904c4b6670aeca602164c6b
I get the following results (10x difference) my previous comparison was with elastic search (now it is HSQLDB) : === Column columnCount: 10 direct count:200 50%:0 80%:0 90%:0 95%:0 max:1 calcite count:200 50%:25 80%:28 90%:30 95%:32 max:40 === Column columnCount: 100 direct count:200 50%:3 80%:3 90%:3 95%:3 max:5 calcite count:200 50%:48 80%:49 90%:50 95%:51 max:60 === Column columnCount: 500 direct count:200 50%:19 80%:20 90%:21 95%:22 max:29 calcite count:200 50%:190 80%:195 90%:200 95%:204 max:241 === Column columnCount: 1000 direct count:200 50%:43 80%:47 90%:50 95%:51 max:56 calcite count:200 50%:331 80%:342 90%:352 95%:361 max:389 It's possible that my test is doing something wrong or I'm measuring different operation (independent "audit" will be appreciated). I will check if I can narrow down the issue with a profiler. On Tue, Apr 30, 2019 at 4:07 PM Julian Hyde <[email protected]> wrote: > > As temporary work-around can I re-use PreparedStatement ? > > If you mean re-execute PreparedStatement, yes. Its main purpose is to only > pay the cost of preparation once, but execute multiple times (possibly with > different parameter values). > > But each time you execute it will be the same query and the same plan. > > Julian > > > > On Apr 30, 2019, at 11:43 AM, Andrei Sereda <[email protected]> wrote: > > > >> Consider using Hep planner rather than Volcano planner. > > Will check with Hep. Working on isolated unit test. > > > >> If you reduce the number of columns (to say 10), does the time reduce > > significantly? That might be a clue that there is a performance bug > > somewhere. > > Number of columns seems to be correlated with performance penalty: > > - for 10 columns: 120 (raw) vs 150ms (calcite) > > - for 50 columns: 260 (raw) vs 740ms (calcite) > > > >> Are these numbers on the first query, or after the system has warmed up? > > After. Following warmup I'm running 100 queries sequentially. > > > > As temporary work-around can I re-use PreparedStatement ? > > > > On Tue, Apr 30, 2019 at 2:06 PM Julian Hyde <[email protected]> wrote: > > > >> Consider using Hep planner rather than Volcano planner. (Measure the > >> number of rule firings. Is it higher than you think is necessary, given > the > >> complexity of the query?) > >> > >> If you reduce the number of columns (to say 10), does the time reduce > >> significantly? That might be a clue that there is a performance bug > >> somewhere. > >> > >> Are these numbers on the first query, or after the system has warmed up? > >> > >> Julian > >> > >> > >>> On Apr 30, 2019, at 9:41 AM, Andrei Sereda <[email protected]> wrote: > >>> > >>> Hello, > >>> > >>> One of our applications uses Calcite as translation layer between SQL > and > >>> destination source (mongo, elastic, etc.). The queries are fairly > simple > >>> and similar to the one below: > >>> > >>> select col1, col2, agg3(col3), agg4(col4), ..., aggN(colN) from table > >>> where id in (1, 2, 3) group by col1, col2 > >>> > >>> The only complexity is that number of columns can be fairly large (up > to > >>> 150) but otherwise it is a standard aggregation with some simple > >> predicates > >>> (no joins). Number of rows is small and usually is less than 1k. > >>> > >>> We have observed that overhead for such queries is 2x-3x (95th > >> percentile) > >>> compared to executing produced queries directly on the data-source (eg. > >>> mongo / elastic query). Difference is in the order of 100ms: 200ms > >> (direct) > >>> vs 600ms (calcite). Unfortunately such latency is noticeable in UI. > >>> > >>> Originally I thought it has to do with compilation time (janino) but > >>> profiling showed that most of time is spent in the following methods: > >>> > >>> 1. .preprare.Prepare.optimize() (VolcanoPanner) > >>> 2. .sql2rel.SqlToRelConverter.convertQuery() > >>> > >>> What can be done to avoid such overhead ? > >>> > >>> 1. Have avatica / calcite connection cache > >> connection.prepareStatement() > >>> so same optimization is not done twice ? Manually re-running same > >>> PreparedStatement helps. > >>> 2. Use interpreter ? > >>> 3. Manually re-use PreparedQuery (eg. from Cache<String, > >>> PreparedStatement>) ? This introduces other complexities like > executing > >>> same query in parallel. > >>> 4. Minimize number of Rules ? > >>> 5. Cache logical plan (RelNode) ? > >>> 6. Anything else ? > >>> > >>> Many thanks in advance. > >>> > >>> Andrei. > >> > >> > >
