Re: Calcite performance related question

Laptop huawei Thu, 08 Jun 2017 14:55:40 -0700

Hi, Vladimir,

Thank you for your response. That’s very helpful. We are actually using Calcite 
in the first way you described: as a parser and federation engine. 
Our data is spread over different data engines with different formats, e.g. 
CSV files, MySQL, Druid and Elasticsearch. We will definitely push the 
query down to the underlying systems as much as possible. We are using 
the adapters from the Calcite project and are creating our own for 
the data sources that are not supported yet.


So what I am looking to understand about the performance impact is: 
1. For queries that can be pushed down completely, what is the overhead added 
by going through Calcite, including the parse time and optimize time?

2. For queries that have to fall back to an in-memory join using Calcite’s 
built-in enumerable convention, what does the performance look like for a 
given set of inputs, assuming enough memory?

3. For large joins that cannot be done on one host, what’s the common 
practice? Using the Spark adapter?
 
4. Is there any self-protection to reject a query that requires too many 
local resources?
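
To make questions 2 and 4 concrete, here is a minimal sketch in plain Java of 
an in-memory hash join over two fully fetched inputs, with a row cap that 
rejects the join when the build side is too large. It is not Calcite’s 
enumerable join code; the class GuardedHashJoin, the exception 
ResourceLimitExceededException and the maxBuildRows parameter are hypothetical 
names made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a hand-written hash join, not Calcite's implementation. */
public class GuardedHashJoin {

    /** Hypothetical exception used to reject a join whose build side is too large. */
    public static class ResourceLimitExceededException extends RuntimeException {
        public ResourceLimitExceededException(String message) {
            super(message);
        }
    }

    /**
     * Inner-joins two fully fetched inputs on the value in column 0.
     * Each input row is assumed to be {key, value}.
     */
    public static List<String[]> join(
            List<String[]> left, List<String[]> right, int maxBuildRows) {
        // Self-protection: refuse to build a hash table beyond the configured cap.
        if (left.size() > maxBuildRows) {
            throw new ResourceLimitExceededException(
                "build side has " + left.size() + " rows, limit is " + maxBuildRows);
        }

        // Build phase: group the left rows by join key.
        Map<String, List<String[]>> table = new HashMap<>();
        for (String[] row : left) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }

        // Probe phase: emit one output row per matching (left, right) pair.
        List<String[]> out = new ArrayList<>();
        for (String[] r : right) {
            List<String[]> matches = table.get(r[0]);
            if (matches == null) {
                continue; // inner join: unmatched right rows are dropped
            }
            for (String[] l : matches) {
                out.add(new String[] {l[0], l[1], r[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
            new String[] {"1", "alice"}, new String[] {"2", "bob"});
        List<String[]> orders = Arrays.asList(
            new String[] {"1", "book"}, new String[] {"1", "pen"},
            new String[] {"3", "cup"});
        for (String[] row : join(users, orders, 1000)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

With the sample data in main, this prints 1,alice,book and 1,alice,pen. A cap 
smaller than the build side makes join throw before the hash table is 
allocated, which is the kind of rejection question 4 asks about.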

Thanks,
-JD


> On Jun 8, 2017, at 2:21 PM, Vladimir Sitnikov <[email protected]> 
> wrote:
> 
> Hello,
> 
>> Has anyone done any benchmark to evaluate Calcite’s performance impact?
>> Or is there any documentation regarding performance concerns?
> 
> Well, the performance depends on your use case.
> As far as I understand, here are the typical features:
> 1) Given a query, Calcite would try to push all the tables/predicates to
> the downstream executor (i.e. DB)
> 2) In case there are joins between different data stores, Calcite would
> still push down as many filters as possible, yet perform the join in memory
> 3) Calcite has no idea which indices are available at the storage level,
> thus don't expect it to generate plans like "for each row from the
> datastore1 go and fetch a relevant row from datastore2". In 100% of the
> cases it would be "hashjoin(full fetch datastore1, full fetch datastore2)"
> 
> 
> In case you are going to use Calcite as a proxy (that is Calcite would
> parse and just send the whole query downstream), then you might be
> interested in JMH-based benchmarks.
> Here they are:
> https://github.com/apache/calcite/blob/master/ubenchmark/src/main/java/org/apache/calcite/benchmarks/StatementTest.java
> Feel free to add more benchmarks there.
> 
> 
> PS. Index support is doable (one can fetch the sets of indexes from the
> downstream datastores), however it is not done yet.
> 
> Vladimir
