yeyonghao created CALCITE-4069:
----------------------------------

             Summary: Elasticsearch the result of count(*) will be added to the 
result list twice when use multiple aggregations without group by
                 Key: CALCITE-4069
                 URL: https://issues.apache.org/jira/browse/CALCITE-4069
             Project: Calcite
          Issue Type: Improvement
          Components: elasticsearch-adapter
            Reporter: yeyonghao


when you write a sql:
{code:java}
select count(*) from mytable
{code}
in:
{code:java}
org.apache.calcite.adapter.elasticsearch.ElasticsearchTable#aggregate
{code}
As an optimization, totalHit will be the result of the query for count(*):
{code:java}
// elastic exposes total number of documents matching a query in "/hits/total" 
path
// this can be used for simple "select count(*) from table"
final long total = res.searchHits().total().value();
{code}
It is then added to the result set:
{code:java}
if (groupBy.isEmpty()) {
  // put totals automatically for count(*) expression(s), unless they contain 
group by
  for (String expr : countAll) {
    result.forEach(m -> m.put(expr, total));
  }
}
{code}
This is fine if there is only one count(*) aggregate function in SQL.

But when you write multiple aggregate functions in youe SQL:
{code:java}
select count(*),sum(field1) from mytable{code}
You'll notice that the result of count(*) is repeatedly added to the result set 
(it doesn't affect the final result, but I think it can be optimized).

The result of count(*) is first added to the result set along with other 
aggregate function result values in the following code:
{code:java}
final List<Map<String, Object>> result = new ArrayList<>();
if (res.aggregations() != null) {
  // collect values
  ElasticsearchJson.visitValueNodes(res.aggregations(), m -> {
    // using 'Collectors.toMap' will trigger Java 8 bug here
    Map<String, Object> newMap = new LinkedHashMap<>();
    for (String key: m.keySet()) {
      newMap.put(fieldMap.getOrDefault(key, key), m.get(key));
    }
    result.add(newMap);
  });
} else {
  // probably no group by. add single result
  result.add(new LinkedHashMap<>());
}
{code}
The second time is added again in the following code:
{code:java}
// elastic exposes total number of documents matching a query in "/hits/total" 
path
// this can be used for simple "select count(*) from table"
final long total = res.searchHits().total().value();

if (groupBy.isEmpty()) {
  // put totals automatically for count(*) expression(s), unless they contain 
group by
  for (String expr : countAll) {
    result.forEach(m -> m.put(expr, total));
  }
}
{code}
Although such operations have no effect on the final result when there are 
multiple aggregate functions, it is obvious that they can be optimized.TotalHit 
is added to the result set only when groupby. isEmpty() and aggregation. size() 
== 1 simultaneously satisfy:
{code:java}
if (groupBy.isEmpty() && aggregations.size() == 1) {
   ...
}
{code}
This avoids repeatedly adding count(*) to the result set.

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to