[ 
https://issues.apache.org/jira/browse/SOLR-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376266#comment-15376266
 ] 

Julian Hyde commented on SOLR-8593:
-----------------------------------

You should probably model your join and aggregate operators as sub-classes of 
Join and Aggregate that understand the "distribution" trait. If you are doing, 
say, "group by x" then you will need your input either to be singleton (i.e. 
only one input stream) or partitioned on x. Calcite will be able to ensure that 
the input is partitioned appropriately, either because it is stored in 
partitions, or by applying a shuffle/exchange. 

There is the regular Exchange operator that changes the distribution (i.e. 
re-partitions) and there is SortExchange that changes the distribution and also 
sorts within each partition. SortExchange models what the shuffle does in 
MapReduce.
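
To make the distribution idea concrete, here is a toy sketch in plain Python. It is not Calcite code: `exchange`, `sort_exchange` and `group_count` are hypothetical helper names, and hash partitioning is only one assumed way of re-partitioning. The point it illustrates is that once the input is partitioned on x, a "group by x" can run independently on each partition.

```python
from collections import defaultdict

def exchange(rows, key, num_partitions):
    """Re-partition rows so that all rows sharing a key value land in one partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

def sort_exchange(rows, key, num_partitions):
    """Re-partition like exchange(), then sort the rows within each partition."""
    return [sorted(p, key=lambda r: r[key])
            for p in exchange(rows, key, num_partitions)]

def group_count(partition, key):
    """Local 'group by key' count; only correct if the partition holds
    every row for each key it contains."""
    counts = defaultdict(int)
    for row in partition:
        counts[row[key]] += 1
    return dict(counts)

rows = [{"x": 1}, {"x": 2}, {"x": 1}, {"x": 3}, {"x": 2}, {"x": 1}]
parts = exchange(rows, "x", 2)

# Because no key value appears in two partitions, the per-partition
# results can simply be unioned without a second aggregation step.
merged = {}
for p in parts:
    merged.update(group_count(p, "x"))
```

If the input were split arbitrarily instead, the same key could show up in several partitions and the per-partition counts would need a further merge, which is why the aggregate has to demand a distribution from its input.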

After you have a plan like

{noformat}
MyJoin[left.a = right.b]
  Exchange[a]
    MyAggregate
      Exchange
        Scan[T1]
  Exchange[b]
    Scan[T2]
{noformat}

you can turn it into MapReduce by making the consumer of each Exchange a 
reduce task, and the input to each Exchange a map task.
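
As a rough illustration of that cut, the following Python sketch walks a plan tree shaped like the one above and emits one (map side, reduce side) pair per Exchange. The `Node` class and `split_at_exchanges` function are hypothetical stand-ins, not Calcite's RelNode API, and the sketch assumes every Exchange has exactly one input.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)

def is_exchange(node):
    return node.name.startswith("Exchange")

def split_at_exchanges(root):
    """Return one (map_side, reduce_side) pair of operator names per Exchange."""
    pairs = []

    def visit(node):
        for child in node.inputs:
            if is_exchange(child):
                # The Exchange's input runs on the map side;
                # the operator consuming the Exchange runs on the reduce side.
                pairs.append((child.inputs[0].name, node.name))
            visit(child)

    visit(root)
    return pairs

plan = Node("MyJoin[left.a = right.b]", [
    Node("Exchange[a]", [
        Node("MyAggregate", [
            Node("Exchange", [Node("Scan[T1]")]),
        ]),
    ]),
    Node("Exchange[b]", [Node("Scan[T2]")]),
])

pairs = split_at_exchanges(plan)
for map_side, reduce_side in pairs:
    print(f"map: {map_side:<12} -> reduce: {reduce_side}")
```

Note that MyAggregate shows up twice: as the reduce side of the first shuffle and as the producer feeding the second one. How that maps onto chained MapReduce jobs is left open here.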

I asked [~ashutoshc] how he would generate Hive MapReduce plans in Calcite 
(most Hive plans these days are Tez) and he said you should consider writing a 
CoGroup operator (like the one in Pig). CoGroup is powerful enough to implement 
both join and aggregate, so it might save you some effort.
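
To show why one operator can cover both cases, here is a small Python sketch of a cogroup in the spirit of Pig's COGROUP. It is an illustration only: `cogroup`, `inner_join` and `count_by` are hypothetical names, everything is held in memory, and nothing here reflects Pig's or Calcite's actual implementation.

```python
from collections import defaultdict

def cogroup(left, right, left_key, right_key):
    """Group two inputs by key; return (key, left_rows, right_rows) per distinct key."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)
    return [(k, ls, rs)
            for k, (ls, rs) in sorted(groups.items(), key=lambda kv: kv[0])]

def inner_join(left, right, left_key, right_key):
    """Join = cogroup, then the cross product of the two bags for each key."""
    return [{**l, **r}
            for _, ls, rs in cogroup(left, right, left_key, right_key)
            for l in ls
            for r in rs]

def count_by(rows, key):
    """Aggregate = cogroup with an empty second input, then reduce each bag."""
    return {k: len(ls) for k, ls, _ in cogroup(rows, [], key, key)}

t1 = [{"a": 1, "v": "p"}, {"a": 2, "v": "q"}, {"a": 1, "v": "r"}]
t2 = [{"b": 1, "w": "x"}, {"b": 3, "w": "y"}]

joined = inner_join(t1, t2, "a", "b")
counts = count_by(t1, "a")
```

Both operations reduce to "bring the rows for each key together, then process each group", which is also what a shuffle followed by a reduce task does.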

> Integrate Apache Calcite into the SQLHandler
> --------------------------------------------
>
>                 Key: SOLR-8593
>                 URL: https://issues.apache.org/jira/browse/SOLR-8593
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Joel Bernstein
>
> The Presto SQL Parser was perfect for phase one of the SQLHandler. It was 
> nicely split off from the larger Presto project and it did everything that 
> was needed for the initial implementation.
> Phase two of the SQL work, though, will require an optimizer. Here is where 
> Apache Calcite comes into play. It has a battle-tested, cost-based optimizer 
> and has been integrated into Apache Drill and Hive.
> This work can begin in trunk following the 6.0 release. The final query plans 
> will continue to be translated to Streaming API objects (TupleStreams), so 
> continued work on the JDBC driver should plug in nicely with the Calcite work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
