[
https://issues.apache.org/jira/browse/TAJO-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513545#comment-14513545
]
Jihoon Son commented on TAJO-259:
---------------------------------
Sorry for late reply. Here are some brief explanation for them. I hope that
this would be helpful.
*SQL Parser*
Tajo uses antlr for parsing queries. Our syntax is defined SQLParser.g4 and
SQLLexer.g4 (You can see some definitions for data cube and rollup in
SQLParser.g4. We have only the syntax part for those operations). SQLAnalyzer
is responsible for generating abstract syntax trees based on the generated code
by antlr. An AST is a tree of Exprs.
*Logical Planner*
Logical planner generates logical plans for ASTs. A logical plan consists of
query blocks that is the basic unit of SQL that operates on tables or the
results of other queries. For example, the following query has two query blocks
for inner and outer queries.
{noformat}
select n_nationkey, n_name from (
select * from nation
) as t;
{noformat}
A query block contains a tree of LogicalNodes that contains the logical
information of operators. Basically, every LogicalNode has its type and
input/output schemas. Type-specific information can be additionally contained
such as aggregation keys or join types according to its type.
*Global Planner*
Global planner generates a MasterPlan based on the logical plan. As you know,
an operator such as aggregation, sort, or join can be executed in two or more
phases in distributed environments. The MasterPlan describes how the query will
be executed in a distributed manner. It consists of several ExecutionBlocks
each of which is mapped to a phase (called stage in Tajo).
Let me consider an example. A group-by query will be executed in two stages as
follows.
- Stage 2
{noformat}
Group-by (global aggregation)
|
Scan
{noformat}
- Stage 1
{noformat}
Shuffle (shuffle with aggregation keys)
|
Group-by (local aggregation)
|
Scan (scan shuffled data)
{noformat}
As noted above, each stage is mapped to different ExecutionBlocks. Tajo's query
master currently schedules these execution blocks one by one. (This will be
improved after improving our resource scheduler (TAJO-1397).
*Task Scheduler*
Once an execution block is scheduled, the query master generates tasks and
letting Tajo workers know that this execution block is ready to execute. Then,
the workers request tasks to the query master. Here, the query master tries to
assign tasks to workers while maximizing data locality.
*Physical Planner and Physical Executor*
Each task contains a logical plan of the execution block which the task is
generated from. Workers generate a physical plan from the logical plan. The
physical plan is a tree of PhysicalExec that is the actual implementation of
operators.
Our query execution is based on Volcano model. Every PhysicalExec has the
next() function which calls its child's next() function. So, when Tajo workers
call the next() function of the most top executor, subsequent child executors
are called recursively.
I think that you should investigate and implement the above parts for this
issue except task scheduler.
If you have any further questions, please feel free to ask.
> 'GROUP BY CUBE' query execution
> -------------------------------
>
> Key: TAJO-259
> URL: https://issues.apache.org/jira/browse/TAJO-259
> Project: Tajo
> Issue Type: Sub-task
> Components: distributed query plan
> Reporter: Jihoon Son
> Assignee: Atri Sharma
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)