Re: Parallel Aggregates

Jihoon Son Tue, 16 Jun 2015 04:38:24 -0700

In Tajo, aggregation is very similar to that in Hadoop MapReduce.
Let me consider an example. Given a query of "select *k*, count(*) from *t*
group by *k*", Tajo generates a LogicalPlan as follows.


group by (k)
       |
   scan (t)

This LogicalPlan is translated into a MasterPlan as follows.

-----------------
     Stage2
  group by *k*
-----------------
          |
shuffle tuples with *k*
          |
-----------------
     Stage1
  group by *k*
         |
    scan *t*
-----------------

As you can see in this example, the query plan consists of 2 stages. Each
stage is executed subsequently because the result of Stage 1 is used as the
input of Stage 2. Each stage is divided into multiple tasks for each input
split as follows.

Stage1

Task 1
group by *k*
       |
  scan *t* (0 - 99)

Task 2
group by *k*
       |
  scan *t* (100 - 199)
...

Each task is executed by a TajoWorker. As you can see, tasks of the first
stage execute a local aggregation after scanning input split. This local
aggregation result is shuffled among TajoWorkers with the aggregation key
*k*. Then, the final aggregation is computed at the second stage.

Stage1 and Stage2 are similar to Map and Reduce of MapReduce. The local
aggregation of Stage1 is similar to the Combiner of Hadoop MapReduce.

I hope that this will be helpful to you.
If you have any further questions, please feel free to ask.
Jihoon

2015년 6월 16일 (화) 오전 7:28, Atri Sharma <[email protected]>님이 작성:

Thanks.
>
> What are your thoughts on parallel aggregation? Generating query plans that
> allow states to be generated which can be executed independently and then
> states recombined?
> On 16 Jun 2015 05:25, "Jihoon Son" <[email protected]> wrote:
>
> > Hi Atri, thanks for your question.
> >
> > First of all, maybe you already did, I recommend that you read this
> article
> > <
> >
> http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html
> > >
> > before you start implementation. This is written by Hyunsik, and contains
> > the description of Tajo's overall infrastructure. Afterwards, I think
> that
> > you may ask more detailed question.
> >
> > Here, I'll roughly list some important classes for aggregate
> > implementation.
> >
> >    - SQLParser.g4 contains our SQL parsing rules. It is written in antlr.
> >    - SQLAnalyzer is our parser based on rules defined at SQLParser.g4.
> >    - SQLAnalyzer translates a SQL query into a tree of Expr which
> >    represents an algebraic expression.
> >    - LogicalPlanner translates the Expr tree into a LogicalPlan that
> >    logically describes how the given query will be executed.
> >    - GlobalPlanner translates the LogicalPlan into a MasterPlan
> >    (distributed query execution plan) that describes how the given query
> > will
> >    be executed in distributed cluster.
> >    - Once a MasterPlan is created, QueryMaster starts to execute query
> >    processing. A query consists of multiple stages, which are
> individually
> >    processed in some order.
> >       - For example, a simple aggregation query is executed in two
> stages,
> >       each of which is for parallel aggregation and combining aggregates.
> > These
> >       stages are executed sequentially.
> >    - A stage is concurrently processed by multiple tasks, and is executed
> >    by TajoWorker.
> >    - Each task contains meta information for input data and a LogicalPlan
> >    of the stage. This LogicalPlan is translated into PhysicalExec by
> >    PhysicalPlanner.
> >    - PhysicalExec describes how the query is actually executed.
> >       - For example, there are two types of AggregationExec,
> >       i.e., HashAggregateExec and SortAggregateExec, for hash-based
> > aggregation
> >       and sort-based aggregation, respectively.
> >
> > Best regards,
> > Jihoon
> >
> > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <[email protected]>님이 작성:
> >
> > > Folks,
> > >
> > > I am looking into parallel aggregates/combining aggregates. I have a
> plan
> > > around it which I think can work.
> > >
> > > Please update me on current infrastructure and point me around the
> > existing
> > > code base. Also, ideas would be most welcome around it.
> > >
> > > --
> > > Regards,
> > >
> > > Atri
> > > *l'apprenant*
> > >
> >
>

Re: Parallel Aggregates

Reply via email to