Thank you. Is there any improvement in aggregates that we are looking at please? On 16 Jun 2015 17:07, "Jihoon Son" <[email protected]> wrote:
> In Tajo, aggregation is very similar to that in Hadoop MapReduce. > Let me consider an example. Given a query of "select *k*, count(*) from *t* > group by *k*", Tajo generates a LogicalPlan as follows. > > group by (k) > | > scan (t) > > This LogicalPlan is translated into a MasterPlan as follows. > > ----------------- > Stage2 > group by *k* > ----------------- > | > shuffle tuples with *k* > | > ----------------- > Stage1 > group by *k* > | > scan *t* > ----------------- > > As you can see in this example, the query plan consists of 2 stages. Each > stage is executed subsequently because the result of Stage 1 is used as the > input of Stage 2. Each stage is divided into multiple tasks for each input > split as follows. > > Stage1 > > Task 1 > group by *k* > | > scan *t* (0 - 99) > > Task 2 > group by *k* > | > scan *t* (100 - 199) > ... > > Each task is executed by a TajoWorker. As you can see, tasks of the first > stage execute a local aggregation after scanning input split. This local > aggregation result is shuffled among TajoWorkers with the aggregation key > *k*. Then, the final aggregation is computed at the second stage. > > Stage1 and Stage2 are similar to Map and Reduce of MapReduce. The local > aggregation of Stage1 is similar to the Combiner of Hadoop MapReduce. > > I hope that this will be helpful to you. > If you have any further questions, please feel free to ask. > Jihoon > > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma <[email protected]>님이 작성: > > Thanks. > > > > What are your thoughts on parallel aggregation? Generating query plans > that > > allow states to be generated which can be executed independently and then > > states recombined? > > On 16 Jun 2015 05:25, "Jihoon Son" <[email protected]> wrote: > > > > > Hi Atri, thanks for your question. > > > > > > First of all, maybe you already did, I recommend that you read this > > article > > > < > > > > > > http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html > > > > > > > before you start implementation. This is written by Hyunsik, and > contains > > > the description of Tajo's overall infrastructure. Afterwards, I think > > that > > > you may ask more detailed question. > > > > > > Here, I'll roughly list some important classes for aggregate > > > implementation. > > > > > > - SQLParser.g4 contains our SQL parsing rules. It is written in > antlr. > > > - SQLAnalyzer is our parser based on rules defined at SQLParser.g4. > > > - SQLAnalyzer translates a SQL query into a tree of Expr which > > > represents an algebraic expression. > > > - LogicalPlanner translates the Expr tree into a LogicalPlan that > > > logically describes how the given query will be executed. > > > - GlobalPlanner translates the LogicalPlan into a MasterPlan > > > (distributed query execution plan) that describes how the given > query > > > will > > > be executed in distributed cluster. > > > - Once a MasterPlan is created, QueryMaster starts to execute query > > > processing. A query consists of multiple stages, which are > > individually > > > processed in some order. > > > - For example, a simple aggregation query is executed in two > > stages, > > > each of which is for parallel aggregation and combining > aggregates. > > > These > > > stages are executed sequentially. > > > - A stage is concurrently processed by multiple tasks, and is > executed > > > by TajoWorker. > > > - Each task contains meta information for input data and a > LogicalPlan > > > of the stage. This LogicalPlan is translated into PhysicalExec by > > > PhysicalPlanner. > > > - PhysicalExec describes how the query is actually executed. > > > - For example, there are two types of AggregationExec, > > > i.e., HashAggregateExec and SortAggregateExec, for hash-based > > > aggregation > > > and sort-based aggregation, respectively. > > > > > > Best regards, > > > Jihoon > > > > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma <[email protected]>님이 작성: > > > > > > > Folks, > > > > > > > > I am looking into parallel aggregates/combining aggregates. I have a > > plan > > > > around it which I think can work. > > > > > > > > Please update me on current infrastructure and point me around the > > > existing > > > > code base. Also, ideas would be most welcome around it. > > > > > > > > -- > > > > Regards, > > > > > > > > Atri > > > > *l'apprenant* > > > > > > > > > >
