Thanks! On 18 Jun 2015 21:24, "Jaehwa Jung" <[email protected]> wrote:
> Hi Atri > > I think that following articles would be helpful for you to understand tajo > architecture. > > - Tajo query execution and scheduling sequence : http://jaso.co.kr/501 > - Tajo stage flow and StorageManager: http://jaso.co.kr/503 > > For the reference, above articles had been written by Hyoungjun Kim who is > Tajo PMC. :) > > Cheers > Jaehwa > > 2015-06-18 20:21 GMT+09:00 Jihoon Son <[email protected]>: > > > It seems that there aren't ongoing issues for distinct aggregation. > > > > 2015년 6월 18일 (목) 오전 10:50, Atri Sharma <[email protected]>님이 작성: > > > > > And for DISTINCT issues? > > > On 18 Jun 2015 15:16, "Jihoon Son" <[email protected]> wrote: > > > > > > > As far as I know, TAJO-256, TAJO-259, and TAJO-1420 are issues for > data > > > > cube and grouping sets. > > > > You can create any issues if you want. Even though some issues can be > > > > duplicated, it's ok. > > > > 2015년 6월 18일 (목) 오전 10:39, Atri Sharma <[email protected]>님이 작성: > > > > > > > > > Do we have a ticket around that? > > > > > On 18 Jun 2015 15:07, "Jihoon Son" <[email protected]> wrote: > > > > > > > > > > > It looks good to start. > > > > > > Any questions welcome! > > > > > > > > > > > > Jihoon > > > > > > > > > > > > 2015년 6월 18일 (목) 오전 3:39, Atri Sharma <[email protected]>님이 > 작성: > > > > > > > > > > > > > So distinct aggregation is one area, thanks. > > > > > > > > > > > > > > I am trying to get enough knowledge of Internals of aggregation > > > > engine > > > > > > and > > > > > > > query planner to be able to work on rollup and cube so picking > > > > smaller > > > > > > > tickets first. > > > > > > > On 18 Jun 2015 02:42, "Jihoon Son" <[email protected]> > wrote: > > > > > > > > > > > > > > > As far as I know, there aren't any plans for improvement > except > > > in > > > > > > > distinct > > > > > > > > aggregation. I think that our code for distinct aggregation > is > > > too > > > > > > > > complicated, and the performance also should be improved. > > > > > > > > > > > > > > > > So, when you design the implementation of your algorithm on > > Tajo, > > > > you > > > > > > > don't > > > > > > > > have to consider distinct aggregation part, I think. > > > > > > > > > > > > > > > > 2015년 6월 18일 (목) 오전 2:16, Atri Sharma <[email protected] > >님이 > > > 작성: > > > > > > > > > > > > > > > > > Thank you. > > > > > > > > > > > > > > > > > > Is there any improvement in aggregates that we are looking > at > > > > > please? > > > > > > > > > On 16 Jun 2015 17:07, "Jihoon Son" <[email protected]> > > > wrote: > > > > > > > > > > > > > > > > > > > In Tajo, aggregation is very similar to that in Hadoop > > > > MapReduce. > > > > > > > > > > Let me consider an example. Given a query of "select *k*, > > > > > count(*) > > > > > > > from > > > > > > > > > *t* > > > > > > > > > > group by *k*", Tajo generates a LogicalPlan as follows. > > > > > > > > > > > > > > > > > > > > group by (k) > > > > > > > > > > | > > > > > > > > > > scan (t) > > > > > > > > > > > > > > > > > > > > This LogicalPlan is translated into a MasterPlan as > > follows. > > > > > > > > > > > > > > > > > > > > ----------------- > > > > > > > > > > Stage2 > > > > > > > > > > group by *k* > > > > > > > > > > ----------------- > > > > > > > > > > | > > > > > > > > > > shuffle tuples with *k* > > > > > > > > > > | > > > > > > > > > > ----------------- > > > > > > > > > > Stage1 > > > > > > > > > > group by *k* > > > > > > > > > > | > > > > > > > > > > scan *t* > > > > > > > > > > ----------------- > > > > > > > > > > > > > > > > > > > > As you can see in this example, the query plan consists > of > > 2 > > > > > > stages. > > > > > > > > Each > > > > > > > > > > stage is executed subsequently because the result of > Stage > > 1 > > > is > > > > > > used > > > > > > > as > > > > > > > > > the > > > > > > > > > > input of Stage 2. Each stage is divided into multiple > tasks > > > for > > > > > > each > > > > > > > > > input > > > > > > > > > > split as follows. > > > > > > > > > > > > > > > > > > > > Stage1 > > > > > > > > > > > > > > > > > > > > Task 1 > > > > > > > > > > group by *k* > > > > > > > > > > | > > > > > > > > > > scan *t* (0 - 99) > > > > > > > > > > > > > > > > > > > > Task 2 > > > > > > > > > > group by *k* > > > > > > > > > > | > > > > > > > > > > scan *t* (100 - 199) > > > > > > > > > > ... > > > > > > > > > > > > > > > > > > > > Each task is executed by a TajoWorker. As you can see, > > tasks > > > of > > > > > the > > > > > > > > first > > > > > > > > > > stage execute a local aggregation after scanning input > > split. > > > > > This > > > > > > > > local > > > > > > > > > > aggregation result is shuffled among TajoWorkers with the > > > > > > aggregation > > > > > > > > key > > > > > > > > > > *k*. Then, the final aggregation is computed at the > second > > > > stage. > > > > > > > > > > > > > > > > > > > > Stage1 and Stage2 are similar to Map and Reduce of > > MapReduce. > > > > The > > > > > > > local > > > > > > > > > > aggregation of Stage1 is similar to the Combiner of > Hadoop > > > > > > MapReduce. > > > > > > > > > > > > > > > > > > > > I hope that this will be helpful to you. > > > > > > > > > > If you have any further questions, please feel free to > ask. > > > > > > > > > > Jihoon > > > > > > > > > > > > > > > > > > > > 2015년 6월 16일 (화) 오전 7:28, Atri Sharma < > [email protected] > > > >님이 > > > > > 작성: > > > > > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > > > What are your thoughts on parallel aggregation? > > Generating > > > > > query > > > > > > > > plans > > > > > > > > > > that > > > > > > > > > > > allow states to be generated which can be executed > > > > > independently > > > > > > > and > > > > > > > > > then > > > > > > > > > > > states recombined? > > > > > > > > > > > On 16 Jun 2015 05:25, "Jihoon Son" < > [email protected] > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Hi Atri, thanks for your question. > > > > > > > > > > > > > > > > > > > > > > > > First of all, maybe you already did, I recommend that > > you > > > > > read > > > > > > > this > > > > > > > > > > > article > > > > > > > > > > > > < > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > http://www.hadoopsphere.com/2015/02/technical-deep-dive-into-apache-tajo.html > > > > > > > > > > > > > > > > > > > > > > > > > before you start implementation. This is written by > > > > Hyunsik, > > > > > > and > > > > > > > > > > contains > > > > > > > > > > > > the description of Tajo's overall infrastructure. > > > > > Afterwards, I > > > > > > > > think > > > > > > > > > > > that > > > > > > > > > > > > you may ask more detailed question. > > > > > > > > > > > > > > > > > > > > > > > > Here, I'll roughly list some important classes for > > > > aggregate > > > > > > > > > > > > implementation. > > > > > > > > > > > > > > > > > > > > > > > > - SQLParser.g4 contains our SQL parsing rules. It > is > > > > > written > > > > > > > in > > > > > > > > > > antlr. > > > > > > > > > > > > - SQLAnalyzer is our parser based on rules defined > > at > > > > > > > > > SQLParser.g4. > > > > > > > > > > > > - SQLAnalyzer translates a SQL query into a tree > of > > > Expr > > > > > > which > > > > > > > > > > > > represents an algebraic expression. > > > > > > > > > > > > - LogicalPlanner translates the Expr tree into a > > > > > LogicalPlan > > > > > > > > that > > > > > > > > > > > > logically describes how the given query will be > > > > executed. > > > > > > > > > > > > - GlobalPlanner translates the LogicalPlan into a > > > > > MasterPlan > > > > > > > > > > > > (distributed query execution plan) that describes > > how > > > > the > > > > > > > given > > > > > > > > > > query > > > > > > > > > > > > will > > > > > > > > > > > > be executed in distributed cluster. > > > > > > > > > > > > - Once a MasterPlan is created, QueryMaster starts > > to > > > > > > execute > > > > > > > > > query > > > > > > > > > > > > processing. A query consists of multiple stages, > > which > > > > are > > > > > > > > > > > individually > > > > > > > > > > > > processed in some order. > > > > > > > > > > > > - For example, a simple aggregation query is > > > executed > > > > > in > > > > > > > two > > > > > > > > > > > stages, > > > > > > > > > > > > each of which is for parallel aggregation and > > > > combining > > > > > > > > > > aggregates. > > > > > > > > > > > > These > > > > > > > > > > > > stages are executed sequentially. > > > > > > > > > > > > - A stage is concurrently processed by multiple > > tasks, > > > > and > > > > > > is > > > > > > > > > > executed > > > > > > > > > > > > by TajoWorker. > > > > > > > > > > > > - Each task contains meta information for input > data > > > > and a > > > > > > > > > > LogicalPlan > > > > > > > > > > > > of the stage. This LogicalPlan is translated into > > > > > > PhysicalExec > > > > > > > > by > > > > > > > > > > > > PhysicalPlanner. > > > > > > > > > > > > - PhysicalExec describes how the query is actually > > > > > executed. > > > > > > > > > > > > - For example, there are two types of > > > > AggregationExec, > > > > > > > > > > > > i.e., HashAggregateExec and SortAggregateExec, > > for > > > > > > > hash-based > > > > > > > > > > > > aggregation > > > > > > > > > > > > and sort-based aggregation, respectively. > > > > > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > > > > Jihoon > > > > > > > > > > > > > > > > > > > > > > > > 2015년 6월 15일 (월) 오후 11:32, Atri Sharma < > > > > [email protected] > > > > > >님이 > > > > > > > 작성: > > > > > > > > > > > > > > > > > > > > > > > > > Folks, > > > > > > > > > > > > > > > > > > > > > > > > > > I am looking into parallel aggregates/combining > > > > > aggregates. I > > > > > > > > have > > > > > > > > > a > > > > > > > > > > > plan > > > > > > > > > > > > > around it which I think can work. > > > > > > > > > > > > > > > > > > > > > > > > > > Please update me on current infrastructure and > point > > me > > > > > > around > > > > > > > > the > > > > > > > > > > > > existing > > > > > > > > > > > > > code base. Also, ideas would be most welcome around > > it. > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > > > > > > > > > > Atri > > > > > > > > > > > > > *l'apprenant* > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
