Thanks Hyunsik and Owen. The DAG-based approach to representing query plans is closely aligned with the system I have been working on as part of my current studies at UC Irvine with Prof. Mike Carey: AsterixDB [0].

[0]: http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf

On Mon, May 27, 2013 at 10:27 PM, Owen O'Malley <[email protected]> wrote:

> On Mon, May 27, 2013 at 3:46 PM, Tejas Patil <[email protected]> wrote:
>
> > Please correct me if I am wrong.
> >
> > Hive: converts a query to MapReduce job(s). Can work on large-scale
> > data irrespective of the size of the result set.
>
> Hive will continue to support MapReduce, but it will also get support for
> Tez. Tez is an Apache project that is building an execution engine that
> runs under YARN. By running under Tez instead of MapReduce, Hive will
> gain:
> * Use one job instead of many, and thus not let go of resources before
>   the query is done
> * Remove the hard synchronization barrier between jobs
> * Allow Hive to shuffle from memory instead of from hard disk
>
> > Impala: runs daemons across all data nodes to get results. No MapReduce
> > job is launched. Good for queries with a small result set.
> > Tajo: converts a query to MapReduce 2 job(s). Smarter in terms of the
> > query plans generated and physical operator selection, both based on
> > cluster characteristics.
> >
> > On Sun, May 26, 2013 at 7:47 AM, Jihoon Son <[email protected]> wrote:
> >
> > > I'm sorry to send this mail again.
> > > I cannot understand why the lower part of the above mail is regarded
> > > as a signature.
> > > =====================================================
> > >
> > > Hi, Tejas
> > >
> > > The key difference between Tajo and Impala is the design goal. To
> > > increase the performance of query processing, Impala adopts an
> > > approach in which main memory is utilized as much as possible and
> > > intermediate data are transferred via streaming. If a query requires
> > > too much memory, Impala cannot process the query. Thus, Impala says
> > > that it is not an alternative to Hive.
> > >
> > > However, Tajo uses query optimization that considers user queries,
> > > characteristics of the data, the status of the cluster, and so on.
> > > Thus, Tajo can process a query with Impala's algorithm, Hive's
> > > algorithm, or any other algorithm. For example, Tajo can process a
> > > join query using the repartition join or the merge join. Intermediate
> > > results can be materialized to disk or maintained in memory. Since
> > > Tajo builds a query plan considering the various factors mentioned
> > > above, it can always process user queries. So, we can say that Tajo
> > > can be an alternative to Hive.
> > >
> > > Tajo can outperform Hive for most queries. The key reason is that
> > > Tajo uses its own query engine while Hive uses MapReduce. This limits
> > > Hive to MapReduce-based algorithms, whereas Tajo can use a more
> > > optimized algorithm.
> > >
> > > A sort query is a good example. Hive supports only hash partitioning.
> > > Thus, each node sorts data locally in the map phase and *ONE NODE*
> > > must perform the global sort in the reduce phase.
> > > However, Tajo supports a sort algorithm using range partitioning. In
> > > the first phase, each node sorts data locally as in Hive, but the
> > > intermediate data are partitioned by the range of the sort key. In
> > > the second phase, each node performs a local sort to get the final
> > > results. Since the intermediate data are partitioned by the range of
> > > the sort key, the final results are correct.
> > >
> > > If you have any questions about this,
> > > please feel free to ask.
> > >
> > > Thanks,
> > > Jihoon
> > >
> > > 2013/5/26 Tejas Patil <[email protected]>
> > >
> > > > Hi @dev,
> > > >
> > > > Can anyone comment on the difference between Tajo, Hive and Impala?
> > > > Also, what is the reason for Tajo to perform well over Hive? In
> > > > what scenario would it be good to use Tajo, and when would it be
> > > > bad?
> > > >
> > > > Thanks,
> > > > Tejas Patil
> > > > http://www.linkedin.com/in/tejaspatil1
> > >
> > > --
> > > Jihoon Son
> > >
> > > Database & Information Systems Group,
> > > Prof. Yon Dohn Chung Lab.
> > > Dept. of Computer Science & Engineering,
> > > Korea University
> > > 1, 5-ga, Anam-dong, Seongbuk-gu,
> > > Seoul, 136-713, Republic of Korea
> > >
> > > Tel : +82-2-3290-3580
> > > E-mail : [email protected]
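
For readers following the sort example in Jihoon's message, here is a minimal sketch of the two-phase, range-partitioned sort it describes. It is not Tajo's or Hive's code: the function names, the in-memory lists standing in for nodes, and the use of a random sample to choose range boundaries are all assumptions made for illustration (the message does not say how the ranges are picked).

```python
import bisect
import random


def choose_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sample of sort keys.

    Hypothetical helper: sampling is an assumed way to pick ranges.
    """
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partition_sort(node_data, num_partitions, sample_size=100, seed=0):
    """Sort keys spread over several 'nodes' (here: plain lists).

    Returns one sorted list per partition; concatenating them in order
    gives the globally sorted result, so no single node sorts everything.
    """
    all_keys = [key for chunk in node_data for key in chunk]
    if not all_keys:
        return [[] for _ in range(num_partitions)]

    rng = random.Random(seed)
    sample = rng.sample(all_keys, min(sample_size, len(all_keys)))
    boundaries = choose_boundaries(sample, num_partitions)

    # Phase 1: each node sorts its own data, then routes every key to the
    # partition whose key range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for chunk in node_data:
        for key in sorted(chunk):
            partitions[bisect.bisect_right(boundaries, key)].append(key)

    # Phase 2: each partition is sorted locally. Ranges do not overlap,
    # so partition i holds only keys <= every key in partition i + 1.
    return [sorted(part) for part in partitions]


def single_reducer_sort(node_data):
    """Contrast case: local sorts, then one node orders all keys."""
    locally_sorted = [sorted(chunk) for chunk in node_data]
    return sorted(key for chunk in locally_sorted for key in chunk)
```

Both functions produce the same overall ordering; the difference is that the range-partitioned version spreads the second phase over `num_partitions` workers, while the single-reducer version funnels every key through one. With skewed keys or a poor sample the partitions can be uneven, which is why boundary selection matters in practice.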
