Re: difference between Tajo, Hive and Impala

Jihoon Son Sun, 26 May 2013 07:48:10 -0700

I'm sorry to send this mail again.
I cannot understand why the lower part of the above mail is regarded as a
signature.
=====================================================


Hi, Tejas

The key difference between Tajo and Impala is the design goal. To increase
the performance of query processing, Impala adopts an approach in which
main memory is utilized as much as possible and intermediate data are
transferred via streaming. If a query requires too much memory, Impala
cannot process the query. Thus, Impala says that it is not an alternative
to Hive.

In contrast, Tajo uses a query optimization that considers the user query,
the characteristics of the data, the status of the cluster, and so on. Thus,
Tajo can process a query with Impala's algorithm, Hive's algorithm, or any
other algorithm. For example, Tajo can process a join query using either
the repartition join or the merge join. Intermediate results can be
materialized to disk or kept in memory. Since Tajo builds a query plan
considering the factors mentioned above, it can always process user
queries. So, we can say that Tajo can be an alternative to Hive.
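
To make the idea of choosing between memory and disk concrete, here is a
toy sketch in Python. It is not Tajo's optimizer: the names PlanInputs,
choose_intermediate_storage and headroom are hypothetical, and the single
size check stands in for the many factors a real planner would weigh.

```python
from dataclasses import dataclass


@dataclass
class PlanInputs:
    # Hypothetical planner inputs, for illustration only.
    intermediate_bytes: int  # estimated size of the intermediate result
    free_memory_bytes: int   # memory currently available for the query


def choose_intermediate_storage(inputs: PlanInputs, headroom: float = 0.5) -> str:
    """Keep intermediate data in memory only if it fits within a fraction
    (headroom) of the free memory; otherwise fall back to disk."""
    budget = inputs.free_memory_bytes * headroom
    return "memory" if inputs.intermediate_bytes <= budget else "disk"
```

The point of the fallback branch is that a query too large for memory is
still planned, just with materialized intermediate results, instead of
being rejected.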

Tajo can outperform Hive for most queries. The key reason is that Tajo
uses its own query engine while Hive uses MapReduce. This restricts Hive to
MapReduce-based algorithms, whereas Tajo can use a more optimized
algorithm.

A sort query is a good example. Hive supports only hash partitioning.
Thus, each node sorts data locally in the map phase, and *ONE NODE* must
perform the global sort in the reduce phase.
However, Tajo supports a sort algorithm that uses range partitioning. In
the first phase, each node sorts data locally as in Hive, but the
intermediate data are partitioned by the range of the sort key. In the
second phase, each node performs a local sort to get the final results.
Since the intermediate data are partitioned by the range of the sort key,
concatenating the per-node results in range order gives a correctly sorted
final result.
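
The two-phase sort described above can be sketched in a few lines of
Python. This is only an illustration of the technique, not Tajo or Hive
code: the function names and the fixed split points (boundaries) are
assumptions, and a real engine would typically pick the ranges from data
statistics or sampling.

```python
import bisect
import heapq


def range_partition_sort(node_data, boundaries):
    """node_data: one list of keys per node.
    boundaries: sorted split points; partition i holds the keys that fall
    between boundaries[i-1] (inclusive) and boundaries[i] (exclusive)."""
    num_parts = len(boundaries) + 1
    shuffled = [[] for _ in range(num_parts)]  # sorted runs per partition

    # Phase 1: each node sorts locally, then splits its output by key range.
    for rows in node_data:
        runs = [[] for _ in range(num_parts)]
        for key in sorted(rows):
            runs[bisect.bisect_right(boundaries, key)].append(key)
        for part, run in enumerate(runs):
            if run:
                shuffled[part].append(run)

    # Phase 2: each partition is sorted independently (here by merging the
    # already-sorted runs it received); no single node sees all the data.
    return [list(heapq.merge(*runs)) for runs in shuffled]


def hash_partition_sort(node_data, num_parts):
    """Same two phases, but rows are routed by hash instead of by range."""
    parts = [[] for _ in range(num_parts)]
    for rows in node_data:
        for key in sorted(rows):
            parts[hash(key) % num_parts].append(key)
    return [sorted(part) for part in parts]
```

With range partitioning, reading the partitions in order yields a globally
sorted sequence. With hash partitioning each partition is sorted on its own,
but their key ranges overlap, so one more global merge step is still needed.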

If you have any questions about this,
please feel free to ask.

Thanks,
Jihoon



2013/5/26 Jihoon Son <[email protected]>

> 2013/5/26 Tejas Patil <[email protected]>
>
>> Hi @dev,
>>
>> Can anyone comment about the difference between Tajo, Hive and Impala ?
>> Also, what is the reason for Tajo to perform well over Hive ? In what
>> scenario would it be good to use Tajo ? and when would it be bad ?
>>
>> Thanks,
>> Tejas Patil
>> http://www.linkedin.com/in/tejaspatil1
>>



-- 
Jihoon Son

Database & Information Systems Group,
Prof. Yon Dohn Chung Lab.
Dept. of Computer Science & Engineering,
Korea University
1, 5-ga, Anam-dong, Seongbuk-gu,
Seoul, 136-713, Republic of Korea

Tel : +82-2-3290-3580
E-mail : [email protected]
