Re: difference between Tajo, Hive and Impala

Tejas Patil Mon, 27 May 2013 22:55:28 -0700

Thanks Hyunsik and Owen.

The DAG-based approach to representing query plans is closely aligned with
the system I have been working on as part of my current studies at UC
Irvine with Prof. Mike Carey: AsterixDB [0]


[0] : http://asterix.ics.uci.edu/pub/ASTERIX-DPD-2011.pdf


On Mon, May 27, 2013 at 10:27 PM, Owen O'Malley <[email protected]> wrote:

> On Mon, May 27, 2013 at 3:46 PM, Tejas Patil <[email protected]> wrote:
>
> > Please correct me if I am wrong.
> >
> > Hive : converts a query to MapReduce job(s). Can work on large-scale data
> > irrespective of the size of the result set.
> >
>
> Hive will continue to support MapReduce, but it will also get support for
> Tez. Tez is an Apache project that is building an execution engine that
> runs under YARN. By running under Tez instead of MapReduce, Hive will be
> able to:
>   * Use one job instead of many, and thus hold on to resources until the
>     query is done
>   * Remove the hard synchronization barrier between jobs
>   * Shuffle from memory instead of from hard disk
>
>
> > Impala : runs daemons across all data nodes to get results. No MapReduce
> > job is launched. Good for queries with a small result set.
> > Tajo : converts a query to MapReduce 2 job(s). Smarter in terms of the
> > query plans generated and physical operator selection, both based on
> > cluster characteristics.
> >
> >
> > On Sun, May 26, 2013 at 7:47 AM, Jihoon Son <[email protected]> wrote:
> >
> > > I'm sorry to send this mail again.
> > > I cannot understand why the lower part of the above mail is regarded
> > > as a signature.
> > > =====================================================
> > >
> > > Hi, Tejas
> > >
> > > The key difference between Tajo and Impala is the design goal. To
> > > increase query processing performance, Impala adopts an approach in
> > > which main memory is utilized as much as possible and intermediate
> > > data are transferred via streaming. If a query requires too much
> > > memory, Impala cannot process the query. Thus, Impala says that it is
> > > not an alternative to Hive.
> > >
> > > Tajo, however, uses query optimization that considers the user query,
> > > the characteristics of the data, the status of the cluster, and so on.
> > > Thus, Tajo can process a query with Impala's algorithm, Hive's
> > > algorithm, or any other algorithm. For example, Tajo can process a
> > > join query using either the repartition join or the merge join.
> > > Intermediate results can be materialized to disk or kept in memory.
> > > Since Tajo builds a query plan that takes the above factors into
> > > account, it can always process user queries. So we can say that Tajo
> > > can be an alternative to Hive.
> > >
> > > Tajo can outperform Hive for most queries. The key reason is that
> > > Tajo uses its own query engine while Hive uses MapReduce. This means
> > > Hive can use only MapReduce-based algorithms, whereas Tajo can use a
> > > more optimized algorithm.
> > >
> > > A sort query is a good example. Hive supports only hash partitioning.
> > > Thus, each node sorts data locally in the map phase and *ONE NODE*
> > > must perform the global sort in the reduce phase.
> > > Tajo, however, supports a sort algorithm that uses range partitioning.
> > > In the first phase, each node sorts data locally as in Hive, but the
> > > intermediate data are partitioned by range of the sort key. In the
> > > second phase, each node performs a local sort to get the final
> > > results. Since the intermediate data are partitioned by range of the
> > > sort key, the final results are correct.
> > >
> > > If you have any questions about this,
> > > please feel free to ask.
> > >
> > > Thanks,
> > > Jihoon
> > >
> > >
> > >
> > > > 2013/5/26 Tejas Patil <[email protected]>
> > > >
> > > >> Hi @dev,
> > > >>
> > > >> Can anyone comment about the difference between Tajo, Hive and
> > > >> Impala? Also, what is the reason for Tajo to perform well over
> > > >> Hive? In what scenario would it be good to use Tajo, and when
> > > >> would it be bad?
> > > >>
> > > >> Thanks,
> > > >> Tejas Patil
> > > >> http://www.linkedin.com/in/tejaspatil1
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Jihoon Son
> > > >
> > > > Database & Information Systems Group,
> > > > Prof. Yon Dohn Chung Lab.
> > > > Dept. of Computer Science & Engineering,
> > > > Korea University
> > > > 1, 5-ga, Anam-dong, Seongbuk-gu,
> > > > Seoul, 136-713, Republic of Korea
> > > >
> > > > Tel : +82-2-3290-3580
> > > > E-mail : [email protected]
> > > >
> > >
> >
>
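
As a rough illustration of the repartition join mentioned in the quoted mail, here is a minimal single-process Python sketch. It is not Tajo's implementation: the function names (partition_by_key, local_hash_join, repartition_join), the use of Python's built-in hash() as the partitioning function, and the sample rows are all assumptions made for illustration. Both inputs are partitioned on the join key so that matching rows land in the same partition, and each pair of partitions is then joined independently.

```python
from collections import defaultdict


def partition_by_key(rows, key_index, num_partitions):
    """Hash-partition rows on the join key (a stand-in for a network shuffle)."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions


def local_hash_join(left_rows, right_rows, left_key, right_key):
    """Join one pair of co-partitioned inputs using an in-memory hash table."""
    table = defaultdict(list)
    for row in left_rows:
        table[row[left_key]].append(row)
    return [l + r for r in right_rows for l in table.get(r[right_key], [])]


def repartition_join(left, right, left_key=0, right_key=0, num_partitions=4):
    """Partition both sides on the join key, then join partition by partition."""
    left_parts = partition_by_key(left, left_key, num_partitions)
    right_parts = partition_by_key(right, right_key, num_partitions)
    result = []
    for lp, rp in zip(left_parts, right_parts):
        # In a cluster, each pair would be joined on a different node.
        result.extend(local_hash_join(lp, rp, left_key, right_key))
    return result


# Hypothetical sample data: (user_id, name) and (user_id, item).
users = [(1, "ann"), (2, "bob"), (3, "cho")]
orders = [(1, "book"), (3, "pen"), (3, "ink"), (4, "cup")]
joined = sorted(repartition_join(users, orders))
```

Because rows with equal keys always hash to the same partition, no partition needs to see data from any other partition, which is what lets the per-partition joins run in parallel.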

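The sort example in the quoted mail can be sketched the same way. The following Python snippet is a simplified single-process model, not Tajo or Hive code: the helper names and the hard-coded boundaries are assumptions (a real system would have to choose the split points itself, for instance by sampling the sort key). It shows why concatenating locally sorted range partitions gives a globally sorted result, while doing the same with hash partitions generally does not.

```python
import bisect


def hash_partition(values, num_partitions):
    """Assign each value to a partition by hash; key order is not preserved."""
    parts = [[] for _ in range(num_partitions)]
    for value in values:
        parts[hash(value) % num_partitions].append(value)
    return parts


def range_partition(values, boundaries):
    """Assign each value to a partition by key range.

    boundaries is a sorted list of split points; partition i receives the
    values greater than boundaries[i - 1] and at most boundaries[i].
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for value in values:
        parts[bisect.bisect_left(boundaries, value)].append(value)
    return parts


def sort_partitions_and_concat(parts):
    """Second phase: sort each partition independently, then concatenate."""
    return [value for part in parts for value in sorted(part)]


# Hypothetical input and split points.
data = [42, 7, 19, 88, 3, 56, 23, 71]
range_result = sort_partitions_and_concat(range_partition(data, [20, 60]))
hash_result = sort_partitions_and_concat(hash_partition(data, 3))

print(range_result == sorted(data))  # range partitions concatenate in order
print(hash_result == sorted(data))   # hash partitions interleave key ranges
```

With range partitioning, every key in one partition is no larger than every key in the next, so the final sort work is spread across all nodes instead of being funneled through one.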