[Hadoop Wiki] Trivial Update of "Hive/Tutorial" by NeilConway

Apache Wiki Wed, 01 Apr 2009 17:37:52 -0700

The following page has been changed by NeilConway:
http://wiki.apache.org/hadoop/Hive/Tutorial

------------------------------------------------------------------------------
  = Concepts =
  == What is Hive ==
- Hive is the next generation infrastructure made with the goal of providing 
tools to enable easy data summarization, adhoc querying and analysis of detail 
data. In addition it also provides a simple query language called QL which is 
based on SQL and which enables users familiar with SQL to do adhoc querying, 
summarization and data analysis. At the same time, this language also allows 
traditional map/reduce programmers to be able to plug in their custom mappers 
and reducers to do more sophisticated analysis which may not be supported by 
the built in capabilities of the language. 
+ Hive is the next generation infrastructure made with the goal of providing 
tools to enable easy data summarization, adhoc querying and analysis of detail 
data. In addition it also provides a simple query language called QL which is 
based on SQL and which enables users familiar with SQL to do ad-hoc querying, 
summarization and data analysis. At the same time, this language also allows 
traditional map/reduce programmers to be able to plug in their custom mappers 
and reducers to do more sophisticated analysis which may not be supported by 
the built in capabilities of the language. 
  
  == What is NOT Hive ==
- Hive is based on hadoop which is a batch processing system. Accordingly, this 
system does not and cannot promise low latencies on queries. The paradigm here 
is strictly of submitting jobs and being notified when the jobs are completed 
as opposed to real time queries. As a result it should not be compared with 
systems like Oracle where analysis is done on a significantly smaller amount of 
data but the analysis proceeds much more iteratively with the response times 
between iterations being less than a few minutes. For Hive queries response 
times for even the smallest jobs can be of the order of 5-10 minutes and for 
larger jobs this may even run into hours.
+ Hive is based on Hadoop, which is a batch processing system. Accordingly, 
this system does not and cannot promise low latencies on queries. The paradigm 
here is strictly of submitting jobs and being notified when the jobs are 
completed as opposed to real time queries. As a result it should not be 
compared with systems like Oracle where analysis is done on a significantly 
smaller amount of data but the analysis proceeds much more iteratively with the 
response times between iterations being less than a few minutes. For Hive 
queries response times for even the smallest jobs can be of the order of 5-10 
minutes and for larger jobs this may even run into hours.
  
  In the following sections we provide a tutorial on the capabilities of the 
system. We start by describing the concepts of data types, tables and 
partitions (which are very similar to what you would find in a traditional 
relational database) and then illustrate the capabilities of the language with 
the help of some examples

