[Hadoop Wiki] Update of "Hive" by Ning Zhang

Apache Wiki Tue, 29 Dec 2009 14:13:58 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive?action=diff&rev1=54&rev2=55

--------------------------------------------------

  = What is Hive =
- [[http://hadoop.apache.org/hive/|Hive]] is a data warehouse infrastructure 
built on top of Hadoop that provides tools to enable easy data summarization, 
adhoc querying and analysis of large datasets data stored in Hadoop files. It 
provides a mechanism to put structure on this data and it also provides a 
simple query language called QL which is based on SQL and which enables users 
familiar with SQL to query this data. At the same time, this language also 
allows traditional map/reduce programmers to be able to plug in their custom 
mappers and reducers to do more sophisticated analysis which may not be 
supported by the built in capabilities of the language.
+ [[http://hadoop.apache.org/hive/|Hive]] is a data warehouse infrastructure 
built on top of [[.|Hadoop]]. It provides tools to enable easy data ETL, a 
mechanism to put structure on the data, and the capability to query and 
analyze large data sets stored in Hadoop files. Hive defines a simple 
SQL-like query language, called QL, that enables users familiar with SQL to 
query the data. At the same time, this language also allows programmers who are 
familiar with the MapReduce framework to plug in their custom mappers 
and reducers to perform more sophisticated analysis that may not be supported 
by the built-in capabilities of the language.
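
A minimal sketch of plugging a custom mapper into a query, assuming a hypothetical script `my_mapper.py` and a hypothetical table `docs` with a string column `line` (neither name comes from the page itself):

```sql
-- my_mapper.py and docs are illustrative names, not part of Hive.
-- Ship the script to the cluster so every task can run it.
ADD FILE my_mapper.py;

-- Stream each row's "line" column through the script and
-- read its output back as two columns.
SELECT TRANSFORM(line)
       USING 'python my_mapper.py'
       AS (word, cnt)
FROM docs;
```

The script is expected to read tab-separated rows on standard input and write tab-separated rows to standard output.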
  
- Hive does not mandate read or written data be in "hive format" - there is no 
such thing; Hive works equally well on Thrift, control delimited, or your data 
format.  Please see File Format and 
[[http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook|SerDe]]
 in Developer Guide for details.
+ Hive does not mandate that the data it reads or writes be in a "Hive 
format"; there is no such thing. Hive works equally well on Thrift, control 
delimited, or your specialized data formats. Please see [[/DeveloperGuide#File_Formats|File 
Format]] and 
[[http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook|SerDe]]
 in the [[/DeveloperGuide|Developer Guide]] for details.
  
  = What Hive is NOT =
- Hive is based on Hadoop which is a batch processing system. Accordingly, this 
system does not and cannot promise low latencies on queries. The paradigm here 
is strictly of submitting jobs and being notified when the jobs are completed 
as opposed to real time queries. As a result it should not be compared with 
systems like Oracle where analysis is done on a significantly smaller amount of 
data but the analysis proceeds much more iteratively with the response times 
between iterations being less than a few minutes. For Hive queries response 
times for even the smallest jobs can be of the order of 5-10 minutes and for 
larger jobs this may even run into hours.
+ Hive is based on Hadoop, which is a batch processing system. As a result, 
Hive does not and cannot promise low latencies on queries. The paradigm here is 
strictly one of submitting jobs and being notified when the jobs are completed, 
as opposed to real-time queries. Systems such as Oracle run analysis on a 
significantly smaller amount of data, but the analysis proceeds much more 
iteratively, with response times between iterations of less than a few minutes. 
In contrast, Hive query response times for even the smallest jobs can be of the 
order of several minutes. If your input data is small you can execute a query 
in a shorter time. For example, if a table has 100 rows you can 
'set mapred.reduce.tasks=1' and 'set mapred.map.tasks=1' and the query time 
will be around a dozen seconds. However, larger jobs (e.g., jobs processing 
terabytes of data) may in general run for hours. 
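
As a rough sketch of the small-table case, the two settings can be issued in a Hive session before the query. The table name `tiny_table` is hypothetical, and the actual query time depends on the cluster:

```sql
-- Hypothetical session; tiny_table stands in for any table with about 100 rows.
-- Use a single reducer and request a single mapper for this session.
SET mapred.reduce.tasks=1;
SET mapred.map.tasks=1;

SELECT COUNT(1) FROM tiny_table;
```

Note that Hadoop treats `mapred.map.tasks` as a hint, so the number of map tasks actually launched may still be determined by the input splits.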
  
- If your input data is small you can execute a query in a short time. For 
example, if a table has 100 rows you can 'set mapred.reduce.tasks=1' and 'set 
mapred.map.tasks=1' and the query time will be ~15 seconds.
+ In summary, low-latency performance is not the top priority of Hive's design 
principles. What Hive values most are scalability (scaling out with more machines 
added dynamically to the Hadoop cluster), extensibility (with the MapReduce 
framework and UDF/UDAF/UDTF), fault tolerance, and loose coupling with its 
input formats.
  
  = Information =
   * General information about Hive
