[Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang

Apache Wiki Wed, 23 Dec 2009 22:17:29 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive/Tutorial" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=18&rev2=19

--------------------------------------------------

  
  ''' Note that any data for this table or partitions will be dropped and may 
not be recoverable. '''
  
- ---- /!\ '''Edit conflict - your version:''' ----
- 
- == Language capabilities ==
- '''Hive query language provides the basic SQL like operations. These 
operations work on tables or partitions. These operations are: '''
- 
-  * '''Ability to filter rows from a table using a where clause. '''
-  * '''Ability to select certain columns from the table using a select clause. 
'''
-  * '''Ability to do equi-joins between two tables. '''
-  * '''Ability to evaluate aggregations on multiple "group by" columns for the 
data stored in a table. '''
-  * '''Ability to store the results of a query into another table. '''
-  * '''Ability to download the contents of a table to a local (e.g., nfs) 
directory. '''
-  * '''Ability to store the results of a query in a hadoop dfs directory. '''
-  * '''Ability to manage tables and partitions (create, drop and alter). '''
-  * '''Ability to plug in custom scripts in the language of choice for custom 
map/reduce jobs. '''
- 
- = Usage and Examples =
- '''The following examples highlight some salient features of the system. A 
detailed set of query test cases can be found at 
[[http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/test/queries/clientpositive/|Hive
 Query Test Cases]] and the corresponding results can be found at 
[[http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/test/results/clientpositive/|Query
 Test Case Results]] '''
- 
- == Creating Tables ==
- '''An example statement that would create the page_view table mentioned above 
would be like: '''
- 
- {{{
- }}}
- '''In this example the columns of the table are specified with the 
corresponding types. Comments can be attached both at the column level as well 
as at the table level. Additionally the partitioned by clause defines the 
partitioning columns which are different from the data columns and are actually 
not stored with the data. When specified in this way, the data in the files is 
assumed to be delimited with ascii 001(ctrl-A) used as the field delimiter and 
newline used as a row delimiter. '''
- 
- '''These delimiters can be parametrized if the data is not in the above 
format as illustrated in the following example: '''
- 
- {{{
- }}}
- '''The ROW FORMAT clause allows the user to specify both the field delimiters 
as well as the line delimiters. '''
- 
- '''It is also a good idea to bucket the tables on certain columns so that 
efficient sampling queries can be executed against the data set (note: If 
bucketing is absent, random sampling can still be done on the table). The 
following example illustrates the case of the page_view table which is bucketed 
on userid column: '''
- 
- {{{
- }}}
- '''In the example above, the table is bucketed (clustered by) userid and 
within each bucket the data is sorted in the increasing order of viewTime. Such 
an organization allows the user to do efficient sampling on the clustered 
column - in this case userid. The sorting property allows internal operators to 
take advantage of the better-known data structure while evaluating queries, 
also increasing efficiency. '''
- 
- {{{
- }}}
- '''In this example the columns that comprise of the table row are specified 
in a similar way as the definition of types. Comments can be attached both at 
the column level as well as at the table level. Additionally the partitioned by 
clause defines the partitioning columns which are different from the data 
columns and are actually not stored with the data. The bucketed on clause 
specifies which column to use for bucketing as well as how many buckets to 
create. The delimited row format specifies how the rows are stored in the hive 
table. In the case of the delimited format, this specifies how the fields are 
terminated, how the items within collections (arrays or maps) are terminated 
and how the map keys are terminated. STORED AS SEQUENCEFILE indicates that this 
data is stored in a binary format (using hadoop SequenceFiles) on hdfs. The 
values shown for the ROW FORMAT and STORED AS clauses in the above example 
represent the system defaults. '''
- 
- '''Table names and column names are case insensitive. '''
- 
- == Browsing Tables and Partitions ==
- {{{
- }}}
- '''To list existing tables in the warehouse; there are many of these, likely 
more than you want to browse. '''
- 
- {{{
- }}}
- '''To list tables with prefix 'page'. The pattern follows Java regular 
expression syntax (so the period is a wildcard). '''
- 
- {{{
- }}}
- '''To list partitions of a table. If the table is not a partitioned table 
then an error is thrown. '''
- 
- {{{
- }}}
- '''To list columns and column types of table. '''
- 
- {{{
- }}}
- '''To list columns and all other properties of table. This prints lot of 
information and that too not in a pretty format. Usually used for debugging. '''
- 
- {{{
- }}}
- '''To list columns and all other properties of a partition. This also prints 
lot of information which is usually used for debugging. '''
- 
- == Loading Data ==
- '''There are multiple mechanisms of loading data into Hive tables. The user 
can create an external table that points to a specified location within hdfs. 
In this particular usage, the user can copy a file into the specified location 
using the hdfs put or copy commands and create a table pointing to this 
location with all the relevant row format information. Once this is done, the 
user can transform this data and insert into any other Hive table. e.g. if the 
file /tmp/pv_2008-06-08.txt contains comma separated page views served on 
2008-06-08, and this needs to be loaded into the page_view table in the 
appropriate partition, the following sequence of commands can achieve this: '''
- 
- {{{
- }}}
- '''In the example above nulls are inserted for the array and map types in the 
destination tables but potentially these can also come from the external table 
if the proper row formats are specified. '''
- 
- '''This method is useful if there is already legacy data in hdfs on which the 
user wants to put some metadata so that that the data can be queried and 
manipulated using hive. '''
- 
- '''Additionally, the system also supports syntax that can load the data from 
a file in the local files system directly into a hive table where the input 
data format is same as the table format. If /tmp/pv_2008-06-08_us.txt already 
contains the data for US, then we do not need any additional filtering as shown 
in the previous example. The load in this case can be done using the following 
syntax: '''
- 
- {{{
- }}}
- '''The path argument can take a directory (in which case all the files in the 
directory are loaded), a single file name, or a wildcard (in which case all the 
matching files are uploaded). If the argument is a directory - it cannot 
contain subdirectories. Similarly - the wildcard must match file names only. '''
- 
- '''In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, 
the user may decide to do a parallel load of the data (using tools that are 
external to Hive). Once the file is in HDFS - the following syntax can be used 
to load the data into a Hive table: '''
- 
- {{{
- }}}
- '''It is assumed that the array and map fields in the input.txt files are 
null fields for these examples. '''
- 
- == Simple Query ==
- '''For all the active users, one can use the query of the following form: '''
- 
- {{{
- }}}
- '''Note that unlike SQL, we always insert the results into a table. We will 
illustrate later how the user can inspect these results and even dump them to a 
local file. '''
- 
- == Partition Based Query ==
- '''What partitions to use in a query is determined automatically by the 
system on the basis of where clause conditions on partition columns. e.g. in 
order to get all the page_views in the month of 03/2008 referred from domain 
xyz.com, one could write the following query: '''
- 
- {{{
- }}}
- '''(Note that page_views.date is used here because the table (above) was 
defined with PARTITIONED BY(date DATETIME, country STRING) ; if you name your 
partition something different, don't expect .date to do what you think!) '''
- 
- == Joins ==
- '''In order to get a demographic breakdown (by gender) of page_view of 
2008-03-03 one would need to join the page_view table and the user table on the 
userid column. This can be accomplished with a join as shown in the following 
query: '''
- 
- {{{
- }}}
- '''In order to do outer joins the user can qualify the join with LEFT OUTER, 
RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join 
(left preserved, right preserved or both sides preserved). e.g. in order to do 
a full outer join in the query above, the corresponding syntax would look like 
the following query: '''
- 
- {{{
- }}}
- '''In order to join more than one tables, the user can use the following 
syntax: '''
- 
- {{{
- }}}
- '''Note that Hive only supports 
[[http://en.wikipedia.org/wiki/Join_(SQL)#Equi-join|equi-joins]]. Note also 
that it is best to put the largest table on the rightmost side of the join in 
order to avoid memory errors. '''
- 
- == Aggregations ==
- '''In order to count the number of distinct users by gender one could write 
the following query: '''
- 
- {{{
- }}}
- '''Multiple aggregations can be done at the same time, however, no two 
aggregations can have different DISTINCT columns .e.g while the following is 
possible '''
- 
- {{{
- }}}
- '''however, the following query is not allowed '''
- 
- {{{
- }}}
- == Multi Table/File Inserts ==
- '''The output of the aggregations or simple selects can be further sent into 
multiple tables or even to hadoop dfs files (which can then be manipulated 
using hdfs utilities). e.g. if along with the gender breakdown, one needed to 
find the breakdown of unique page views by age, one could accomplish that with 
the following query: '''
- 
- {{{
- }}}
- '''The first insert clause sends the results of the first group by to a Hive 
table while the second one sends the results to a hadoop dfs files. '''
- 
- == Inserting into local files ==
- '''In certain situations you would want to write the output into a local file 
so that you could load it into an excel spreadsheet. This can be accomplished 
with the following command: '''
- 
- {{{
- }}}
- == Sampling ==
- '''The sampling clause allows the users to write queries for samples of the 
data instead of the whole table. Currently the sampling is done on the columns 
that are specified in the BUCKETED ON clause of the CREATE TABLE statement. In 
the following example we choose 3rd bucket out of the 32 buckets of the 
pv_gender_sum table: '''
- 
- {{{
- }}}
- '''In general the TABLESAMPLE syntax looks like: '''
- 
- {{{
- }}}
- '''y has to be a multiple or divisor of the number of buckets in that table 
as specified at the table creation time. The buckets chosen are determined if 
bucket_number module y is equal to x. So in the above example the following 
tablesample clause '''
- 
- {{{
- }}}
- '''would pick out the 3rd and 19th buckets. The buckets are numbered starting 
from 0. '''
- 
- '''On the other hand the tablesample clause '''
- 
- {{{
- }}}
- '''would pick out half of the 3rd bucket. '''
- 
- == Union all ==
- '''The language also supports union all, e.g. if we suppose there are two 
different tables that track which user has published a video and which user has 
published a comment, the following query joins the results of a union all with 
the user table to create a single annotated stream for all the video publishing 
and comment publishing events: '''
- 
- {{{
- }}}
- == Array Operations ==
- '''Array columns in tables can only be created programmatically currently. We 
will be extending this soon to be available as part of the create table 
statement. For the purpose of the current example assume that pv.friends is of 
the type array<INT> i.e. it is an array of integers.The user can get a specific 
element in the array by its index as shown in the following command: '''
- 
- {{{
- }}}
- '''The select expressions gets the third item in the pv.friends array. '''
- 
- '''The user can also get the length of the array using the size function as 
shown below: '''
- 
- {{{
- }}}
- == Map(Associative Arrays) Operations ==
- '''Maps provide collections similar to associative arrays. Such structures 
can only be created programmatically currently. We will be extending this soon. 
For the purpose of the current example assume that pv.properties is of the type 
map<String, String> i.e. it is an associative array from strings to string. 
Accordingly, the following query: '''
- 
- {{{
- }}}
- '''can be used to select the 'page_type' property from the page_views table. 
'''
- 
- '''Similar to arrays, the size function can also be used to get the number of 
elements in a map as shown in the following query: '''
- 
- {{{
- }}}
- == Custom map/reduce scripts ==
- '''Users can also plug in their own custom mappers and reducers in the data 
stream by using features natively supported in the Hive language. e.g. in order 
to run a custom mapper script - map_script - and a custom reducer script - 
reduce_script - the user can issue the following command which uses the 
TRANSFORM clause to embed the mapper and the reducer scripts. '''
- 
- '''Note that columns will be transformed to string and delimited by TAB 
before feeding to the user script, and the standard output of the user script 
will be treated as TAB-separated string columns. User scripts can output debug 
information to standard error which will be shown on the task detail page on 
hadoop. '''
- 
- {{{
- }}}
- '''Sample map script (weekday_mapper.py ) '''
- 
- {{{
- }}}
- '''Of course, both MAP and REDUCE are "syntactic sugar" for the more general 
select transform. The inner query could also have been written as such: '''
- 
- {{{
- }}}
- '''Schema-less map/reduce: If there is no "AS" clause after "USING 
map_script", Hive assumes the output of the script contains 2 parts: key which 
is before the first tab, and value which is the rest after the first tab. Note 
that this is different from specifying "AS key, value" because in that case 
value will only contains the portion between the first tab and the second tab 
if there are multiple tabs. '''
- 
- '''In this way, we allow users to migrate old map/reduce scripts without 
knowing the schema of the map output. User still needs to know the reduce 
output schema because that has to match what is in the table that we are 
inserting to. '''
- 
- {{{
- }}}
- '''Distribute By and Sort By: Instead of specifying "cluster by", the user 
can specify "distribute by" and "sort by", so the partition columns and sort 
columns can be different. The usual case is that the partition columns are a 
prefix of sort columns, but that is not required. '''
- 
- {{{
- }}}
- == Co groups ==
- '''Amongst the user community using map/reduce, cogroup is a fairly common 
operation wherein the data from multiple tables are sent to a custom reducer 
such that the rows are grouped by the values of certain columns on the tables. 
With the UNION ALL operator and the CLUSTER BY specification, this can be 
achieved in the Hive query language in the following way. Suppose we wanted to 
cogroup the rows from the actions_video and action_comments table on the uid 
column and send them to the 'reduce_script' custom reducer, the following 
syntax can be used by the user: '''
- 
- {{{
- }}}
- == Altering Tables ==
- '''To rename existing table to a new name. If a table with new name already 
exists then an error is returned: '''
- 
- {{{
- }}}
- '''To rename the columns of an existing table. Be sure to use the same column 
types, and to include an entry for each preexisting column: '''
- 
- {{{
- }}}
- '''To add columns to an existing table: '''
- 
- {{{
- }}}
- '''Note that a change in the schema (such as the adding of the columns), 
preserves the schema for the old partitions of the table in case it is a 
partitioned table. All the queries that access these columns and run over the 
old partitions implicitly return a null value or the specified default values 
for these columns. '''
- 
- '''In the later versions we can make the behavior of assuming certain values 
as opposed to throwing an error in case the column is not found in a particular 
partition configurable. '''
- 
- == Dropping Tables and Partitions ==
- '''Dropping tables is fairly trivial. A drop on the table would implicitly 
drop any indexes(this is a future feature) that would have been built on the 
table. The associated command is '''
- 
- {{{
- }}}
- '''To dropping a partition. Alter the table to drop the partition. '''
- 
- {{{
- }}}
- Note that any data for this table or partitions will be dropped and may not 
be recoverable. ''' '''
- 
- ---- /!\ '''End of edit conflict''' ----
-

[Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang

Reply via email to