Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/GettingStarted" page has been changed by ZhengShao.
http://wiki.apache.org/hadoop/Hive/GettingStarted?action=diff&rev1=27&rev2=28

--------------------------------------------------

This streams the data in the map phase through the script /bin/cat (like Hadoop Streaming). Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).

+ == Example Use Cases ==
+
+ === MovieLens User Ratings ===
+ First, create a table with a tab-delimited text file format:
+ {{{
+ CREATE TABLE u_data (
+   userid INT,
+   movieid INT,
+   rating INT,
+   unixtime STRING)
+ ROW FORMAT DELIMITED
+ FIELDS TERMINATED BY '\t'
+ STORED AS TEXTFILE;
+ }}}
+
+ Then, download and extract the data files:
+ {{{
+ wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
+ tar xvzf ml-data.tar__0.gz
+ }}}
+
+ Next, load the data into the table that was just created:
+ {{{
+ LOAD DATA LOCAL INPATH 'ml-data/u.data'
+ OVERWRITE INTO TABLE u_data;
+ }}}
+
+ Count the number of rows in table u_data:
+ {{{
+ SELECT COUNT(1) FROM u_data;
+ }}}
+
+ Now we can do some more complex data analysis on the table u_data.
+
+ Create weekday_mapper.py:
+ {{{
+ import sys
+ import datetime
+
+ for line in sys.stdin:
+     line = line.strip()
+     userid, movieid, rating, unixtime = line.split('\t')
+     weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
+     print '\t'.join([userid, movieid, rating, str(weekday)])
+ }}}
+
+ Use the mapper script:
+ {{{
+ CREATE TABLE u_data_new (
+   userid INT,
+   movieid INT,
+   rating INT,
+   weekday INT)
+ ROW FORMAT DELIMITED
+ FIELDS TERMINATED BY '\t';
+
+ INSERT OVERWRITE TABLE u_data_new
+ SELECT
+   TRANSFORM (userid, movieid, rating, unixtime)
+   USING 'python weekday_mapper.py'
+   AS (userid, movieid, rating, weekday)
+ FROM u_data;
+
+ SELECT weekday, COUNT(1)
+ FROM u_data_new
+ GROUP BY weekday;
+ }}}
+
+ === Apache Weblog Data ===
+
+ The format of Apache weblogs is customizable, though most webmasters use the default.
+ For the default Apache weblog format, we can create a table with the following command.
+
+ More about !RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662
+
+ {{{
+ add jar ../build/contrib/hive_contrib.jar;
+
+ CREATE TABLE apachelog (
+   host STRING,
+   identity STRING,
+   user STRING,
+   time STRING,
+   request STRING,
+   status STRING,
+   size STRING,
+   referer STRING,
+   agent STRING)
+ ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
+ WITH SERDEPROPERTIES (
+   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
+   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
+ )
+ STORED AS TEXTFILE;
+ }}}

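The weekday_mapper.py script in the MovieLens section above uses a Python 2 `print` statement. A Python 3 sketch of the same per-row transform can be used to check the logic locally before wiring it into `TRANSFORM`; the sample row merely follows the tab-delimited u.data layout and is for illustration:

```python
import datetime

def map_line(line):
    # Replace the unixtime column with its ISO weekday (1=Monday .. 7=Sunday),
    # mirroring weekday_mapper.py. Note fromtimestamp() uses the local timezone,
    # so the exact weekday can differ between machines.
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])

# Illustrative row in the u.data layout: userid, movieid, rating, unixtime.
print(map_line('196\t242\t3\t881250949'))
```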