Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/GettingStarted" page has been changed by ZhengShao.
http://wiki.apache.org/hadoop/Hive/GettingStarted?action=diff&rev1=27&rev2=28

--------------------------------------------------

This streams the data in the map phase through the script /bin/cat (like Hadoop Streaming). Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).

+ == Example Use Cases ==
+
+ === MovieLens User Ratings ===
+ First, create a table with a tab-delimited text file format:
+ {{{
+ CREATE TABLE u_data (
+   userid INT,
+   movieid INT,
+   rating INT,
+   unixtime STRING)
+ ROW FORMAT DELIMITED
+ FIELDS TERMINATED BY '\t'
+ STORED AS TEXTFILE;
+ }}}
+
+ Then, download and extract the data files:
+ {{{
+ wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
+ tar xvzf ml-data.tar__0.gz
+ }}}
+
+ Next, load the data into the table that was just created:
+ {{{
+ LOAD DATA LOCAL INPATH 'ml-data/u.data'
+ OVERWRITE INTO TABLE u_data;
+ }}}
+
+ Count the number of rows in table u_data:
+ {{{
+ SELECT COUNT(1) FROM u_data;
+ }}}
+
+ Now we can do some more complex data analysis on the table u_data.
+
+ Create weekday_mapper.py:
+ {{{
+ import sys
+ import datetime
+
+ for line in sys.stdin:
+     line = line.strip()
+     userid, movieid, rating, unixtime = line.split('\t')
+     weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
+     print '\t'.join([userid, movieid, rating, str(weekday)])
+ }}}
+
+ Use the mapper script:
+ {{{
+ CREATE TABLE u_data_new (
+   userid INT,
+   movieid INT,
+   rating INT,
+   weekday INT)
+ ROW FORMAT DELIMITED
+ FIELDS TERMINATED BY '\t';
+
+ INSERT OVERWRITE TABLE u_data_new
+ SELECT
+   TRANSFORM (userid, movieid, rating, unixtime)
+   USING 'python weekday_mapper.py'
+   AS (userid, movieid, rating, weekday)
+ FROM u_data;
+
+ SELECT weekday, COUNT(1)
+ FROM u_data_new
+ GROUP BY weekday;
+ }}}
+
+ === Apache Weblog Data ===
+
+ The format of Apache weblogs is customizable, though most webmasters use the default.
+ For the default Apache weblog format, we can create a table with the following command.
+
+ More about !RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662
+
+ {{{
+ add jar ../build/contrib/hive_contrib.jar;
+
+ CREATE TABLE apachelog (
+   host STRING,
+   identity STRING,
+   user STRING,
+   time STRING,
+   request STRING,
+   status STRING,
+   size STRING,
+   referer STRING,
+   agent STRING)
+ ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
+ WITH SERDEPROPERTIES (
+   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
+   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
+ )
+ STORED AS TEXTFILE;
+ }}}

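The weekday_mapper.py script in the MovieLens section above uses a Python 2 `print` statement. A Python 3 sketch of the same per-row transform can be used to check the logic locally before wiring it into `TRANSFORM`; the sample row merely follows the tab-delimited u.data layout and is for illustration:

```python
import datetime

def map_line(line):
    # Replace the unixtime column with its ISO weekday (1=Monday .. 7=Sunday),
    # mirroring weekday_mapper.py. Note fromtimestamp() uses the local timezone,
    # so the exact weekday can differ between machines.
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])

# Illustrative row in the u.data layout: userid, movieid, rating, unixtime.
print(map_line('196\t242\t3\t881250949'))
```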