Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by RaghothamMurthy: http://wiki.apache.org/hadoop/Hive/GettingStarted

------------------------------------------------------------------------------

Our testing has been on Hadoop 0.17, so we advise running Hive against this version of Hadoop, even though it may compile and work against other versions.

== Downloading and building ==

You can either build the hive jar files for your environment:
 * apply the patch
 * cd hadoop/src/contrib/hive
 * src/contrib/hive> ant -Dtarget.dir=<your-install-dir> package

Or you can use the pre-packaged jars that are available at http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz and untar the archive:
 * $ wget http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz
 * $ tar zxvf dist.tar.gz
 * $ cd dist
 * $ ls
   * README bin conf lib
 * bin/ (all the shell scripts)
 * lib/ (required jar files)
 * conf/ (configuration files)

In the rest of the README, we use dist and <install-dir> interchangeably.

== Running Hive ==

Hive uses hadoop, which means:
 * you must have hadoop in your path, OR
 * export HADOOP=<hadoop-install-dir>/bin/hadoop

To use the hive command line interface (cli) from the shell:

''$ bin/hive''

== Using Hive ==

=== Configuration management overview ===

Hive configuration is stored in <install-dir>/conf/hive-default.xml, and the log4j configuration in hive-log4j.properties.

[...]

this sets the variables x1 and x2 to y1 and y2

=== Error Logs ===

Hive uses log4j for logging. By default, logs are not emitted to the console by the cli. They are stored in the file /tmp/{user.name}/hive.log

[...] any bugs (of which there are many!) to [EMAIL PROTECTED]

== DDL Operations ==

Creating Hive tables and browsing through them:

[...]

Altering tables. Table names can be changed and additional columns can be added:

hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables:

hive> DROP TABLE pokes;

== Metadata Store ==

Metadata is stored in an embedded Derby database whose location is determined by the hive configuration variable named javax.jdo.option.ConnectionURL. By default [...]

In the future, the metastore itself can be a standalone server.
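To relocate the embedded metastore, you can point that variable at a different Derby database path. A minimal sketch of such an override in <install-dir>/conf/hive-default.xml is shown below; the databaseName path is made up for illustration, and the exact default URL may differ in your build:

{{{
<!-- Illustrative override of the embedded Derby metastore location.     -->
<!-- /home/me/hive_metadata is an example path, not a Hive default.      -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/me/hive_metadata/metastore_db;create=true</value>
</property>
}}}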
== DML Operations ==

Loading data from flat files into Hive:

[...]

If the 'overwrite' keyword is omitted, data files are appended to existing data sets.

NOTES:
 * NO verification of the data against the schema is performed
 * if the file is in hdfs, it is moved into the hive-controlled file system namespace. The root of the hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise that this directory exist before you try to create tables via Hive.

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.

== Loading/Extracting data using Queries ==

=== Runtime configuration ===

 * Hive queries are executed as map-reduce jobs, and as such their behavior can be controlled by the hadoop configuration variables.
 * The cli can be used to set any hadoop (or hive) configuration variable. For example:
   * hive> SET mapred.job.tracker=myhost.mycompany.com:50030
   * hive> SET -v
 The latter shows all the current settings. Without the -v option, only the variables that differ from the base hadoop configuration are displayed.
 * In particular, the number of reducers should be set to a reasonable number to get good performance (the default is 1!).

=== Example Queries ===

Some example queries are shown below. They are available in examples/queries. More are available in the hive contrib sources under src/test/queries/positive.

==== SELECTS and FILTERS ====

hive> SELECT a.foo FROM invites a;

select column 'foo' from all rows of the invites table. The results are not [...]

Select all rows from the pokes table into a local directory:

[...]

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Sum of a column. avg, min and max can also be used.

[...]

NOTE: there are some flaws with the type system that cause doubles to be returned where integer types would be expected. We expect to fix these in the coming week.

==== GROUP BY ====

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

NOTE: Currently Hive always uses a two-stage map-reduce for the groupby operation. This is to handle skews in the input data. We will be optimizing this in the coming weeks.
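As an illustrative sketch (not from the original page), several aggregates can be computed in the same GROUP BY query, assuming the target table has a matching three-column schema:

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1), sum(a.foo) WHERE a.foo > 0 GROUP BY a.bar;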
==== JOIN ====

hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

==== MULTITABLE INSERT ====

FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

==== STREAMING ====

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like hadoop streaming).
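The same TRANSFORM shape works with any executable present on the cluster nodes. As a sketch (the tr filter and the output directory are assumptions for illustration, not part of the original page), this upper-cases both tab-separated columns while streaming:

hive> FROM invites a INSERT OVERWRITE LOCAL DIRECTORY '/tmp/upper' SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/usr/bin/tr a-z A-Z' WHERE a.ds > '2008-08-09';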
