Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by PrasadChakka:
http://wiki.apache.org/hadoop/Hive/GettingStarted

------------------------------------------------------------------------------

DISCLAIMER: This is a prototype version of Hive and is NOT production
quality. It is provided as-is, mainly as a way of illustrating the
capabilities of Hive. However, we are working hard to make Hive a
production quality system. Hive has only been tested on Unix (Linux) and
Mac systems using Java 1.6 for now, although it may very well work on
other similar platforms. It does not work on Cygwin right now. Most of
our testing has been on Hadoop 0.17, so we would advise running it
against this version of Hadoop, even though it may compile and work
against other versions.

'''Downloading and building'''
------------------------

You can either build the Hive jar files for your environment:
 - apply the patch
 - cd hadoop/src/contrib/hive
 - src/contrib/hive> ant -Dtarget.dir=<your-install-dir> package

Or you can use the pre-packaged jars that are available at the following
location and untar them:
http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz

$ wget http://mirror.facebook.com/facebook/hive/hadoop-0.17/dist.tar.gz
$ tar zxvf dist.tar.gz
$ cd dist
$ ls
README bin conf lib

bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)

In the rest of this document, we use dist and <install-dir> interchangeably.

'''Running Hive'''
------------

Hive uses Hadoop, which means:
 - you must have hadoop in your path, OR
 - export HADOOP=<hadoop-install-dir>/bin/hadoop

To use the Hive command line interface (CLI) from the shell:
$ bin/hive

'''Using Hive'''
----------

'''Configuration management overview'''
---------------------------------

 - Hive configuration is stored in <install-dir>/conf/hive-default.xml,
   and log4j configuration in hive-log4j.properties.
 - Hive configuration is an overlay on top of Hadoop, meaning the Hadoop
   configuration variables are inherited by default.
 - Hive configuration can be manipulated by:
   o editing hive-default.xml and defining any desired variables
     (including Hadoop variables) in it (see the example entry after the
     Error Logs section below)
   o using the set command from the CLI (see below)
   o invoking Hive using the syntax:
     * bin/hive -hiveconf x1=y1 -hiveconf x2=y2
     which sets the variables x1 and x2 to y1 and y2


'''Error Logs'''
----------
Hive uses log4j for logging. By default, logs are not emitted to the
console by the CLI. They are stored in the file:
 - /tmp/{user.name}/hive.log

If the user wishes, the logs can be emitted to the console by adding the
argument shown below:
 - bin/hive -hiveconf hive.root.logger=INFO,console

Note that setting hive.root.logger via the 'set' command does not change
logging properties, since they are determined at initialization time.

Error logs are very useful for debugging problems. Please send them with
any bugs (of which there are many!) to [EMAIL PROTECTED].
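As an illustration of the overlay described in the configuration section
above, a Hive or Hadoop variable can be pinned by adding a property entry
to conf/hive-default.xml using the usual Hadoop configuration syntax. The
variable and value below are just an example (mapred.reduce.tasks is a
Hadoop variable; pick a value that makes sense for your cluster):

<property>
  <!-- example only: number of reduce tasks used for Hive queries -->
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>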
'''DDL Operations'''
--------------

Creating Hive tables and browsing through them:

hive> CREATE TABLE pokes (foo INT, bar STRING);

Creates a table called pokes with two columns, the first being an
integer and the second a string.

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Creates a table called invites with two columns and a partition column
called ds. The partition column is a virtual column: it is not part of
the data itself but is derived from the partition that a particular
dataset is loaded into.

By default, tables are assumed to be in text input format and the
delimiters are assumed to be ^A (ctrl-a). We will soon publish additional
commands/recipes for adding binary (sequencefile) data, configurable
delimiters, etc.

hive> SHOW TABLES;

Lists all the tables.

hive> SHOW TABLES '.*s';

Lists all the tables that end with 's'. The pattern matching follows Java
regular expressions. See this link for documentation:
http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

hive> DESCRIBE invites;

Shows the list of columns.

Altering tables. The table name can be changed and additional columns can
be added:

hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables:
hive> DROP TABLE pokes;


'''Metadata Store'''
--------------

Metadata is in an embedded Derby database whose location is determined by
the Hive configuration variable named javax.jdo.option.ConnectionURL. By
default (see conf/hive-default.xml), this location is ./metastore_db.

Right now, in the default configuration, this metadata can only be seen
by one user at a time.

The metastore can be stored in any database that is supported by JPOX.
The location and the type of the RDBMS can be controlled by the two
variables 'javax.jdo.option.ConnectionURL' and
'javax.jdo.option.ConnectionDriverName'. Refer to the JDO (or JPOX)
documentation for more details on supported databases. The database
schema is defined in the JDO metadata annotations file package.jdo at
src/contrib/hive/metastore/src/model.

In the future, the metastore itself can be a standalone server.


'''DML Operations'''
--------------

Loading data from flat files into Hive:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Loads a file that contains two columns separated by ctrl-a into the pokes
table. 'local' signifies that the input file is on the local file system.
If 'local' is omitted, Hive looks for the file in HDFS (see the example
after the notes below).

The keyword 'overwrite' signifies that existing data in the table is
deleted. If the 'overwrite' keyword is omitted, data files are appended
to existing data sets.

NOTES:
 - NO verification of the data is performed against the schema.
 - If the file is in HDFS, it is moved into the Hive-controlled file
   system namespace. The root of the Hive directory is specified by the
   option hive.metastore.warehouse.dir in hive-default.xml. We would
   advise that this directory exist before trying to create tables via
   Hive.
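As a hypothetical illustration of the non-local case mentioned above, the
statement below loads a file that already sits in HDFS (the path is made
up; any HDFS file you can write to would do). Per the notes above, the
file is moved, not copied, under hive.metastore.warehouse.dir:

hive> LOAD DATA INPATH '/user/myname/kv1.txt' OVERWRITE INTO TABLE pokes;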
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of
the table invites. The table invites must have been created partitioned
by the key ds for this to succeed.

'''Loading/Extracting data using Queries'''
-------------------------------------

'''Runtime configuration'''
---------------------

 - Hive queries are executed as map-reduce jobs, and as such their
   behavior can be controlled by the Hadoop configuration variables.

 - The CLI can be used to set any Hadoop (or Hive) configuration
   variable. For example:
   o hive> SET mapred.job.tracker=myhost.mycompany.com:50030
   o hive> SET -v
   The latter shows all the current settings. Without the -v option, only
   the variables that differ from the base Hadoop configuration are
   displayed.
 - In particular, the number of reducers should be set to a reasonable
   number to get good performance (the default is 1!).


'''Example Queries'''
---------------

Some example queries are shown below. They are available in
examples/queries. More are available in the Hive contrib sources at
src/test/queries/positive.

SELECTS and FILTERS
-------------------

hive> SELECT a.foo FROM invites a;

Selects column 'foo' from all rows of the invites table. The results are
not stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a Hive table,
local directory or HDFS directory) is optional.

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a;

Selects all rows from the invites table into an HDFS directory. The
result data is in one or more files in that directory, depending on the
number of mappers.
NOTE: partition columns, if any, are selected by the use of *. They can
also be specified in the projection clauses.

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Selects all rows from the pokes table into a local directory.

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Sum of a column. avg, min and max can also be used.

NOTE: there are some flaws with the type system that cause doubles to be
returned where integer types would be expected. We expect to fix these in
the coming week.

GROUP BY
--------

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

NOTE: Currently Hive always uses a two-stage map-reduce for the group-by
operation. This is done to handle skew in the input data. We will be
optimizing this in the coming weeks.
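Building on the group-by above, the other aggregates mentioned earlier
(avg, min, max) can be used the same way. The query below is only a
sketch: the output directory is made up, and the doubles-vs-integers
caveat noted above applies to its results as well:

hive> INSERT OVERWRITE DIRECTORY '/tmp/avg_out' SELECT a.bar, avg(a.foo) FROM invites a GROUP BY a.bar;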
JOIN
----

hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

MULTITABLE INSERT
-----------------

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

STREAMING
---------

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like
Hadoop streaming). Similarly, streaming can be used on the reduce side
(please see the Hive Tutorial or the examples).
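The TRANSFORM output can also be written to a local directory instead of
a table, by combining the streaming form above with the INSERT OVERWRITE
LOCAL DIRECTORY form shown earlier. This is only a sketch and the output
path is made up:

hive> FROM invites a INSERT OVERWRITE LOCAL DIRECTORY '/tmp/stream_out' SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';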
