Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/GettingStarted" page has been changed by CarlSteinbach.
http://wiki.apache.org/hadoop/Hive/GettingStarted?action=diff&rev1=37&rev2=38

--------------------------------------------------
  examples/ (sample input and query files)
}}}

- In the rest of the page, we use build/dist and <install-dir> interchangeably.
+ In the rest of the page, we use `build/dist` and `<install-dir>` interchangeably.

=== Running Hive ===
Hive uses hadoop, which means:
 * you must have hadoop in your path OR
-  * export HADOOP_HOME=<hadoop-install-dir>
+  * `export HADOOP_HOME=<hadoop-install-dir>`

- In addition, you must create /tmp and /user/hive/warehouse
+ In addition, you must create `/tmp` and `/user/hive/warehouse`
- (aka hive.metastore.warehouse.dir) and set them chmod g+w in
+ (aka `hive.metastore.warehouse.dir`) and set them `chmod g+w` in
HDFS before a table can be created in Hive.

Commands to perform this setup
@@ -56, +56 @@
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
}}}

- I also find it useful but not necessary to set HIVE_HOME
+ I also find it useful but not necessary to set `HIVE_HOME`
{{{
  $ export HIVE_HOME=<hive-install-dir>
}}}

@@ -68, +68 @@
=== Configuration management overview ===
- - hive default configuration is stored in <install-dir>/conf/hive-default.xml
+ - hive default configuration is stored in `<install-dir>/conf/hive-default.xml`
- Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
+ Configuration variables can be changed by (re-)defining them in `<install-dir>/conf/hive-site.xml`
- - log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
+ - log4j configuration is stored in `<install-dir>/conf/hive-log4j.properties`
 - hive configuration is an overlay on top of hadoop - meaning the hadoop configuration variables are inherited by default.

@@ -78, +78 @@
 * editing hive-site.xml and defining any desired variables (including hadoop variables) in it
 * from the cli using the set command (see below)
 * by invoking hive using the syntax:
-   * $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2
+   * `$ bin/hive -hiveconf x1=y1 -hiveconf x2=y2`
   this sets the variables x1 and x2 to y1 and y2 respectively
-  * By setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2" which does the same as above
+  * By setting the `HIVE_OPTS` environment variable to "-hiveconf x1=y1 -hiveconf x2=y2", which does the same as above

=== Runtime configuration ===
@@ -94, +94 @@
  hive> SET -v
}}}

- The latter shows all the current settings. Without the -v option only the
+ The latter shows all the current settings. Without the `-v` option only the
variables that differ from the base hadoop configuration are displayed.

=== Hive, Map-Reduce and Local-Mode ===
@@ -110, +110 @@
{{{
  hive> SET mapred.job.tracker=local;
}}}

- In addition, mapred.local.dir should point to a path that's valid on the local machine (for example /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating local disk space).
+ In addition, `mapred.local.dir` should point to a path that's valid on the local machine (for example `/tmp/<username>/mapred/local`). (Otherwise, the user will get an exception allocating local disk space.)

Starting with v0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are:
{{{
@@ -118, +118 @@
}}}
Note that this feature is ''disabled'' by default.
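For example, it can be switched on for a session as follows (a sketch; `hive.exec.mode.local.auto` is the governing option, and the threshold settings it consults are listed below):
{{{
  hive> SET hive.exec.mode.local.auto=true;
}}}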
If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
-  * The total input size of the job is lower than: ''hive.exec.mode.local.auto.inputbytes.max'' (128MB by default)
+  * The total input size of the job is lower than: `hive.exec.mode.local.auto.inputbytes.max` (128MB by default)
-  * The total number of map-tasks is less than: ''hive.exec.mode.local.auto.tasks.max'' (4 by default)
+  * The total number of map-tasks is less than: `hive.exec.mode.local.auto.tasks.max` (4 by default)
 * The total number of reduce tasks required is 1 or 0.

So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.

- Note that there may be differences in the runtime environment of hadoop server nodes and the machine running the hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option ''hive.mapred.local.mem''. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.
+ Note that there may be differences in the runtime environment of hadoop server nodes and the machine running the hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option `hive.mapred.local.mem`. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.

=== Error Logs ===
Hive uses log4j for logging. By default logs are not emitted to the
- console by the CLI. The default logging level is WARN and the logs are stored in the folder:
+ console by the CLI. The default logging level is `WARN` and the logs are stored in the folder:
-  * /tmp/{user.name}/hive.log
+  * `/tmp/{user.name}/hive.log`
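For example, to watch this log from another shell while running queries (a sketch; it assumes the Java property `{user.name}` resolves to the same value as the shell's `$USER`):
{{{
  $ tail -f /tmp/$USER/hive.log
}}}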
If the user wishes, the logs can be emitted to the console by adding the arguments shown below:
-  * bin/hive -hiveconf hive.root.logger=INFO,console
+  * `bin/hive -hiveconf hive.root.logger=INFO,console`

Alternatively, the user can change the logging level only by using:
-  * bin/hive -hiveconf hive.root.logger=INFO,DRFA
+  * `bin/hive -hiveconf hive.root.logger=INFO,DRFA`

- Note that setting hive.root.logger via the 'set' command does not
+ Note that setting `hive.root.logger` via the 'set' command does not
change logging properties since they are determined at initialization time.

Logging during Hive execution on a Hadoop cluster is controlled by Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task, stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.

- When using local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself.
- Starting v-0.6 - Hive uses the hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/{user.name}. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on a NFS file server for example). Execution logs are invaluable for debugging run-time errors.
+ When using local mode (using `mapred.job.tracker=local`), Hadoop/Hive execution logs are produced on the client machine itself. Starting with v0.6, Hive uses `hive-exec-log4j.properties` (falling back to `hive-log4j.properties` only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under `/tmp/{user.name}`. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.

- Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to [email protected].
+ Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to `[email protected]`.

== DDL Operations ==
@@ -184, +184 @@
expressions. Check out this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

+ {{{
hive> DESCRIBE invites;
+ }}}
shows the list of columns

@@ -205, +207 @@
=== Metadata Store ===
Metadata is in an embedded Derby database whose disk storage location is determined by the
- hive configuration variable named javax.jdo.option.ConnectionURL. By default
+ hive configuration variable named `javax.jdo.option.ConnectionURL`. By default
- (see conf/hive-default.xml), this location is ./metastore_db
+ (see `conf/hive-default.xml`), this location is `./metastore_db`.

Right now, in the default configuration, this metadata can only be seen by one user at a time.

The metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables
- 'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'.
+ `javax.jdo.option.ConnectionURL` and `javax.jdo.option.ConnectionDriverName`.
Refer to JDO (or JPOX) documentation for more details on supported databases.

- The database schema is defined in JDO metadata annotations file package.jdo
+ The database schema is defined in the JDO metadata annotations file `package.jdo`
- at src/contrib/hive/metastore/src/model.
+ at `src/contrib/hive/metastore/src/model`.

In the future, the metastore itself can be a standalone server.
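For illustration, the override mechanisms described in the configuration section apply to these two variables as well: they can be redefined in hive-site.xml, or supplied for a single session. A sketch pointing the metastore at MySQL (the host, port, and database name are made up; any JPOX-supported RDBMS works, and the matching JDBC driver jar must be on Hive's classpath):
{{{
  $ bin/hive -hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://dbhost:3306/hivemeta -hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
}}}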
@@ -241, +243 @@
NOTES:
 * NO verification of data against the schema is performed by the load command.
 * If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
- The root of the Hive directory is specified by the option 'hive.metastore.warehouse.dir'
+ The root of the Hive directory is specified by the option `hive.metastore.warehouse.dir`
- in hive-default.xml. We advise users to create this directory before
+ in `hive-default.xml`. We advise users to create this directory before
trying to create tables via Hive.
{{{
@@ -263, +265 @@
== SQL Operations ==

=== Example Queries ===
- Some example queries are shown below. They are available in build/dist/examples/queries.
+ Some example queries are shown below. They are available in `build/dist/examples/queries`.
- More are available in the hive sources at ql/src/test/queries/positive
+ More are available in the hive sources at `ql/src/test/queries/positive`.

==== SELECTS and FILTERS ====
{{{
- hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';
+ hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
}}}
- selects column 'foo' from all rows of partition <DATE> of invites table. The results are not
+ selects column 'foo' from all rows of partition `ds=2008-08-15` of the `invites` table. The results are not
stored anywhere, but are displayed on the console.

- Note that in all the examples that follow, INSERT (into a hive table, local
+ Note that in all the examples that follow, `INSERT` (into a hive table, local
directory or HDFS directory) is optional.

{{{
- hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';
+ hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
}}}
- selects all rows from partition <DATE> OF invites table into an HDFS directory. The result data
+ selects all rows from partition `ds=2008-08-15` of the `invites` table into an HDFS directory. The result data
is in files (depending on the number of mappers) in that directory.

NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses.
- Partitioned tables must always have a partition selected in the WHERE clause of the statement.
+ Partitioned tables must always have a partition selected in the `WHERE` clause of the statement.
{{{
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
@@ -300, +302 @@
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
- hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='<DATE>';
+ hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
}}}
- Sum of a column. avg, min, max can also be used. Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use COUNT(1) in place of COUNT(*).
+ Sum of a column. avg, min, max can also be used. Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use `COUNT(1)` in place of `COUNT(*)`.

==== GROUP BY ====
@@ -314, +316 @@
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
}}}
- Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use COUNT(1) in place of COUNT(*).
+ Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use `COUNT(1)` in place of `COUNT(*)`.
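For instance, on such a pre-HIVE-287 release the query above would be written as:
{{{
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
}}}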
==== JOIN ====
@@ -337, +339 @@
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
}}}
- This streams the data in the map phase through the script /bin/cat (like hadoop streaming).
+ This streams the data in the map phase through the script `/bin/cat` (like hadoop streaming).
Similarly, streaming can be used on the reduce side (please see the Hive Tutorial or examples).

@@ -375, +377 @@
Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use COUNT(1) in place of COUNT(*).

- Now we can do some complex data analysis on the table u_data:
+ Now we can do some complex data analysis on the table `u_data`:
- Create weekday_mapper.py:
+ Create `weekday_mapper.py`:
{{{
import sys
import datetime
@@ -413, +415 @@
GROUP BY weekday;
}}}
- Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use COUNT(1) in place of COUNT(*).
+ Note that if you're using Hive 0.5.0 or earlier, you will need to use `COUNT(1)` in place of `COUNT(*)`.

=== Apache Weblog Data ===
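As a minimal sketch of how such logs can be mapped onto a table: the contrib `RegexSerDe` (class `org.apache.hadoop.hive.contrib.serde2.RegexSerDe`) turns each capture group of `input.regex` into one string column. The table name, columns, regex, and jar path below are illustrative assumptions, not the wiki page's own example:
{{{
hive> ADD JAR <path-to-hive-contrib-jar>;
hive> CREATE TABLE apachelog (
        host STRING, identity STRING, remote_user STRING, time STRING,
        request STRING, status STRING, size STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = "(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\S+) (\\S+)",
        "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
      )
      STORED AS TEXTFILE;
}}}
Queries can then treat each log field as an ordinary string column.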
