Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/GettingStarted" page has been changed by CarlSteinbach.
http://wiki.apache.org/hadoop/Hive/GettingStarted?action=diff&rev1=37&rev2=38

--------------------------------------------------
  examples/ (sample input and query files)
}}}

- In the rest of the page, we use build/dist and <install-dir> interchangeably.
+ In the rest of the page, we use `build/dist` and `<install-dir>` interchangeably.

=== Running Hive ===
Hive uses hadoop, which means:
 * you must have hadoop in your path OR
-  * export HADOOP_HOME=<hadoop-install-dir>
+  * `export HADOOP_HOME=<hadoop-install-dir>`

- In addition, you must create /tmp and /user/hive/warehouse
+ In addition, you must create `/tmp` and `/user/hive/warehouse`
- (aka hive.metastore.warehouse.dir) and set them chmod g+w in
+ (aka `hive.metastore.warehouse.dir`) and set them `chmod g+w` in
HDFS before a table can be created in Hive.

Commands to perform this setup
@@ -56, +56 @@
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
}}}

- I also find it useful but not necessary to set HIVE_HOME
+ I also find it useful but not necessary to set `HIVE_HOME`
{{{
  $ export HIVE_HOME=<hive-install-dir>
}}}

@@ -68, +68 @@
=== Configuration management overview ===
- - hive default configuration is stored in <install-dir>/conf/hive-default.xml
+ - hive default configuration is stored in `<install-dir>/conf/hive-default.xml`
- Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
+ Configuration variables can be changed by (re-)defining them in `<install-dir>/conf/hive-site.xml`
- - log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
+ - log4j configuration is stored in `<install-dir>/conf/hive-log4j.properties`
 - hive configuration is an overlay on top of hadoop - meaning the hadoop configuration variables are inherited by default.

@@ -78, +78 @@
 * editing hive-site.xml and defining any desired variables (including hadoop variables) in it
 * from the cli using the set command (see below)
 * by invoking hive using the syntax:
-   * $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2
+   * `$ bin/hive -hiveconf x1=y1 -hiveconf x2=y2`
   this sets the variables x1 and x2 to y1 and y2 respectively
-  * By setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2" which does the same as above
+  * By setting the `HIVE_OPTS` environment variable to "-hiveconf x1=y1 -hiveconf x2=y2", which does the same as above

=== Runtime configuration ===
@@ -94, +94 @@
  hive> SET -v
}}}

- The latter shows all the current settings. Without the -v option only the
+ The latter shows all the current settings. Without the `-v` option only the
variables that differ from the base hadoop configuration are displayed.

=== Hive, Map-Reduce and Local-Mode ===
@@ -110, +110 @@
{{{
  hive> SET mapred.job.tracker=local;
}}}

- In addition, mapred.local.dir should point to a path that's valid on the local machine (for example /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating local disk space).
+ In addition, `mapred.local.dir` should point to a path that's valid on the local machine (for example `/tmp/<username>/mapred/local`). (Otherwise, the user will get an exception allocating local disk space.)

Starting with v0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are:
{{{
@@ -118, +118 @@
}}}
Note that this feature is ''disabled'' by default.
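For example, it can be switched on for a session as follows (a sketch; `hive.exec.mode.local.auto` is the governing option, and the threshold settings it consults are listed below):
{{{
  hive> SET hive.exec.mode.local.auto=true;
}}}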
If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
-  * The total input size of the job is lower than: ''hive.exec.mode.local.auto.inputbytes.max'' (128MB by default)
+  * The total input size of the job is lower than: `hive.exec.mode.local.auto.inputbytes.max` (128MB by default)
-  * The total number of map-tasks is less than: ''hive.exec.mode.local.auto.tasks.max'' (4 by default)
+  * The total number of map-tasks is less than: `hive.exec.mode.local.auto.tasks.max` (4 by default)
 * The total number of reduce tasks required is 1 or 0.

So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.

- Note that there may be differences in the runtime environment of hadoop server nodes and the machine running the hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option ''hive.mapred.local.mem''. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.
+ Note that there may be differences in the runtime environment of hadoop server nodes and the machine running the hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option `hive.mapred.local.mem`. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.

=== Error Logs ===
Hive uses log4j for logging. By default logs are not emitted to the
- console by the CLI. The default logging level is WARN and the logs are stored in the folder:
+ console by the CLI. The default logging level is `WARN` and the logs are stored in the folder:
-  * /tmp/{user.name}/hive.log
+  * `/tmp/{user.name}/hive.log`
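For example, to watch this log from another shell while running queries (a sketch; it assumes the Java property `{user.name}` resolves to the same value as the shell's `$USER`):
{{{
  $ tail -f /tmp/$USER/hive.log
}}}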
If the user wishes, the logs can be emitted to the console by adding the arguments shown below:
-  * bin/hive -hiveconf hive.root.logger=INFO,console
+  * `bin/hive -hiveconf hive.root.logger=INFO,console`

Alternatively, the user can change the logging level only by using:
-  * bin/hive -hiveconf hive.root.logger=INFO,DRFA
+  * `bin/hive -hiveconf hive.root.logger=INFO,DRFA`

- Note that setting hive.root.logger via the 'set' command does not
+ Note that setting `hive.root.logger` via the 'set' command does not
change logging properties since they are determined at initialization time.

Logging during Hive execution on a Hadoop cluster is controlled by Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task, stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.

- When using local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself.
- Starting v-0.6 - Hive uses the hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/{user.name}. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on a NFS file server for example). Execution logs are invaluable for debugging run-time errors.
+ When using local mode (using `mapred.job.tracker=local`), Hadoop/Hive execution logs are produced on the client machine itself. Starting with v0.6, Hive uses `hive-exec-log4j.properties` (falling back to `hive-log4j.properties` only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under `/tmp/{user.name}`. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.

- Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to [email protected].
+ Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to `[email protected]`.

== DDL Operations ==
@@ -184, +184 @@
expressions. Check out this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html

+ {{{
hive> DESCRIBE invites;
+ }}}
shows the list of columns

@@ -205, +207 @@
=== Metadata Store ===
Metadata is in an embedded Derby database whose disk storage location is determined by the
- hive configuration variable named javax.jdo.option.ConnectionURL. By default
+ hive configuration variable named `javax.jdo.option.ConnectionURL`. By default
- (see conf/hive-default.xml), this location is ./metastore_db
+ (see `conf/hive-default.xml`), this location is `./metastore_db`.

Right now, in the default configuration, this metadata can only be seen by one user at a time.

The metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables
- 'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'.
+ `javax.jdo.option.ConnectionURL` and `javax.jdo.option.ConnectionDriverName`.
Refer to JDO (or JPOX) documentation for more details on supported databases.

- The database schema is defined in JDO metadata annotations file package.jdo
+ The database schema is defined in the JDO metadata annotations file `package.jdo`
- at src/contrib/hive/metastore/src/model.
+ at `src/contrib/hive/metastore/src/model`.

In the future, the metastore itself can be a standalone server.
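For illustration, the override mechanisms described in the configuration section apply to these two variables as well: they can be redefined in hive-site.xml, or supplied for a single session. A sketch pointing the metastore at MySQL (the host, port, and database name are made up; any JPOX-supported RDBMS works, and the matching JDBC driver jar must be on Hive's classpath):
{{{
  $ bin/hive -hiveconf javax.jdo.option.ConnectionURL=jdbc:mysql://dbhost:3306/hivemeta -hiveconf javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
}}}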
@@ -241, +243 @@
NOTES:
 * NO verification of data against the schema is performed by the load command.
 * If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
- The root of the Hive directory is specified by the option 'hive.metastore.warehouse.dir'
+ The root of the Hive directory is specified by the option `hive.metastore.warehouse.dir`
- in hive-default.xml. We advise users to create this directory before
+ in `hive-default.xml`. We advise users to create this directory before
trying to create tables via Hive.
{{{
@@ -263, +265 @@
== SQL Operations ==

=== Example Queries ===
- Some example queries are shown below. They are available in build/dist/examples/queries.
+ Some example queries are shown below. They are available in `build/dist/examples/queries`.
- More are available in the hive sources at ql/src/test/queries/positive
+ More are available in the hive sources at `ql/src/test/queries/positive`.

==== SELECTS and FILTERS ====
{{{
- hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';
+ hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
}}}
- selects column 'foo' from all rows of partition <DATE> of invites table. The results are not
+ selects column 'foo' from all rows of partition `ds=2008-08-15` of the `invites` table. The results are not
stored anywhere, but are displayed on the console.

- Note that in all the examples that follow, INSERT (into a hive table, local
+ Note that in all the examples that follow, `INSERT` (into a hive table, local
directory or HDFS directory) is optional.

{{{
- hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';
+ hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
}}}
- selects all rows from partition <DATE> OF invites table into an HDFS directory. The result data
+ selects all rows from partition `ds=2008-08-15` of the `invites` table into an HDFS directory. The result data
is in files (depending on the number of mappers) in that directory.

NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses.
- Partitioned tables must always have a partition selected in the WHERE clause of the statement.
+ Partitioned tables must always have a partition selected in the `WHERE` clause of the statement.
{{{
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
@@ -300, +302 @@
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
- hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='<DATE>';
+ hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
}}}
- Sum of a column. avg, min, max can also be used. Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use COUNT(1) in place of COUNT(*).
+ Sum of a column. avg, min, max can also be used. Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use `COUNT(1)` in place of `COUNT(*)`.

==== GROUP BY ====
@@ -314, +316 @@
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
}}}
- Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use COUNT(1) in place of COUNT(*).
+ Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use `COUNT(1)` in place of `COUNT(*)`.
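For instance, on such a pre-HIVE-287 release the query above would be written as:
{{{
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
}}}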
==== JOIN ====
@@ -337, +339 @@
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
}}}
- This streams the data in the map phase through the script /bin/cat (like hadoop streaming).
+ This streams the data in the map phase through the script `/bin/cat` (like hadoop streaming).
Similarly, streaming can be used on the reduce side (please see the Hive Tutorial or examples).

@@ -375, +377 @@
Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use COUNT(1) in place of COUNT(*).

- Now we can do some complex data analysis on the table u_data:
+ Now we can do some complex data analysis on the table `u_data`:
- Create weekday_mapper.py:
+ Create `weekday_mapper.py`:
{{{
import sys
import datetime
@@ -413, +415 @@
GROUP BY weekday;
}}}
- Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use COUNT(1) in place of COUNT(*).
+ Note that if you're using Hive 0.5.0 or earlier, you will need to use `COUNT(1)` in place of `COUNT(*)`.

=== Apache Weblog Data ===
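As a minimal sketch of how such logs can be mapped onto a table: the contrib `RegexSerDe` (class `org.apache.hadoop.hive.contrib.serde2.RegexSerDe`) turns each capture group of `input.regex` into one string column. The table name, columns, regex, and jar path below are illustrative assumptions, not the wiki page's own example:
{{{
hive> ADD JAR <path-to-hive-contrib-jar>;
hive> CREATE TABLE apachelog (
        host STRING, identity STRING, remote_user STRING, time STRING,
        request STRING, status STRING, size STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = "(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\S+) (\\S+)",
        "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
      )
      STORED AS TEXTFILE;
}}}
Queries can then treat each log field as an ordinary string column.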
