[Hadoop Wiki] Trivial Update of "Hive/GettingStarted" by RodrigoSchmidt

Apache Wiki Wed, 21 Jan 2009 17:42:29 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by RodrigoSchmidt:
http://wiki.apache.org/hadoop/Hive/GettingStarted

------------------------------------------------------------------------------
  == Loading/Extracting data using Queries ==
  === Runtime configuration ===
  
-   * Hives queries are executed using map-reduce queries and as such the 
behavior 
+   * Hive queries are executed using map-reduce queries and, therefore, the 
behavior 
-   of such queries can be controlled by the hadoop configuration variables
+   of such queries can be controlled by the hadoop configuration variables.
  
-   * The cli can be used to set any hadoop (or hive) configuration variable. 
For example:
+   * The cli command 'SET' can be used to set any hadoop (or hive) 
configuration variable. For example:
      o hive> SET mapred.job.tracker=myhost.mycompany.com:50030
  
-     o hive> SET - v 
+     o hive> SET -v 
-   The latter shows all the current settings. Without the v option only the 
+   The latter shows all the current settings. Without the -v option only the 
    variables that differ from the base hadoop configuration are displayed
-   * In particular the number of reducers should be set to a reasonable number 
+   * In particular, the number of reducers should be set to a reasonable 
number 
    to get good performance (the default is 1!)
  
  
  === Example Queries ===
  
- Some example queries are shown below. They are available in examples/queries.
+ Some example queries are shown below. They are available in 
build/dist/examples/queries.
- More are available in the hive contrib sources src/test/queries/positive
+ More are available in the hive sources at ql/src/test/queries/positive
  
  ==== SELECTS and FILTERS ====
- hive> SELECT a.foo FROM invites a;
+ hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';
  
- select column 'foo' from all rows of invites table. The results are not
+ selects column 'foo' from all rows of partition <DATE> of invites table. The 
results are not
  stored anywhere, but are displayed on the console.
  
  Note that in all the examples that follow, INSERT (into a hive table, local 
  directory or HDFS directory) is optional. 
  
- hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a;
+ hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a 
WHERE a.ds='<DATE>';
  
- select all rows from invites table into an HDFS  directory. The result data 
+ selects all rows from partition <DATE> OF invites table into an HDFS  
directory. The result data 
  is in files (depending on the number of mappers) in that directory.
  NOTE: partition columns if any are selected by the use of *. They can also 
  be specified in the projection clauses.
  
+ Partitioned tables must always have a partition selected in the WHERE clause 
of the statement.
+ 
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes 
a;
  
- Select all rows from pokes table into a local directory
+ Selects all rows from pokes table into a local directory
  
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  
@@ -222, +224 @@

  
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM 
profiles a;
  
- hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
+ hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a 
WHERE a.ds='<DATE>';
  
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM 
invites a;
  
@@ -230, +232 @@

  
  Sum of a column. avg, min, max can also be used
  
- NOTE: there are some flaws with the type system that cause doubles to be 
- returned with integer types would be expected. We expect to fix these in the 
coming week.
- 
  ==== GROUP BY ====
  
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) 
WHERE a.foo > 0 GROUP BY a.bar;
  
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a 
WHERE a.foo > 0 GROUP BY a.bar;
  
- NOTE: Currently Hive always uses two stage map-reduce for groupby operation. 
This is 
- to handle skews in input data. We will be optimizing this in the coming weeks.
- 
  ==== JOIN ====
  
- hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE 
TABLE events SELECT t1.bar, t1.foo, t2.foo
+ hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE 
TABLE events SELECT t1.bar, t1.foo, t2.foo;
  
  ==== MULTITABLE INSERT ====
  
@@ -252, +248 @@

    INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
    INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 
and src.key < 200
    INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT 
src.key WHERE src.key >= 200 and src.key < 300
-   INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE 
src.key >= 300
+   INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE 
src.key >= 300;
  
  ==== STREAMING ====

[Hadoop Wiki] Trivial Update of "Hive/GettingStarted" by RodrigoSchmidt

Reply via email to