Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by RodrigoSchmidt: http://wiki.apache.org/hadoop/Hive/GettingStarted ------------------------------------------------------------------------------ == Loading/Extracting data using Queries == === Runtime configuration === - * Hives queries are executed using map-reduce queries and as such the behavior + * Hive queries are executed using map-reduce queries and, therefore, the behavior - of such queries can be controlled by the hadoop configuration variables + of such queries can be controlled by the hadoop configuration variables. - * The cli can be used to set any hadoop (or hive) configuration variable. For example: + * The cli command 'SET' can be used to set any hadoop (or hive) configuration variable. For example: o hive> SET mapred.job.tracker=myhost.mycompany.com:50030 - o hive> SET - v + o hive> SET -v - The latter shows all the current settings. Without the v option only the + The latter shows all the current settings. Without the -v option only the variables that differ from the base hadoop configuration are displayed - * In particular the number of reducers should be set to a reasonable number + * In particular, the number of reducers should be set to a reasonable number to get good performance (the default is 1!) === Example Queries === - Some example queries are shown below. They are available in examples/queries. + Some example queries are shown below. They are available in build/dist/examples/queries. - More are available in the hive contrib sources src/test/queries/positive + More are available in the hive sources at ql/src/test/queries/positive ==== SELECTS and FILTERS ==== - hive> SELECT a.foo FROM invites a; + hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>'; - select column 'foo' from all rows of invites table. The results are not + selects column 'foo' from all rows of partition <DATE> of invites table. The results are not stored anywhere, but are displayed on the console. Note that in all the examples that follow, INSERT (into a hive table, local directory or HDFS directory) is optional. - hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a; + hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>'; - select all rows from invites table into an HDFS directory. The result data + selects all rows from partition <DATE> OF invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory. NOTE: partition columns if any are selected by the use of *. They can also be specified in the projection clauses. + Partitioned tables must always have a partition selected in the WHERE clause of the statement. + hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a; - Select all rows from pokes table into a local directory + Selects all rows from pokes table into a local directory hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a; @@ -222, +224 @@ hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a; - hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a; + hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>'; hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a; @@ -230, +232 @@ Sum of a column. avg, min, max can also be used - NOTE: there are some flaws with the type system that cause doubles to be - returned with integer types would be expected. We expect to fix these in the coming week. - ==== GROUP BY ==== hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar; hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar; - NOTE: Currently Hive always uses two stage map-reduce for groupby operation. This is - to handle skews in input data. We will be optimizing this in the coming weeks. - ==== JOIN ==== - hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo + hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo; ==== MULTITABLE INSERT ==== @@ -252, +248 @@ INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100 INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200 INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300 - INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300 + INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300; ==== STREAMING ====
