Repository: incubator-impala Updated Branches: refs/heads/asf-site d96cd395e -> 75c469182
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_tutorial.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_tutorial.html b/docs/build/html/topics/impala_tutorial.html new file mode 100644 index 0000000..ca76d1e --- /dev/null +++ b/docs/build/html/topics/impala_tutorial.html @@ -0,0 +1,2270 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" cont ent="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="tutorial"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Impala Tutorials</title></head><body id="tutorial"><main role="main"><article role="article" aria-labelledby="tutorial__tutorials"> + + <h1 class="title topictitle1" id="tutorial__tutorials">Impala Tutorials</h1> + + + + <div class="body conbody"> + + <p class="p"> + This section includes tutorial scenarios that demonstrate how to begin using Impala once the software is + installed. It focuses on techniques for loading data, because once you have some data in tables and can query + that data, you can quickly progress to more advanced Impala features. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + Where practical, the tutorials take you from <span class="q">"ground zero"</span> to having the desired Impala tables and + data. In some cases, you might need to download additional files from outside sources, set up additional + software components, modify commands or scripts to fit your own configuration, or substitute your own + sample data. 
+ </p> + </div> + + <p class="p"> + Before trying these tutorial lessons, install Impala using one of these procedures: + </p> + + <ul class="ul"> + <li class="li"> + If you already have some <span class="keyword">Apache Hadoop</span> environment set up and just need to add Impala to it, + follow the installation process described in <a class="xref" href="impala_install.html#install">Installing Impala</a>. Make sure to also install the Hive + metastore service if you do not already have Hive configured. + </li> + + </ul> + + <p class="p toc inpage"></p> + </div> + + <article class="topic concept nested1" aria-labelledby="ariaid-title2" id="tutorial__tut_beginner"> + + <h2 class="title topictitle2" id="ariaid-title2">Tutorials for Getting Started</h2> + + <div class="body conbody"> + + <p class="p"> + These tutorials demonstrate the basics of using Impala. They are intended for first-time users, and for + trying out Impala on any new cluster to make sure the major components are working correctly. + </p> + + <p class="p toc inpage"></p> + </div> + + <article class="topic concept nested2" aria-labelledby="ariaid-title3" id="tut_beginner__tutorial_explore"> + + <h3 class="title topictitle3" id="ariaid-title3">Explore a New Impala Instance</h3> + + <div class="body conbody"> + + <p class="p"> + This tutorial demonstrates techniques for finding your way around the tables and databases of an + unfamiliar (possibly empty) Impala instance. + </p> + + <p class="p"> + When you connect to an Impala instance for the first time, you use the <code class="ph codeph">SHOW DATABASES</code> + and <code class="ph codeph">SHOW TABLES</code> statements to view the most common types of objects. Also, call the + <code class="ph codeph">version()</code> function to confirm which version of Impala you are running; the version + number is important when consulting documentation and dealing with support issues. + </p> + + <p class="p"> + A completely empty Impala instance contains no tables, but still has two databases: + </p> + + <ul class="ul"> + <li class="li"> + <code class="ph codeph">default</code>, where new tables are created when you do not specify any other database. + </li> + + <li class="li"> + <code class="ph codeph">_impala_builtins</code>, a system database used to hold all the built-in functions. + </li> + </ul> + + <p class="p"> + The following example shows how to see the available databases, and the tables in each. If the list of + databases or tables is long, you can use wildcard notation to locate specific databases or tables based + on their names. + </p> + +<pre class="pre codeblock"><code>$ impala-shell -i localhost --quiet +Starting Impala Shell without Kerberos authentication +Welcome to the Impala shell. Press TAB twice to see a list of available commands. +... +<span class="ph">(Shell + build version: Impala Shell v2.8.x (<var class="keyword varname">hash</var>) built on + <var class="keyword varname">date</var>)</span> +[localhost:21000] > select version(); ++------------------------------------------- +| version() ++------------------------------------------- +| impalad version ... +| Built on ... 
++------------------------------------------- +[localhost:21000] > show databases; ++--------------------------+ +| name | ++--------------------------+ +| _impala_builtins | +| ctas | +| d1 | +| d2 | +| d3 | +| default | +| explain_plans | +| external_table | +| file_formats | +| tpc | ++--------------------------+ +[localhost:21000] > select current_database(); ++--------------------+ +| current_database() | ++--------------------+ +| default | ++--------------------+ +[localhost:21000] > show tables; ++-------+ +| name | ++-------+ +| ex_t | +| t1 | ++-------+ +[localhost:21000] > show tables in d3; + +[localhost:21000] > show tables in tpc; ++------------------------+ +| name | ++------------------------+ +| city | +| customer | +| customer_address | +| customer_demographics | +| household_demographics | +| item | +| promotion | +| store | +| store2 | +| store_sales | +| ticket_view | +| time_dim | +| tpc_tables | ++------------------------+ +[localhost:21000] > show tables in tpc like 'customer*'; ++-----------------------+ +| name | ++-----------------------+ +| customer | +| customer_address | +| customer_demographics | ++-----------------------+ +</code></pre> + + <p class="p"> + Once you know what tables and databases are available, you descend into a database with the + <code class="ph codeph">USE</code> statement. To understand the structure of each table, you use the + <code class="ph codeph">DESCRIBE</code> command. Once inside a database, you can issue statements such as + <code class="ph codeph">INSERT</code> and <code class="ph codeph">SELECT</code> that operate on particular tables. + </p> + + <p class="p"> + The following example explores a database named <code class="ph codeph">TPC</code> whose name we learned in the + previous example. It shows how to filter the table names within a database based on a search string, + examine the columns of a table, and run queries to examine the characteristics of the table data. For + example, for an unfamiliar table you might want to know the number of rows, the number of different + values for a column, and other properties such as whether the column contains any <code class="ph codeph">NULL</code> + values. When sampling the actual data values from a table, use a <code class="ph codeph">LIMIT</code> clause to avoid + excessive output if the table contains more rows or distinct values than you expect. 
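+ </p>
+
+ <p class="p">
+ Several of these measurements can also be combined into a single pass over the table. The
+ following sketch uses the <code class="ph codeph">CUSTOMER</code> table that the next session
+ examines, along with the Impala <code class="ph codeph">NDV()</code> function, which returns an
+ approximate number of distinct values for a column:
+ </p>
+
+<pre class="pre codeblock"><code>-- Row count, approximate distinct values, and NULL count in one query.
+SELECT COUNT(*) AS total_rows,
+       NDV(c_salutation) AS approx_distinct_salutations,
+       COUNT(*) - COUNT(c_email_address) AS null_email_addresses
+FROM customer;
+</code></pre>
+
+ <p class="p">
+ The following session applies these exploration techniques interactively to the
+ <code class="ph codeph">TPC</code> database: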
+ </p> + +<pre class="pre codeblock"><code>[localhost:21000] > use tpc; +[localhost:21000] > show tables like '*view*'; ++-------------+ +| name | ++-------------+ +| ticket_view | ++-------------+ +[localhost:21000] > describe city; ++-------------+--------+---------+ +| name | type | comment | ++-------------+--------+---------+ +| id | int | | +| name | string | | +| countrycode | string | | +| district | string | | +| population | int | | ++-------------+--------+---------+ +[localhost:21000] > select count(*) from city; ++----------+ +| count(*) | ++----------+ +| 0 | ++----------+ +[localhost:21000] > desc customer; ++------------------------+--------+---------+ +| name | type | comment | ++------------------------+--------+---------+ +| c_customer_sk | int | | +| c_customer_id | string | | +| c_current_cdemo_sk | int | | +| c_current_hdemo_sk | int | | +| c_current_addr_sk | int | | +| c_first_shipto_date_sk | int | | +| c_first_sales_date_sk | int | | +| c_salutation | string | | +| c_first_name | string | | +| c_last_name | string | | +| c_preferred_cust_flag | string | | +| c_birth_day | int | | +| c_birth_month | int | | +| c_birth_year | int | | +| c_birth_country | string | | +| c_login | string | | +| c_email_address | string | | +| c_last_review_date | string | | ++------------------------+--------+---------+ +[localhost:21000] > select count(*) from customer; ++----------+ +| count(*) | ++----------+ +| 100000 | ++----------+ +[localhost:21000] > select count(distinct c_birth_month) from customer; ++-------------------------------+ +| count(distinct c_birth_month) | ++-------------------------------+ +| 12 | ++-------------------------------+ +[localhost:21000] > select count(*) from customer where c_email_address is null; ++----------+ +| count(*) | ++----------+ +| 0 | ++----------+ +[localhost:21000] > select distinct c_salutation from customer limit 10; ++--------------+ +| c_salutation | ++--------------+ +| Mr. | +| Ms. | +| Dr. | +| | +| Miss | +| Sir | +| Mrs. | ++--------------+ +</code></pre> + + <p class="p"> + When you graduate from read-only exploration, you use statements such as <code class="ph codeph">CREATE DATABASE</code> + and <code class="ph codeph">CREATE TABLE</code> to set up your own database objects. + </p> + + <p class="p"> + The following example demonstrates creating a new database holding a new table. Although the last example + ended inside the <code class="ph codeph">TPC</code> database, the new <code class="ph codeph">EXPERIMENTS</code> database is not + nested inside <code class="ph codeph">TPC</code>; all databases are arranged in a single top-level list. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create database experiments; +[localhost:21000] > show databases; ++--------------------------+ +| name | ++--------------------------+ +| _impala_builtins | +| ctas | +| d1 | +| d2 | +| d3 | +| default | +| experiments | +| explain_plans | +| external_table | +| file_formats | +| tpc | ++--------------------------+ +[localhost:21000] > show databases like 'exp*'; ++---------------+ +| name | ++---------------+ +| experiments | +| explain_plans | ++---------------+ +</code></pre> + + <p class="p"> + The following example creates a new table, <code class="ph codeph">T1</code>. To illustrate a common mistake, it creates this table inside + the wrong database, the <code class="ph codeph">TPC</code> database where the previous example ended. 
The <code class="ph codeph">ALTER + TABLE</code> statement lets you move the table to the intended database, <code class="ph codeph">EXPERIMENTS</code>, as part of a rename operation. + The <code class="ph codeph">USE</code> statement is always needed to switch to a new database, and the + <code class="ph codeph">current_database()</code> function confirms which database the session is in, to avoid these + kinds of mistakes. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table t1 (x int); + +[localhost:21000] > show tables; ++------------------------+ +| name | ++------------------------+ +| city | +| customer | +| customer_address | +| customer_demographics | +| household_demographics | +| item | +| promotion | +| store | +| store2 | +| store_sales | +| t1 | +| ticket_view | +| time_dim | +| tpc_tables | ++------------------------+ +[localhost:21000] > select current_database(); ++--------------------+ +| current_database() | ++--------------------+ +| tpc | ++--------------------+ +[localhost:21000] > alter table t1 rename to experiments.t1; +[localhost:21000] > use experiments; +[localhost:21000] > show tables; ++------+ +| name | ++------+ +| t1 | ++------+ +[localhost:21000] > select current_database(); ++--------------------+ +| current_database() | ++--------------------+ +| experiments | ++--------------------+ +</code></pre> + + <p class="p"> + For your initial experiments with tables, you can use ones with just a few columns and a few rows, and + text-format data files. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + As you graduate to more realistic scenarios, you will use more elaborate tables with many columns, + features such as partitioning, and file formats such as Parquet. When dealing with realistic data + volumes, you will bring in data using <code class="ph codeph">LOAD DATA</code> or <code class="ph codeph">INSERT ... SELECT</code> + statements to operate on millions or billions of rows at once. + </div> + + <p class="p"> + The following example sets up a couple of simple tables with a few rows, and performs queries involving + sorting, aggregate functions and joins. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > insert into t1 values (1), (3), (2), (4); +[localhost:21000] > select x from t1 order by x desc; ++---+ +| x | ++---+ +| 4 | +| 3 | +| 2 | +| 1 | ++---+ +[localhost:21000] > select min(x), max(x), sum(x), avg(x) from t1; ++--------+--------+--------+--------+ +| min(x) | max(x) | sum(x) | avg(x) | ++--------+--------+--------+--------+ +| 1 | 4 | 10 | 2.5 | ++--------+--------+--------+--------+ + +[localhost:21000] > create table t2 (id int, word string); +[localhost:21000] > insert into t2 values (1, "one"), (3, "three"), (5, 'five'); +[localhost:21000] > select word from t1 join t2 on (t1.x = t2.id); ++-------+ +| word | ++-------+ +| one | +| three | ++-------+ +</code></pre> + + <p class="p"> + After completing this tutorial, you should now know: + </p> + + <ul class="ul"> + <li class="li"> + How to tell which version of Impala is running on your system. + </li> + + <li class="li"> + How to find the names of databases in an Impala instance, either displaying the full list or + searching for specific names. + </li> + + <li class="li"> + How to find the names of tables in an Impala database, either displaying the full list or + searching for specific names. + </li> + + <li class="li"> + How to switch between databases and check which database you are currently in. 
+ </li> + + <li class="li"> + How to learn the column names and types of a table. + </li> + + <li class="li"> + How to create databases and tables, insert small amounts of test data, and run simple queries. + </li> + </ul> + </div> + </article> + + + + + + <article class="topic concept nested2" aria-labelledby="ariaid-title4" id="tut_beginner__tutorial_csv_setup"> + + <h3 class="title topictitle3" id="ariaid-title4">Load CSV Data from Local Files</h3> + + <div class="body conbody"> + + <p class="p"> + This scenario illustrates how to create some very small tables, suitable for first-time users to + experiment with Impala SQL features. <code class="ph codeph">TAB1</code> and <code class="ph codeph">TAB2</code> are loaded with data + from files in HDFS. A subset of data is copied from <code class="ph codeph">TAB1</code> into <code class="ph codeph">TAB3</code>. + </p> + + <p class="p"> + Populate HDFS with the data you want to query. To begin this process, create one or more new + subdirectories underneath your user directory in HDFS. The data for each table resides in a separate + subdirectory. Substitute your own username for <code class="ph codeph">username</code> where appropriate. This example + uses the <code class="ph codeph">-p</code> option with the <code class="ph codeph">mkdir</code> operation to create any necessary + parent directories if they do not already exist. + </p> + +<pre class="pre codeblock"><code>$ whoami +username +$ hdfs dfs -ls /user +Found 3 items +drwxr-xr-x - username username 0 2013-04-22 18:54 /user/username +drwxrwx--- - mapred mapred 0 2013-03-15 20:11 /user/history +drwxr-xr-x - hue supergroup 0 2013-03-15 20:10 /user/hive + +$ hdfs dfs -mkdir -p /user/username/sample_data/tab1 /user/username/sample_data/tab2</code></pre> + + <p class="p"> + Here is some sample data, for two tables named <code class="ph codeph">TAB1</code> and <code class="ph codeph">TAB2</code>. + </p> + + <p class="p"> + Copy the following content to <code class="ph codeph">.csv</code> files in your local filesystem: + </p> + + <p class="p"> + <span class="ph filepath">tab1.csv</span>: + </p> + +<pre class="pre codeblock"><code>1,true,123.123,2012-10-24 08:55:00 +2,false,1243.5,2012-10-25 13:40:00 +3,false,24453.325,2008-08-22 09:33:21.123 +4,false,243423.325,2007-05-12 22:32:21.33454 +5,true,243.325,1953-04-22 09:11:33 +</code></pre> + + <p class="p"> + <span class="ph filepath">tab2.csv</span>: + </p> + +<pre class="pre codeblock"><code>1,true,12789.123 +2,false,1243.5 +3,false,24453.325 +4,false,2423.3254 +5,true,243.325 +60,false,243565423.325 +70,true,243.325 +80,false,243423.325 +90,true,243.325 +</code></pre> + + <p class="p"> + Put each <code class="ph codeph">.csv</code> file into a separate HDFS directory using commands like the following, + which use paths available in the Impala Demo VM: + </p> + +<pre class="pre codeblock"><code>$ hdfs dfs -put tab1.csv /user/username/sample_data/tab1 +$ hdfs dfs -ls /user/username/sample_data/tab1 +Found 1 items +-rw-r--r-- 1 username username 192 2013-04-02 20:08 /user/username/sample_data/tab1/tab1.csv + + +$ hdfs dfs -put tab2.csv /user/username/sample_data/tab2 +$ hdfs dfs -ls /user/username/sample_data/tab2 +Found 1 items +-rw-r--r-- 1 username username 158 2013-04-02 20:09 /user/username/sample_data/tab2/tab2.csv +</code></pre> + + <p class="p"> + The name of each data file is not significant. 
In fact, when Impala examines the contents of the data + directory for the first time, it considers all files in the directory to make up the data of the table, + regardless of how many files there are or what the files are named. + </p> + + <p class="p"> + To understand what paths are available within your own HDFS filesystem and what the permissions are for + the various directories and files, issue <code class="ph codeph">hdfs dfs -ls /</code> and work your way down the tree + doing <code class="ph codeph">-ls</code> operations for the various directories. + </p> + + <p class="p"> + Use the <code class="ph codeph">impala-shell</code> command to create tables, either interactively or through a SQL + script. + </p> + + <p class="p"> + The following example shows creating three tables. For each table, the example shows creating columns + with various attributes such as Boolean or integer types. The example also includes commands that provide + information about how the data is formatted, such as rows terminating with commas, which makes sense in + the case of importing data from a <code class="ph codeph">.csv</code> file. Where we already have <code class="ph codeph">.csv</code> + files containing data in the HDFS directory tree, we specify the location of the directory containing the + appropriate <code class="ph codeph">.csv</code> file. Impala considers all the data from all the files in that + directory to represent the data for the table. + </p> + +<pre class="pre codeblock"><code>DROP TABLE IF EXISTS tab1; +-- The EXTERNAL clause means the data is located outside the central location +-- for Impala data files and is preserved when the associated Impala table is dropped. +-- We expect the data to already exist in the directory specified by the LOCATION clause. +CREATE EXTERNAL TABLE tab1 +( + id INT, + col_1 BOOLEAN, + col_2 DOUBLE, + col_3 TIMESTAMP +) +ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' +LOCATION '/user/username/sample_data/tab1'; + +DROP TABLE IF EXISTS tab2; +-- TAB2 is an external table, similar to TAB1. +CREATE EXTERNAL TABLE tab2 +( + id INT, + col_1 BOOLEAN, + col_2 DOUBLE +) +ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' +LOCATION '/user/username/sample_data/tab2'; + +DROP TABLE IF EXISTS tab3; +-- Leaving out the EXTERNAL clause means the data will be managed +-- in the central Impala data directory tree. Rather than reading +-- existing data files when the table is created, we load the +-- data after creating the table. +CREATE TABLE tab3 +( + id INT, + col_1 BOOLEAN, + col_2 DOUBLE, + month INT, + day INT +) +ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; +</code></pre> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + Getting through these <code class="ph codeph">CREATE TABLE</code> statements successfully is an important validation + step to confirm everything is configured correctly with the Hive metastore and HDFS permissions. If you + receive any errors during the <code class="ph codeph">CREATE TABLE</code> statements: + <ul class="ul"> + <li class="li"> + Make sure you followed the installation instructions closely, in + <a class="xref" href="impala_install.html#install">Installing Impala</a>. + </li> + + <li class="li"> + Make sure the <code class="ph codeph">hive.metastore.warehouse.dir</code> property points to a directory that + Impala can write to. 
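+ One quick way to check the directory and its current ownership (assuming the common
+ default warehouse path) is:
+<pre class="pre codeblock"><code>$ hdfs dfs -ls -d /user/hive/warehouse
+</code></pre>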
The ownership should be <code class="ph codeph">hive:hive</code>, and the + <code class="ph codeph">impala</code> user should also be a member of the <code class="ph codeph">hive</code> group. + </li> + </ul> + </div> + </div> + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title5" id="tut_beginner__tutorial_create_table"> + + <h3 class="title topictitle3" id="ariaid-title5">Point an Impala Table at Existing Data Files</h3> + + <div class="body conbody"> + + <p class="p"> + A convenient way to set up data for Impala to access is to use an external table, where the data already + exists in a set of HDFS files and you just point the Impala table at the directory containing those + files. For example, you might run in <code class="ph codeph">impala-shell</code> a <code class="ph codeph">*.sql</code> file with + contents similar to the following, to create an Impala table that accesses an existing data file used by + Hive. + </p> + + <p class="p"> + The following examples set up 2 tables, referencing the paths and sample data from the sample TPC-DS kit for Impala. + For historical reasons, the data physically resides in an HDFS directory tree under + <span class="ph filepath">/user/hive</span>, although this particular data is entirely managed by Impala rather than + Hive. When we create an external table, we specify the directory containing one or more data files, and + Impala queries the combined content of all the files inside that directory. Here is how we examine the + directories and files within the HDFS filesystem: + </p> + +<pre class="pre codeblock"><code>$ cd ~/username/datasets +$ ./tpcds-setup.sh +... Downloads and unzips the kit, builds the data and loads it into HDFS ... +$ hdfs dfs -ls /user/hive/tpcds/customer +Found 1 items +-rw-r--r-- 1 username supergroup 13209372 2013-03-22 18:09 /user/hive/tpcds/customer/customer.dat +$ hdfs dfs -cat /user/hive/tpcds/customer/customer.dat | more +1|AAAAAAAABAAAAAAA|980124|7135|32946|2452238|2452208|Mr.|Javier|Lewis|Y|9|12|1936|CHILE||Javie [email protected]|2452508| +2|AAAAAAAACAAAAAAA|819667|1461|31655|2452318|2452288|Dr.|Amy|Moses|Y|9|4|1966|TOGO||Amy.Moses@ +Ovk9KjHH.com|2452318| +3|AAAAAAAADAAAAAAA|1473522|6247|48572|2449130|2449100|Miss|Latisha|Hamilton|N|18|9|1979|NIUE|| [email protected]|2452313| +4|AAAAAAAAEAAAAAAA|1703214|3986|39558|2450030|2450000|Dr.|Michael|White|N|7|6|1983|MEXICO||Mic [email protected]|2452361| +5|AAAAAAAAFAAAAAAA|953372|4470|36368|2449438|2449408|Sir|Robert|Moran|N|8|5|1956|FIJI||Robert. [email protected]|2452469| +... +</code></pre> + + <p class="p"> + Here is a SQL script to set up Impala tables pointing to some of these data files in HDFS. + (The script in the VM sets up tables like this through Hive; ignore those tables + for purposes of this demonstration.) 
+ Save the following as <span class="ph filepath">customer_setup.sql</span>: + </p> + +<pre class="pre codeblock"><code>-- +-- store_sales fact table and surrounding dimension tables only +-- +create database tpcds; +use tpcds; + +drop table if exists customer; +create external table customer +( + c_customer_sk int, + c_customer_id string, + c_current_cdemo_sk int, + c_current_hdemo_sk int, + c_current_addr_sk int, + c_first_shipto_date_sk int, + c_first_sales_date_sk int, + c_salutation string, + c_first_name string, + c_last_name string, + c_preferred_cust_flag string, + c_birth_day int, + c_birth_month int, + c_birth_year int, + c_birth_country string, + c_login string, + c_email_address string, + c_last_review_date string +) +row format delimited fields terminated by '|' +location '/user/hive/tpcds/customer'; + +drop table if exists customer_address; +create external table customer_address +( + ca_address_sk int, + ca_address_id string, + ca_street_number string, + ca_street_name string, + ca_street_type string, + ca_suite_number string, + ca_city string, + ca_county string, + ca_state string, + ca_zip string, + ca_country string, + ca_gmt_offset float, + ca_location_type string +) +row format delimited fields terminated by '|' +location '/user/hive/tpcds/customer_address'; +</code></pre> + + <div class="p"> + We would run this script with a command such as: +<pre class="pre codeblock"><code>impala-shell -i localhost -f customer_setup.sql</code></pre> + </div> + + </div> + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title6" id="tut_beginner__tutorial_describe_impala"> + + <h3 class="title topictitle3" id="ariaid-title6">Describe the Impala Table</h3> + + <div class="body conbody"> + + <p class="p"> + Now that you have updated the database metadata that Impala caches, you can confirm that the expected + tables are accessible by Impala and examine the attributes of one of the tables. We created these tables + in the database named <code class="ph codeph">default</code>. If the tables were in a database other than the default, + we would issue a command <code class="ph codeph">use <var class="keyword varname">db_name</var> </code> to switch to that database + before examining or querying its tables. We could also qualify the name of a table by prepending the + database name, for example <code class="ph codeph">default.customer</code> and <code class="ph codeph">default.customer_name</code>. + </p> + +<pre class="pre codeblock"><code>[impala-host:21000] > show databases +Query finished, fetching results ... +default +Returned 1 row(s) in 0.00s +[impala-host:21000] > show tables +Query finished, fetching results ... 
+customer +customer_address +Returned 2 row(s) in 0.00s +[impala-host:21000] > describe customer_address ++------------------+--------+---------+ +| name | type | comment | ++------------------+--------+---------+ +| ca_address_sk | int | | +| ca_address_id | string | | +| ca_street_number | string | | +| ca_street_name | string | | +| ca_street_type | string | | +| ca_suite_number | string | | +| ca_city | string | | +| ca_county | string | | +| ca_state | string | | +| ca_zip | string | | +| ca_country | string | | +| ca_gmt_offset | float | | +| ca_location_type | string | | ++------------------+--------+---------+ +Returned 13 row(s) in 0.01 +</code></pre> + </div> + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title7" id="tut_beginner__tutorial_query_impala"> + + <h3 class="title topictitle3" id="ariaid-title7">Query the Impala Table</h3> + + <div class="body conbody"> + + <p class="p"> + You can query data contained in the tables. Impala coordinates the query execution across a single node + or multiple nodes depending on your configuration, without the overhead of running MapReduce jobs to + perform the intermediate processing. + </p> + + <p class="p"> + There are a variety of ways to execute queries on Impala: + </p> + + <ul class="ul"> + <li class="li"> + Using the <code class="ph codeph">impala-shell</code> command in interactive mode: +<pre class="pre codeblock"><code>$ impala-shell -i impala-host +Connected to localhost:21000 +[impala-host:21000] > select count(*) from customer_address; +50000 +Returned 1 row(s) in 0.37s +</code></pre> + </li> + + <li class="li"> + Passing a set of commands contained in a file: +<pre class="pre codeblock"><code>$ impala-shell -i impala-host -f myquery.sql +Connected to localhost:21000 +50000 +Returned 1 row(s) in 0.19s</code></pre> + </li> + + <li class="li"> + Passing a single command to the <code class="ph codeph">impala-shell</code> command. The query is executed, the + results are returned, and the shell exits. Make sure to quote the command, preferably with single + quotation marks to avoid shell expansion of characters such as <code class="ph codeph">*</code>. +<pre class="pre codeblock"><code>$ impala-shell -i impala-host -q 'select count(*) from customer_address' +Connected to localhost:21000 +50000 +Returned 1 row(s) in 0.29s</code></pre> + </li> + </ul> + </div> + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title8" id="tut_beginner__tutorial_etl"> + + <h3 class="title topictitle3" id="ariaid-title8">Data Loading and Querying Examples</h3> + + <div class="body conbody"> + + <p class="p"> + This section describes how to create some sample tables and load data into them. These tables can then be + queried using the Impala shell. + </p> + </div> + + <article class="topic concept nested3" aria-labelledby="ariaid-title9" id="tutorial_etl__tutorial_loading"> + + <h4 class="title topictitle4" id="ariaid-title9">Loading Data</h4> + + <div class="body conbody"> + + <p class="p"> + Loading data involves: + </p> + + <ul class="ul"> + <li class="li"> + Establishing a data set. The example below uses <code class="ph codeph">.csv</code> files. + </li> + + <li class="li"> + Creating tables to which to load data. + </li> + + <li class="li"> + Loading the data into the tables you created. 
+ </li> + </ul> + + + + + + + </div> + </article> + + <article class="topic concept nested3" aria-labelledby="ariaid-title10" id="tutorial_etl__tutorial_queries"> + + <h4 class="title topictitle4" id="ariaid-title10">Sample Queries</h4> + + <div class="body conbody"> + + <p class="p"> + To run these sample queries, create a SQL query file <code class="ph codeph">query.sql</code>, copy and paste each + query into the query file, and then run the query file using the shell. For example, to run + <code class="ph codeph">query.sql</code> on <code class="ph codeph">impala-host</code>, you might use the command: + </p> + +<pre class="pre codeblock"><code>impala-shell.sh -i impala-host -f query.sql</code></pre> + + <p class="p"> + The examples and results below assume you have loaded the sample data into the tables as described + above. + </p> + + <div class="example"><h5 class="title sectiontitle">Example: Examining Contents of Tables</h5> + + + + <p class="p"> + Let's start by verifying that the tables do contain the data we expect. Because Impala often deals + with tables containing millions or billions of rows, when examining tables of unknown size, include + the <code class="ph codeph">LIMIT</code> clause to avoid huge amounts of unnecessary output, as in the final query. + (If your interactive query starts displaying an unexpected volume of data, press + <code class="ph codeph">Ctrl-C</code> in <code class="ph codeph">impala-shell</code> to cancel the query.) + </p> + +<pre class="pre codeblock"><code>SELECT * FROM tab1; +SELECT * FROM tab2; +SELECT * FROM tab2 LIMIT 5;</code></pre> + + <p class="p"> + Results: + </p> + +<pre class="pre codeblock"><code>+----+-------+------------+-------------------------------+ +| id | col_1 | col_2 | col_3 | ++----+-------+------------+-------------------------------+ +| 1 | true | 123.123 | 2012-10-24 08:55:00 | +| 2 | false | 1243.5 | 2012-10-25 13:40:00 | +| 3 | false | 24453.325 | 2008-08-22 09:33:21.123000000 | +| 4 | false | 243423.325 | 2007-05-12 22:32:21.334540000 | +| 5 | true | 243.325 | 1953-04-22 09:11:33 | ++----+-------+------------+-------------------------------+ + ++----+-------+---------------+ +| id | col_1 | col_2 | ++----+-------+---------------+ +| 1 | true | 12789.123 | +| 2 | false | 1243.5 | +| 3 | false | 24453.325 | +| 4 | false | 2423.3254 | +| 5 | true | 243.325 | +| 60 | false | 243565423.325 | +| 70 | true | 243.325 | +| 80 | false | 243423.325 | +| 90 | true | 243.325 | ++----+-------+---------------+ + ++----+-------+-----------+ +| id | col_1 | col_2 | ++----+-------+-----------+ +| 1 | true | 12789.123 | +| 2 | false | 1243.5 | +| 3 | false | 24453.325 | +| 4 | false | 2423.3254 | +| 5 | true | 243.325 | ++----+-------+-----------+</code></pre> + + </div> + + <div class="example"><h5 class="title sectiontitle">Example: Aggregate and Join</h5> + + + +<pre class="pre codeblock"><code>SELECT tab1.col_1, MAX(tab2.col_2), MIN(tab2.col_2) +FROM tab2 JOIN tab1 USING (id) +GROUP BY col_1 ORDER BY 1 LIMIT 5;</code></pre> + + <p class="p"> + Results: + </p> + +<pre class="pre codeblock"><code>+-------+-----------------+-----------------+ +| col_1 | max(tab2.col_2) | min(tab2.col_2) | ++-------+-----------------+-----------------+ +| false | 24453.325 | 1243.5 | +| true | 12789.123 | 243.325 | ++-------+-----------------+-----------------+</code></pre> + + </div> + + <div class="example"><h5 class="title sectiontitle">Example: Subquery, Aggregate and Joins</h5> + + + +<pre class="pre codeblock"><code>SELECT tab2.* +FROM tab2, 
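+-- The inline view below (subquery1) finds the largest col_2 value for each col_1 group;
+-- the outer query then returns the TAB2 rows whose col_2 matches one of those maximums.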
+(SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2 + FROM tab2, tab1 + WHERE tab1.id = tab2.id + GROUP BY col_1) subquery1 +WHERE subquery1.max_col2 = tab2.col_2;</code></pre> + + <p class="p"> + Results: + </p> + +<pre class="pre codeblock"><code>+----+-------+-----------+ +| id | col_1 | col_2 | ++----+-------+-----------+ +| 1 | true | 12789.123 | +| 3 | false | 24453.325 | ++----+-------+-----------+</code></pre> + + </div> + + <div class="example"><h5 class="title sectiontitle">Example: INSERT Query</h5> + + + +<pre class="pre codeblock"><code>INSERT OVERWRITE TABLE tab3 +SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3) +FROM tab1 WHERE YEAR(col_3) = 2012;</code></pre> + + <p class="p"> + Query <code class="ph codeph">TAB3</code> to check the result: + </p> + +<pre class="pre codeblock"><code>SELECT * FROM tab3; +</code></pre> + + <p class="p"> + Results: + </p> + +<pre class="pre codeblock"><code>+----+-------+---------+-------+-----+ +| id | col_1 | col_2 | month | day | ++----+-------+---------+-------+-----+ +| 1 | true | 123.123 | 10 | 24 | +| 2 | false | 1243.5 | 10 | 25 | ++----+-------+---------+-------+-----+</code></pre> + + </div> + </div> + </article> + </article> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title11" id="tutorial__tut_advanced"> + + <h2 class="title topictitle2" id="ariaid-title11">Advanced Tutorials</h2> + + <div class="body conbody"> + + <p class="p"> + These tutorials walk you through advanced scenarios or specialized features. + </p> + + <p class="p toc inpage"></p> + </div> + + <article class="topic concept nested2" aria-labelledby="ariaid-title12" id="tut_advanced__tut_external_partition_data"> + + <h3 class="title topictitle3" id="ariaid-title12">Attaching an External Partitioned Table to an HDFS Directory Structure</h3> + + <div class="body conbody"> + + <p class="p"> + This tutorial shows how you might set up a directory tree in HDFS, put data files into the lowest-level + subdirectories, and then use an Impala external table to query the data files from their original + locations. + </p> + + <p class="p"> + The tutorial uses a table with web log data, with separate subdirectories for the year, month, day, and + host. For simplicity, we use a tiny amount of CSV data, loading the same data into each partition. + </p> + + <p class="p"> + First, we make an Impala partitioned table for CSV data, and look at the underlying HDFS directory + structure to understand the directory structure to re-create elsewhere in HDFS. The columns + <code class="ph codeph">field1</code>, <code class="ph codeph">field2</code>, and <code class="ph codeph">field3</code> correspond to the contents + of the CSV data files. The <code class="ph codeph">year</code>, <code class="ph codeph">month</code>, <code class="ph codeph">day</code>, and + <code class="ph codeph">host</code> columns are all represented as subdirectories within the table structure, and are + not part of the CSV files. We use <code class="ph codeph">STRING</code> for each of these columns so that we can + produce consistent subdirectory names, with leading zeros for a consistent length. 
+ </p> + +<pre class="pre codeblock"><code>create database external_partitions; +use external_partitions; +create table logs (field1 string, field2 string, field3 string) + partitioned by (year string, month string , day string, host string) + row format delimited fields terminated by ','; +insert into logs partition (year="2013", month="07", day="28", host="host1") values ("foo","foo","foo"); +insert into logs partition (year="2013", month="07", day="28", host="host2") values ("foo","foo","foo"); +insert into logs partition (year="2013", month="07", day="29", host="host1") values ("foo","foo","foo"); +insert into logs partition (year="2013", month="07", day="29", host="host2") values ("foo","foo","foo"); +insert into logs partition (year="2013", month="08", day="01", host="host1") values ("foo","foo","foo"); +</code></pre> + + <p class="p"> + Back in the Linux shell, we examine the HDFS directory structure. (Your Impala data directory might be in + a different location; for historical reasons, it is sometimes under the HDFS path + <span class="ph filepath">/user/hive/warehouse</span>.) We use the <code class="ph codeph">hdfs dfs -ls</code> command to examine + the nested subdirectories corresponding to each partitioning column, with separate subdirectories at each + level (with <code class="ph codeph">=</code> in their names) representing the different values for each partitioning + column. When we get to the lowest level of subdirectory, we use the <code class="ph codeph">hdfs dfs -cat</code> + command to examine the data file and see CSV-formatted data produced by the <code class="ph codeph">INSERT</code> + statement in Impala. + </p> + +<pre class="pre codeblock"><code>$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db +Found 1 items +drwxrwxrwt - impala hive 0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs +$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs +Found 1 items +drwxr-xr-x - impala hive 0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs/year=2013 +$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013 +Found 2 items +drwxr-xr-x - impala hive 0 2013-08-07 12:23 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07 +drwxr-xr-x - impala hive 0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=08 +$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07 +Found 2 items +drwxr-xr-x - impala hive 0 2013-08-07 12:22 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28 +drwxr-xr-x - impala hive 0 2013-08-07 12:23 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=29 +$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28 +Found 2 items +drwxr-xr-x - impala hive 0 2013-08-07 12:21 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1 +drwxr-xr-x - impala hive 0 2013-08-07 12:22 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host2 +$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1 +Found 1 items +-rw-r--r-- 3 impala hive 12 2013-08-07 12:21 /user/impala/warehouse/external_partiti +ons.db/logs/year=2013/month=07/day=28/host=host1/3981726974111751120--8907184999369517436_822630111_data.0 +$ hdfs dfs -cat /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/\ +host=host1/3981726974111751120--8 
907184999369517436_822630111_data.0 +foo,foo,foo +</code></pre> + + <p class="p"> + Still in the Linux shell, we use <code class="ph codeph">hdfs dfs -mkdir</code> to create several data directories + outside the HDFS directory tree that Impala controls (<span class="ph filepath">/user/impala/warehouse</span> in this + example, maybe different in your case). Depending on your configuration, you might need to log in as a + user with permission to write into this HDFS directory tree; for example, the commands shown here were + run while logged in as the <code class="ph codeph">hdfs</code> user. + </p> + +<pre class="pre codeblock"><code>$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host1 +$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host2 +$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host1 +$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=29/host=host1 +$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=08/day=01/host=host1 +</code></pre> + + <p class="p"> + We make a tiny CSV file, with values different than in the <code class="ph codeph">INSERT</code> statements used + earlier, and put a copy within each subdirectory that we will use as an Impala partition. + </p> + +<pre class="pre codeblock"><code>$ cat >dummy_log_data +bar,baz,bletch +$ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=08/day=01/host=host1 +$ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=07/day=28/host=host1 +$ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=07/day=28/host=host2 +$ hdfs dfs -mkdir -p /user/impala/data/external_partitions/year=2013/month=07/day=29/host=host1 +$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host1 +$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host2 +$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=29/host=host1 +$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=08/day=01/host=host1 +</code></pre> + + <p class="p"> + Back in the <span class="keyword cmdname">impala-shell</span> interpreter, we move the original Impala-managed table aside, + and create a new <em class="ph i">external</em> table with a <code class="ph codeph">LOCATION</code> clause pointing to the directory + under which we have set up all the partition subdirectories and data files. + </p> + +<pre class="pre codeblock"><code>use external_partitions; +alter table logs rename to logs_original; +create external table logs (field1 string, field2 string, field3 string) + partitioned by (year string, month string, day string, host string) + row format delimited fields terminated by ',' + location '/user/impala/data/logs'; +</code></pre> + + <p class="p"> + Because partition subdirectories and data files come and go during the data lifecycle, you must identify + each of the partitions through an <code class="ph codeph">ALTER TABLE</code> statement before Impala recognizes the + data files they contain. 
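+ </p>
+
+ <p class="p">
+ In Impala 2.3 and higher, the <code class="ph codeph">ALTER TABLE ... RECOVER PARTITIONS</code>
+ statement can detect all the partition directories that follow the
+ <code class="ph codeph">key=value</code> naming convention under the table's
+ <code class="ph codeph">LOCATION</code> and add them in a single operation. A sketch, assuming
+ the directory layout created above:
+ </p>
+
+<pre class="pre codeblock"><code>alter table logs recover partitions;
+</code></pre>
+
+ <p class="p">
+ In this tutorial, we add each of the partitions explicitly: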
+ </p> + +<pre class="pre codeblock"><code>alter table logs add partition (year="2013",month="07",day="28",host="host1") +alter table log_type add partition (year="2013",month="07",day="28",host="host2"); +alter table log_type add partition (year="2013",month="07",day="29",host="host1"); +alter table log_type add partition (year="2013",month="08",day="01",host="host1"); +</code></pre> + + <p class="p"> + We issue a <code class="ph codeph">REFRESH</code> statement for the table, always a safe practice when data files have + been manually added, removed, or changed. Then the data is ready to be queried. The <code class="ph codeph">SELECT + *</code> statement illustrates that the data from our trivial CSV file was recognized in each of the + partitions where we copied it. Although in this case there are only a few rows, we include a + <code class="ph codeph">LIMIT</code> clause on this test query just in case there is more data than we expect. + </p> + +<pre class="pre codeblock"><code>refresh log_type; +select * from log_type limit 100; ++--------+--------+--------+------+-------+-----+-------+ +| field1 | field2 | field3 | year | month | day | host | ++--------+--------+--------+------+-------+-----+-------+ +| bar | baz | bletch | 2013 | 07 | 28 | host1 | +| bar | baz | bletch | 2013 | 08 | 01 | host1 | +| bar | baz | bletch | 2013 | 07 | 29 | host1 | +| bar | baz | bletch | 2013 | 07 | 28 | host2 | ++--------+--------+--------+------+-------+-----+-------+ +</code></pre> + </div> + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title13" id="tut_advanced__tutorial_impala_hive"> + + <h3 class="title topictitle3" id="ariaid-title13">Switching Back and Forth Between Impala and Hive</h3> + + <div class="body conbody"> + + <p class="p"> + Sometimes, you might find it convenient to switch to the Hive shell to perform some data loading or + transformation operation, particularly on file formats such as RCFile, SequenceFile, and Avro that Impala + currently can query but not write to. + </p> + + <p class="p"> + Whenever you create, drop, or alter a table or other kind of object through Hive, the next time you + switch back to the <span class="keyword cmdname">impala-shell</span> interpreter, issue a one-time <code class="ph codeph">INVALIDATE + METADATA</code> statement so that Impala recognizes the new or changed object. + </p> + + <p class="p"> + Whenever you load, insert, or change data in an existing table through Hive (or even through manual HDFS + operations such as the <span class="keyword cmdname">hdfs</span> command), the next time you switch back to the + <span class="keyword cmdname">impala-shell</span> interpreter, issue a one-time <code class="ph codeph">REFRESH + <var class="keyword varname">table_name</var></code> statement so that Impala recognizes the new or changed data. + </p> + + <p class="p"> + For examples showing how this process works for the <code class="ph codeph">REFRESH</code> statement, look at the + examples of creating RCFile and SequenceFile tables in Impala, loading data through Hive, and then + querying the data through Impala. See <a class="xref" href="impala_rcfile.html#rcfile">Using the RCFile File Format with Impala Tables</a> and + <a class="xref" href="impala_seqfile.html#seqfile">Using the SequenceFile File Format with Impala Tables</a> for those examples. 
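+ </p>
+
+ <p class="p">
+ As a quick sketch of that workflow (the table names here are hypothetical placeholders), the
+ statements you issue in the <span class="keyword cmdname">impala-shell</span> interpreter after
+ working in Hive look like the following:
+ </p>
+
+<pre class="pre codeblock"><code>-- After creating, dropping, or altering a table through Hive:
+INVALIDATE METADATA table_created_in_hive;
+
+-- After loading, inserting, or changing data in an existing table through Hive
+-- (or through manual HDFS operations):
+REFRESH table_loaded_in_hive;
+</code></pre>
+
+ <p class="p">
+ The same pattern applies regardless of the file format involved.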
+ </p> + + <p class="p"> + For examples showing how this process works for the <code class="ph codeph">INVALIDATE METADATA</code> statement, look + at the example of creating and loading an Avro table in Hive, and then querying the data through Impala. + See <a class="xref" href="impala_avro.html#avro">Using the Avro File Format with Impala Tables</a> for that example. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + Originally, Impala did not support UDFs, but this feature is available in Impala starting in Impala + 1.2. Some <code class="ph codeph">INSERT ... SELECT</code> transformations that you originally did through Hive can + now be done through Impala. See <a class="xref" href="impala_udf.html#udfs">Impala User-Defined Functions (UDFs)</a> for details. + </p> + + <p class="p"> + Prior to Impala 1.2, the <code class="ph codeph">REFRESH</code> and <code class="ph codeph">INVALIDATE METADATA</code> statements + needed to be issued on each Impala node to which you connected and issued queries. In Impala 1.2 and + higher, when you issue either of those statements on any Impala node, the results are broadcast to all + the Impala nodes in the cluster, making it truly a one-step operation after each round of DDL or ETL + operations in Hive. + </p> + </div> + </div> + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title14" id="tut_advanced__tut_cross_join"> + + <h3 class="title topictitle3" id="ariaid-title14">Cross Joins and Cartesian Products with the CROSS JOIN Operator</h3> + + <div class="body conbody"> + + <p class="p"> + Originally, Impala restricted join queries so that they had to include at least one equality comparison + between the columns of the tables on each side of the join operator. With the huge tables typically + processed by Impala, any miscoded query that produced a full Cartesian product as a result set could + consume a huge amount of cluster resources. + </p> + + <p class="p"> + In Impala 1.2.2 and higher, this restriction is lifted when you use the <code class="ph codeph">CROSS JOIN</code> + operator in the query. You still cannot remove all <code class="ph codeph">WHERE</code> clauses from a query like + <code class="ph codeph">SELECT * FROM t1 JOIN t2</code> to produce all combinations of rows from both tables. But you + can use the <code class="ph codeph">CROSS JOIN</code> operator to explicitly request such a Cartesian product. + Typically, this operation is applicable for smaller tables, where the result set still fits within the + memory of a single Impala node. + </p> + + <p class="p"> + The following example sets up data for use in a series of comic books where characters battle each other. + At first, we use an equijoin query, which only allows characters from the same time period and the same + planet to meet. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table heroes (name string, era string, planet string); +[localhost:21000] > create table villains (name string, era string, planet string); +[localhost:21000] > insert into heroes values + > ('Tesla','20th century','Earth'), + > ('Pythagoras','Antiquity','Earth'), + > ('Zopzar','Far Future','Mars'); +Inserted 3 rows in 2.28s +[localhost:21000] > insert into villains values + > ('Caligula','Antiquity','Earth'), + > ('John Dillinger','20th century','Earth'), + > ('Xibulor','Far Future','Venus'); +Inserted 3 rows in 1.93s +[localhost:21000] > select concat(heroes.name,' vs. 
',villains.name) as battle + > from heroes join villains + > where heroes.era = villains.era and heroes.planet = villains.planet; ++--------------------------+ +| battle | ++--------------------------+ +| Tesla vs. John Dillinger | +| Pythagoras vs. Caligula | ++--------------------------+ +Returned 2 row(s) in 0.47s</code></pre> + + <p class="p"> + Readers demanded more action, so we added elements of time travel and space travel so that any hero could + face any villain. Prior to Impala 1.2.2, this type of query was impossible because all joins had to + reference matching values between the two tables: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > -- Cartesian product not possible in Impala 1.1. + > select concat(heroes.name,' vs. ',villains.name) as battle from heroes join villains; +ERROR: NotImplementedException: Join between 'heroes' and 'villains' requires at least one conjunctive equality predicate between the two tables</code></pre> + + <p class="p"> + With Impala 1.2.2, we rewrite the query slightly to use <code class="ph codeph">CROSS JOIN</code> rather than + <code class="ph codeph">JOIN</code>, and now the result set includes all combinations: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > -- Cartesian product available in Impala 1.2.2 with the CROSS JOIN syntax. + > select concat(heroes.name,' vs. ',villains.name) as battle from heroes cross join villains; ++-------------------------------+ +| battle | ++-------------------------------+ +| Tesla vs. Caligula | +| Tesla vs. John Dillinger | +| Tesla vs. Xibulor | +| Pythagoras vs. Caligula | +| Pythagoras vs. John Dillinger | +| Pythagoras vs. Xibulor | +| Zopzar vs. Caligula | +| Zopzar vs. John Dillinger | +| Zopzar vs. Xibulor | ++-------------------------------+ +Returned 9 row(s) in 0.33s</code></pre> + + <p class="p"> + The full combination of rows from both tables is known as the Cartesian product. This type of result set + is often used for creating grid data structures. You can also filter the result set by including + <code class="ph codeph">WHERE</code> clauses that do not explicitly compare columns between the two tables. The + following example shows how you might produce a list of combinations of year and quarter for use in a + chart, and then a shorter list with only selected quarters. 
+ </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table x_axis (x int); +[localhost:21000] > create table y_axis (y int); +[localhost:21000] > insert into x_axis values (1),(2),(3),(4); +Inserted 4 rows in 2.14s +[localhost:21000] > insert into y_axis values (2010),(2011),(2012),(2013),(2014); +Inserted 5 rows in 1.32s +[localhost:21000] > select y as year, x as quarter from x_axis cross join y_axis; ++------+---------+ +| year | quarter | ++------+---------+ +| 2010 | 1 | +| 2011 | 1 | +| 2012 | 1 | +| 2013 | 1 | +| 2014 | 1 | +| 2010 | 2 | +| 2011 | 2 | +| 2012 | 2 | +| 2013 | 2 | +| 2014 | 2 | +| 2010 | 3 | +| 2011 | 3 | +| 2012 | 3 | +| 2013 | 3 | +| 2014 | 3 | +| 2010 | 4 | +| 2011 | 4 | +| 2012 | 4 | +| 2013 | 4 | +| 2014 | 4 | ++------+---------+ +Returned 20 row(s) in 0.38s +[localhost:21000] > select y as year, x as quarter from x_axis cross join y_axis where x in (1,3); ++------+---------+ +| year | quarter | ++------+---------+ +| 2010 | 1 | +| 2011 | 1 | +| 2012 | 1 | +| 2013 | 1 | +| 2014 | 1 | +| 2010 | 3 | +| 2011 | 3 | +| 2012 | 3 | +| 2013 | 3 | +| 2014 | 3 | ++------+---------+ +Returned 10 row(s) in 0.39s</code></pre> + </div> + </article> + + + </article> + + + + <article class="topic concept nested1" aria-labelledby="ariaid-title15" id="tutorial__tut_parquet_schemaless"> + + <h2 class="title topictitle2" id="ariaid-title15">Dealing with Parquet Files with Unknown Schema</h2> + + + <div class="body conbody"> + + <p class="p"> + As data pipelines start to include more aspects such as NoSQL or loosely specified schemas, you might encounter + situations where you have data files (particularly in Parquet format) where you do not know the precise table definition. + This tutorial shows how you can build an Impala table around data that comes from non-Impala or even non-SQL sources, + where you do not have control of the table layout and might not be familiar with the characteristics of the data. + </p> + +<p class="p"> +The data used in this tutorial represents airline on-time arrival statistics, from October 1987 through April 2008. +See the details on the <a class="xref" href="http://stat-computing.org/dataexpo/2009/" target="_blank">2009 ASA Data Expo web site</a>. +You can also see the <a class="xref" href="http://stat-computing.org/dataexpo/2009/the-data.html" target="_blank">explanations of the columns</a>; +for purposes of this exercise, wait until after following the tutorial before examining the schema, to better simulate +a real-life situation where you cannot rely on assumptions and assertions about the ranges and representations of +data values. +</p> + +<p class="p"> +We will download Parquet files containing this data from the Ibis blog. +First, we download and unpack the data files. +There are 8 files totalling 1.4 GB. +Each file is less than 256 MB. +</p> + +<pre class="pre codeblock"><code>$ wget -O airlines_parquet.tar.gz https://www.dropbox.com/s/ol9x51tqp6cv4yc/airlines_parquet.tar.gz?dl=0 +... 
+Length: 1245204740 (1.2G) [application/octet-stream]
+Saving to: ‘airlines_parquet.tar.gz’
+
+2015-08-12 17:14:24 (23.6 MB/s) - ‘airlines_parquet.tar.gz’ saved [1245204740/1245204740]
+
+$ tar xvzf airlines_parquet.tar.gz
+airlines_parquet/
+airlines_parquet/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
+airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
+airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
+airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
+airlines_parquet/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
+airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
+airlines_parquet/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
+airlines_parquet/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
+$ cd airlines_parquet/
+$ du -kch *.parq
+253M 93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
+65M 93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
+156M 93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
+240M 93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
+253M 93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
+16M 93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
+177M 93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
+213M 93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
+1.4G total
+</code></pre>
+
+<p class="p">
+Next, we put the Parquet data files in HDFS, all together in a single directory,
+with permissions on the directory and the files so that the <code class="ph codeph">impala</code>
+user will be able to read them.
+</p>
+
+<div class="note note note_note"><span class="note__title notetitle">Note:</span>
+After unpacking, we saw the largest Parquet file was 253 MB.
+When copying Parquet files into HDFS for Impala to use,
+for maximum query performance, make sure that each file resides in a single HDFS data block.
+Therefore, we pick a size larger than any single file and specify that as the block size, using the argument
+<code class="ph codeph">-Ddfs.block.size=256m</code> on the <code class="ph codeph">hdfs dfs -put</code> command.
+</div> + +<pre class="pre codeblock"><code>$ hdfs dfs -mkdir -p hdfs://demo_host.example.com:8020/user/impala/staging/airlines +$ hdfs dfs -Ddfs.block.size=256m -put *.parq /user/impala/staging/airlines +$ hdfs dfs -ls /user/impala/staging +Found 1 items +drwxrwxrwx - hdfs supergroup 0 2015-08-12 13:52 /user/impala/staging/airlines +$ hdfs dfs -ls hdfs://demo_host.example.com:8020/user/impala/staging/airlines +Found 8 items +-rw-r--r-- 3 jrussell supergroup 265107489 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq +-rw-r--r-- 3 jrussell supergroup 67544715 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq +-rw-r--r-- 3 jrussell supergroup 162556490 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq +-rw-r--r-- 3 jrussell supergroup 251603518 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq +-rw-r--r-- 3 jrussell supergroup 265186603 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq +-rw-r--r-- 3 jrussell supergroup 16663754 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq +-rw-r--r-- 3 jrussell supergroup 185511677 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq +-rw-r--r-- 3 jrussell supergroup 222794621 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq +</code></pre> + +<p class="p"> +With the files in an accessible location in HDFS, we create a database table that uses the data in those files. +The <code class="ph codeph">CREATE EXTERNAL</code> syntax and the <code class="ph codeph">LOCATION</code> attribute point Impala at the appropriate HDFS directory. +The <code class="ph codeph">LIKE PARQUET '<var class="keyword varname">path_to_any_parquet_file</var>'</code> clause means we skip the list of column names and types; +Impala automatically gets the column names and data types straight from the data files. +(Currently, this technique only works for Parquet files.) +We ignore the warning about lack of <code class="ph codeph">READ_WRITE</code> access to the files in HDFS; +the <code class="ph codeph">impala</code> user can read the files, which will be sufficient for us to experiment with +queries and perform some copy and transform operations into other tables. +</p> + +<pre class="pre codeblock"><code>$ impala-shell -i localhost +Starting Impala Shell without Kerberos authentication + +Connected to localhost:21000 +<span class="ph">Server version: impalad version 2.8.x (build + x.y.z)</span> +Welcome to the Impala shell. Press TAB twice to see a list of available commands. +... 
+<span class="ph">(Shell + build version: Impala Shell v2.8.x (<var class="keyword varname">hash</var>) built on + <var class="keyword varname">date</var>)</span> +[localhost:21000] > create database airline_data; +[localhost:21000] > use airline_data; +[localhost:21000] > create external table airlines_external + > like parquet 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq' + > stored as parquet location 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines'; +WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://demo_host.example.com:8020/user/impala/staging' +</code></pre> + +<p class="p"> +With the table created, we examine its physical and logical characteristics to confirm that the data is really +there and in a format and shape that we can work with. +The <code class="ph codeph">SHOW TABLE STATS</code> statement gives a very high-level summary of the table, +showing how many files and how much total data it contains. +Also, it confirms that the table is expecting all the associated data files to be in Parquet format. +(The ability to work with all kinds of HDFS data files in different formats means that it is +possible to have a mismatch between the format of the data files, and the format +that the table expects the data files to be in.) +The <code class="ph codeph">SHOW FILES</code> statement confirms that the data in the table has the expected number, +names, and sizes of the original Parquet files. +The <code class="ph codeph">DESCRIBE</code> statement (or its abbreviation <code class="ph codeph">DESC</code>) confirms the names and types +of the columns that Impala automatically created after reading that metadata from the Parquet file. +The <code class="ph codeph">DESCRIBE FORMATTED</code> statement prints out some extra detail along with the column definitions; +the pieces we care about for this exercise are the containing database for the table, +the location of the associated data files in HDFS, the fact that it's an external table so Impala will not +delete the HDFS files when we finish the experiments and drop the table, and the fact that the +table is set up to work exclusively with files in the Parquet format. 
+</p> + +<pre class="pre codeblock"><code>[localhost:21000] > show table stats airlines_external; ++-------+--------+--------+--------------+-------------------+---------+-------------------+ +| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | ++-------+--------+--------+--------------+-------------------+---------+-------------------+ +| -1 | 8 | 1.34GB | NOT CACHED | NOT CACHED | PARQUET | false | ++-------+--------+--------+--------------+-------------------+---------+-------------------+ +[localhost:21000] > show files in airlines_external; ++----------------------------------------------------------------------------------------+----------+-----------+ +| path | size | partition | ++----------------------------------------------------------------------------------------+----------+-----------+ +| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq | 252.83MB | | +| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq | 64.42MB | | +| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq | 155.03MB | | +| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq | 239.95MB | | +| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq | 252.90MB | | +| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq | 15.89MB | | +| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq | 176.92MB | | +| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq | 212.47MB | | ++----------------------------------------------------------------------------------------+----------+-----------+ +[localhost:21000] > describe airlines_external; ++---------------------+--------+---------------------------------------------------+ +| name | type | comment | ++---------------------+--------+---------------------------------------------------+ +| year | int | inferred from: optional int32 year | +| month | int | inferred from: optional int32 month | +| day | int | inferred from: optional int32 day | +| dayofweek | int | inferred from: optional int32 dayofweek | +| dep_time | int | inferred from: optional int32 dep_time | +| crs_dep_time | int | inferred from: optional int32 crs_dep_time | +| arr_time | int | inferred from: optional int32 arr_time | +| crs_arr_time | int | inferred from: optional int32 crs_arr_time | +| carrier | string | inferred from: optional binary carrier | +| flight_num | int | inferred from: optional int32 flight_num | +| tail_num | int | inferred from: optional int32 tail_num | +| actual_elapsed_time | int | inferred from: optional int32 actual_elapsed_time | +| crs_elapsed_time | int | inferred from: optional int32 crs_elapsed_time | +| airtime | int | inferred from: optional int32 airtime | +| arrdelay | int | inferred from: optional int32 arrdelay | +| depdelay | int | inferred from: optional int32 depdelay | +| origin | string | inferred from: optional binary origin | +| dest | string | inferred from: optional binary dest | +| distance | int | inferred from: optional int32 distance | +| taxi_in | int | inferred from: optional int32 taxi_in | +| taxi_out | int | inferred from: optional int32 taxi_out | +| cancelled | int | inferred from: optional int32 cancelled | +| cancellation_code | string | inferred from: optional binary cancellation_code | +| diverted | int | inferred from: optional 
int32 diverted | +| carrier_delay | int | inferred from: optional int32 carrier_delay | +| weather_delay | int | inferred from: optional int32 weather_delay | +| nas_delay | int | inferred from: optional int32 nas_delay | +| security_delay | int | inferred from: optional int32 security_delay | +| late_aircraft_delay | int | inferred from: optional int32 late_aircraft_delay | ++---------------------+--------+---------------------------------------------------+ +[localhost:21000] > desc formatted airlines_external; ++------------------------------+------------------------------- +| name | type ++------------------------------+------------------------------- +... +| # Detailed Table Information | NULL +| Database: | airline_data +| Owner: | jrussell +... +| Location: | /user/impala/staging/airlines +| Table Type: | EXTERNAL_TABLE +... +| # Storage Information | NULL +| SerDe Library: | parquet.hive.serde.ParquetHiveSerDe +| InputFormat: | parquet.hive.DeprecatedParquetInputFormat +| OutputFormat: | parquet.hive.DeprecatedParquetOutputFormat +... +</code></pre> + +<p class="p"> +Now that we are confident that the connections are solid between the Impala table and the +underlying Parquet files, we run some initial queries to understand the characteristics +of the data: the overall number of rows, and the ranges and how many +different values are in certain columns. +For convenience in understanding the magnitude of the <code class="ph codeph">COUNT(*)</code> +result, we run another query dividing the number of rows by 1 million, demonstrating that there are 123 million rows in the table. +</p> + + + +<pre class="pre codeblock"><code>[localhost:21000] > select count(*) from airlines_external; ++-----------+ +| count(*) | ++-----------+ +| 123534969 | ++-----------+ +Fetched 1 row(s) in 1.32s +[localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external; ++------------------+ +| millions of rows | ++------------------+ +| 123.534969 | ++------------------+ +Fetched 1 row(s) in 1.24s +</code></pre> + +<p class="p"> The <code class="ph codeph">NDV()</code> function stands for <span class="q">"number of distinct + values"</span>, which for performance reasons is an estimate when there + are lots of different values in the column, but is precise when the + cardinality is less than 16 K. Use <code class="ph codeph">NDV()</code> calls for this + kind of exploration rather than <code class="ph codeph">COUNT(DISTINCT + <var class="keyword varname">colname</var>)</code>, because Impala can evaluate + multiple <code class="ph codeph">NDV()</code> functions in a single query, but only a + single instance of <code class="ph codeph">COUNT DISTINCT</code>. Here we see that + there are modest numbers of different airlines, flight numbers, and + origin and destination airports. Two things jump out from this query: + the number of <code class="ph codeph">tail_num</code> values is much smaller than we + might have expected, and there are more destination airports than origin + airports. Let's dig further. What we find is that most + <code class="ph codeph">tail_num</code> values are <code class="ph codeph">NULL</code>. It looks + like this was an experimental column that wasn't filled in accurately. + We make a mental note that if we use this data as a starting point, + we'll ignore this column. 
We also find that certain airports are
+ represented in the <code class="ph codeph">DEST</code> column but not the
+ <code class="ph codeph">ORIGIN</code> column; now we know that we cannot rely on the
+ assumption that those sets of airport codes are identical. </p>
+
+<div class="note note note_note"><span class="note__title notetitle">Note:</span>
+A slight digression for some performance tuning. Notice how the first
+<code class="ph codeph">SELECT DISTINCT DEST</code> query takes almost 40 seconds.
+We expect all queries on such a small data set, less than 2 GB, to
+take a few seconds at most. The reason is that the expression
+<code class="ph codeph">NOT IN (SELECT origin FROM airlines_external)</code>
+produces an intermediate result set of 123 million rows, then
+runs 123 million comparisons on each data node against the tiny set of destination airports.
+The way the <code class="ph codeph">NOT IN</code> operator works internally means that
+this intermediate result set with 123 million rows might be transmitted
+across the network to each data node in the cluster.
+Applying another <code class="ph codeph">DISTINCT</code> inside the <code class="ph codeph">NOT IN</code>
+subquery means that the intermediate result set is only 340 items,
+resulting in much less network traffic and fewer comparison operations.
+The more efficient query with the added <code class="ph codeph">DISTINCT</code> is approximately 7 times as fast.
+</div>
+
+<pre class="pre codeblock"><code>[localhost:21000] > select ndv(carrier), ndv(flight_num), ndv(tail_num),
+ > ndv(origin), ndv(dest) from airlines_external;
++--------------+-----------------+---------------+-------------+-----------+
+| ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
++--------------+-----------------+---------------+-------------+-----------+
+| 29 | 9086 | 3 | 340 | 347 |
++--------------+-----------------+---------------+-------------+-----------+
+[localhost:21000] > select tail_num, count(*) as howmany from airlines_external
+ > group by tail_num;
++----------+-----------+
+| tail_num | howmany |
++----------+-----------+
+| 715 | 1 |
+| 0 | 406405 |
+| 112 | 6562 |
+| NULL | 123122001 |
++----------+-----------+
+Fetched 1 row(s) in 5.18s
+[localhost:21000] > select distinct dest from airlines_external
+ > where dest not in (select origin from airlines_external);
++------+
+| dest |
++------+
+| LBF |
+| CBM |
+| RCA |
+| SKA |
+| LAR |
++------+
+Fetched 5 row(s) in 39.64s
+[localhost:21000] > select distinct dest from airlines_external
+ > where dest not in (select distinct origin from airlines_external);
++------+
+| dest |
++------+
+| LBF |
+| RCA |
+| CBM |
+| SKA |
+| LAR |
++------+
+Fetched 5 row(s) in 5.59s
+[localhost:21000] > select distinct origin from airlines_external
+ > where origin not in (select distinct dest from airlines_external);
+Fetched 0 row(s) in 5.37s
+</code></pre>
+
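+<p class="p">
+If we also want an exact figure for one particular column, a single <code class="ph codeph">COUNT(DISTINCT)</code>
+call is still available. The statement below is only a sketch of that alternative (its output is not shown, and
+the choice of column is purely illustrative); because <code class="ph codeph">COUNT(DISTINCT)</code> expressions for
+several different columns cannot be combined in one statement, the exploration above relies on
+<code class="ph codeph">NDV()</code> instead.
+</p>
+
+<pre class="pre codeblock"><code>-- Illustrative only: exact (not estimated) distinct count for a single column.
+-- Unlike NDV(), COUNT(DISTINCT) expressions for different columns cannot be
+-- combined in one statement, so we would check one column at a time this way.
+select count(distinct origin) as exact_origin_airports from airlines_external;
+</code></pre>
+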
+<p class="p"> Next, we try doing a simple calculation, with results broken down by year.
+ This reveals that some years have no data in the
+ <code class="ph codeph">AIRTIME</code> column. That means we might be able to use
+ that column in queries involving certain date ranges, but we cannot
+ count on it to always be reliable. The question of whether a column
+ contains any <code class="ph codeph">NULL</code> values, and if so what is their
+ number, proportion, and distribution, comes up again and again when
+ doing initial exploration of a data set. </p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > select year, sum(airtime) from airlines_external
+ > group by year order by year desc;
++------+--------------+
+| year | sum(airtime) |
++------+--------------+
+| 2008 | 713050445 |
+| 2007 | 748015545 |
+| 2006 | 720372850 |
+| 2005 | 708204026 |
+| 2004 | 714276973 |
+| 2003 | 665706940 |
+| 2002 | 549761849 |
+| 2001 | 590867745 |
+| 2000 | 583537683 |
+| 1999 | 561219227 |
+| 1998 | 538050663 |
+| 1997 | 536991229 |
+| 1996 | 519440044 |
+| 1995 | 513364265 |
+| 1994 | NULL |
+| 1993 | NULL |
+| 1992 | NULL |
+| 1991 | NULL |
+| 1990 | NULL |
+| 1989 | NULL |
+| 1988 | NULL |
+| 1987 | NULL |
++------+--------------+
+</code></pre>
+
+<p class="p">
+With the notion of <code class="ph codeph">NULL</code> values in mind, let's come back to the <code class="ph codeph">TAIL_NUM</code>
+column that we discovered had a lot of <code class="ph codeph">NULL</code>s.
+Let's quantify the <code class="ph codeph">NULL</code> and non-<code class="ph codeph">NULL</code> values in that column for better understanding.
+First, we just count the overall number of rows versus the non-<code class="ph codeph">NULL</code> values in that column.
+That initial result gives the appearance of relatively few non-<code class="ph codeph">NULL</code> values, but we can break
+it down more clearly in a single query.
+Once we have the <code class="ph codeph">COUNT(*)</code> and the <code class="ph codeph">COUNT(<var class="keyword varname">colname</var>)</code> numbers,
+we can encode that initial query in a <code class="ph codeph">WITH</code> clause, then run a follow-on query that performs
+multiple arithmetic operations on those values.
+Seeing that only one-third of one percent of all rows have non-<code class="ph codeph">NULL</code> values for the
+<code class="ph codeph">TAIL_NUM</code> column clearly illustrates that this column is not of much use.
+</p>
+
+<pre class="pre codeblock"><code>[localhost:21000] > select count(*) as 'rows', count(tail_num) as 'non-null tail numbers'
+ > from airlines_external;
++-----------+-----------------------+
+| rows | non-null tail numbers |
++-----------+-----------------------+
+| 123534969 | 412968 |
++-----------+-----------------------+
+Fetched 1 row(s) in 1.51s
+[localhost:21000] > with t1 as
+ > (select count(*) as 'rows', count(tail_num) as 'nonnull'
+ > from airlines_external)
+ > select `rows`, `nonnull`, `rows` - `nonnull` as 'nulls',
+ > (`nonnull` / `rows`) * 100 as 'percentage non-null'
+ > from t1;
++-----------+---------+-----------+---------------------+
+| rows | nonnull | nulls | percentage non-null |
++-----------+---------+-----------+---------------------+
+| 123534969 | 412968 | 123122001 | 0.3342923897119365 |
++-----------+---------+-----------+---------------------+
+</code></pre>
+
+<p class="p">
+By examining other columns using these techniques, we can form a mental picture of the way data is distributed
+throughout the table, and which columns are most significant for query purposes. For this tutorial, we focus mostly on
+the fields likely to hold discrete values, rather than columns such as <code class="ph codeph">ACTUAL_ELAPSED_TIME</code>
+whose names suggest they hold measurements. We would dig deeper into those columns once we had a clear picture
+of which questions were worthwhile to ask, and what kinds of trends we might look for.
+For the final piece of initial exploration, let's look at the <code class="ph codeph">YEAR</code> column.
+A simple <code class="ph codeph">GROUP BY</code> query shows that it has a well-defined range, a manageable number of +distinct values, and relatively even distribution of rows across the different years. +</p> + +<pre class="pre codeblock"><code>[localhost:21000] > select min(year), max(year), ndv(year) from airlines_external; ++-----------+-----------+-----------+ +| min(year) | max(year) | ndv(year) | ++-----------+-----------+-----------+ +| 1987 | 2008 | 22 | ++-----------+-----------+-----------+ +Fetched 1 row(s) in 2.03s +[localhost:21000] > select year, count(*) howmany from airlines_external + > group by year order by year desc; ++------+---------+ +| year | howmany | ++------+---------+ +| 2008 | 7009728 | +| 2007 | 7453215 | +| 2006 | 7141922 | +| 2005 | 7140596 | +| 2004 | 7129270 | +| 2003 | 6488540 | +| 2002 | 5271359 | +| 2001 | 5967780 | +| 2000 | 5683047 | +| 1999 | 5527884 | +| 1998 | 5384721 | +| 1997 | 5411843 | +| 1996 | 5351983 | +| 1995 | 5327435 | +| 1994 | 5180048 | +| 1993 | 5070501 | +| 1992 | 5092157 | +| 1991 | 5076925 | +| 1990 | 5270893 | +| 1989 | 5041200 | +| 1988 | 5202096 | +| 1987 | 1311826 | ++------+---------+ +Fetched 22 row(s) in 2.13s +</code></pre> + +<p class="p"> +We could go quite far with the data in this initial raw format, just as we downloaded it from the web. +If the data set proved to be useful and worth persisting in Impala for extensive queries, +we might want to copy it to an internal table, letting Impala manage the data files and perhaps +reorganizing a little for higher efficiency. +In this next stage of the tutorial, we copy the original data into a partitioned table, still in Parquet format. +Partitioning based on the <code class="ph codeph">YEAR</code> column lets us run queries with clauses such as <code class="ph codeph">WHERE year = 2001</code> +or <code class="ph codeph">WHERE year BETWEEN 1989 AND 1999</code>, which can dramatically cut down on I/O by +ignoring all the data from years outside the desired range. +R <TRUNCATED>
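+<p class="p">
+As a rough sketch only (the table name <code class="ph codeph">airlines_partitioned</code> and the abbreviated
+column list below are placeholders rather than the tutorial's exact statements), a partitioned copy follows this
+general pattern: create a Parquet table whose <code class="ph codeph">PARTITIONED BY</code> clause names the
+<code class="ph codeph">YEAR</code> column, then <code class="ph codeph">INSERT</code> from the external table with a
+<code class="ph codeph">PARTITION (year)</code> clause so that each year's rows are written to their own partition directory.
+</p>
+
+<pre class="pre codeblock"><code>-- Hypothetical sketch of the partitioned copy; column list abbreviated for brevity.
+create table airlines_partitioned (
+  month int,
+  day int,
+  carrier string,
+  origin string,
+  dest string,
+  airtime int
+  -- ...remaining columns from AIRLINES_EXTERNAL...
+)
+partitioned by (year int)
+stored as parquet;
+
+-- Dynamic partitioning: the trailing YEAR value in the select list determines
+-- which partition each row is written to.
+insert into airlines_partitioned partition (year)
+  select month, day, carrier, origin, dest, airtime, year
+  from airlines_external;
+</code></pre>
+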
