http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_porting.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_porting.xml b/docs/topics/impala_porting.xml new file mode 100644 index 0000000..c9c8e52 --- /dev/null +++ b/docs/topics/impala_porting.xml @@ -0,0 +1,622 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="porting"> + + <title>Porting SQL from Other Database Systems to Impala</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Databases"/> + <data name="Category" value="Hive"/> + <data name="Category" value="Oracle"/> + <data name="Category" value="MySQL"/> + <data name="Category" value="PostgreSQL"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Porting"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">porting</indexterm> + Although Impala uses standard SQL for queries, you might need to modify SQL source when bringing applications + to Impala, due to variations in data types, built-in functions, vendor language extensions, and + Hadoop-specific syntax. Even when SQL is working correctly, you might make further minor modifications for + best performance. + </p> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="porting_ddl_dml"> + + <title>Porting DDL and DML Statements</title> + + <conbody> + + <p> + When adapting SQL code from a traditional database system to Impala, expect to find a number of differences + in the DDL statements that you use to set up the schema. Clauses related to physical layout of files, + tablespaces, and indexes have no equivalent in Impala. 
You might restructure your schema considerably to + account for the Impala partitioning scheme and Hadoop file formats. + </p> + + <p> + Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor + extensions and features not yet supported in Impala, you might be able to run identical or almost-identical + query text on both systems. + </p> + + <p> + Therefore, consider separating out the DDL into a separate Impala-specific setup script. Focus your reuse + and ongoing tuning efforts on the code for SQL queries. + </p> + </conbody> + </concept> + + <concept id="porting_data_types"> + + <title>Porting Data Types from Other Database Systems</title> + + <conbody> + + <ul> + <li> + <p> + Change any <codeph>VARCHAR</codeph>, <codeph>VARCHAR2</codeph>, and <codeph>CHAR</codeph> columns to + <codeph>STRING</codeph>. Remove any length constraints from the column declarations; for example, + change <codeph>VARCHAR(32)</codeph> or <codeph>CHAR(1)</codeph> to <codeph>STRING</codeph>. Impala is + very flexible about the length of string values; it does not impose any length constraints + or do any special processing (such as blank-padding) for <codeph>STRING</codeph> columns. + (In Impala 2.0 and higher, there are data types <codeph>VARCHAR</codeph> and <codeph>CHAR</codeph>, + with length constraints for both types and blank-padding for <codeph>CHAR</codeph>. + However, for performance reasons, it is still preferable to use <codeph>STRING</codeph> + columns where practical.) + </p> + </li> + + <li> + <p> + For national language character types such as <codeph>NCHAR</codeph>, <codeph>NVARCHAR</codeph>, or + <codeph>NCLOB</codeph>, be aware that while Impala can store and query UTF-8 character data, currently + some string manipulation operations only work correctly with ASCII data. See + <xref href="impala_string.xml#string"/> for details. 
+ </p> + </li> + + <li> + <p> + Change any <codeph>DATE</codeph>, <codeph>DATETIME</codeph>, or <codeph>TIME</codeph> columns to + <codeph>TIMESTAMP</codeph>. Remove any precision constraints. Remove any timezone clauses, and make + sure your application logic or ETL process accounts for the fact that Impala expects all + <codeph>TIMESTAMP</codeph> values to be in + <xref href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time" scope="external" format="html">Coordinated + Universal Time (UTC)</xref>. See <xref href="impala_timestamp.xml#timestamp"/> for information about + the <codeph>TIMESTAMP</codeph> data type, and + <xref href="impala_datetime_functions.xml#datetime_functions"/> for conversion functions for different + date and time formats. + </p> + <p> + You might also need to adapt date- and time-related literal values and format strings to use the + supported Impala date and time formats. If you have date and time literals with different separators or + different numbers of <codeph>YY</codeph>, <codeph>MM</codeph>, and so on placeholders than Impala + expects, consider using calls to <codeph>regexp_replace()</codeph> to transform those values to the + Impala-compatible format. See <xref href="impala_timestamp.xml#timestamp"/> for information about the + allowed formats for date and time literals, and + <xref href="impala_string_functions.xml#string_functions"/> for string conversion functions such as + <codeph>regexp_replace()</codeph>. + </p> + <p> + Instead of <codeph>SYSDATE</codeph>, call the function <codeph>NOW()</codeph>. + </p> + <p> + Instead of adding or subtracting directly from a date value to produce a value <varname>N</varname> + days in the past or future, use an <codeph>INTERVAL</codeph> expression, for example <codeph>NOW() + + INTERVAL 30 DAYS</codeph>. 
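+ </p>
+ <p>
+ For example, a query along these lines (the <codeph>events</codeph> table and
+ <codeph>event_date</codeph> column are illustrative) replaces Oracle-style date arithmetic
+ such as <codeph>SYSDATE - 7</codeph>:
+ </p>
+<codeblock>-- Rows from the last 7 days; NOW() - INTERVAL 7 DAYS replaces SYSDATE - 7.
+SELECT * FROM events WHERE event_date > NOW() - INTERVAL 7 DAYS;</codeblock>
+ <p>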
+ </p> + </li> + + <li> + <p> + Although Impala supports <codeph>INTERVAL</codeph> expressions for datetime arithmetic, as shown in + <xref href="impala_timestamp.xml#timestamp"/>, <codeph>INTERVAL</codeph> is not available as a column + data type in Impala. For any <codeph>INTERVAL</codeph> values stored in tables, convert them to numeric + values that you can add or subtract using the functions in + <xref href="impala_datetime_functions.xml#datetime_functions"/>. For example, if you had a table + <codeph>DEADLINES</codeph> with an <codeph>INT</codeph> column <codeph>TIME_PERIOD</codeph>, you could + construct dates N days in the future like so: + </p> +<codeblock>SELECT NOW() + INTERVAL time_period DAYS from deadlines;</codeblock> + </li> + + <li> + <p> + For <codeph>YEAR</codeph> columns, change to the smallest Impala integer type that has sufficient + range. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on + for the various numeric data types. + </p> + </li> + + <li> + <p> + Change any <codeph>DECIMAL</codeph> and <codeph>NUMBER</codeph> types. If fixed-point precision is not + required, you can use <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> on the Impala side depending on + the range of values. For applications that require precise decimal values, such as financial data, you + might need to make more extensive changes to table structure and application logic, such as using + separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs + to manipulate them. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, + casting, and so on for the various numeric data types. + </p> + </li> + + <li> + <p> + <codeph>FLOAT</codeph>, <codeph>DOUBLE</codeph>, and <codeph>REAL</codeph> types are supported in + Impala. Remove any precision and scale specifications. 
(In Impala, <codeph>REAL</codeph> is just an
+ alias for <codeph>DOUBLE</codeph>; columns declared as <codeph>REAL</codeph> are turned into
+ <codeph>DOUBLE</codeph> behind the scenes.) See <xref href="impala_datatypes.xml#datatypes"/> for
+ details about ranges, casting, and so on for the various numeric data types.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Most integer types from other systems have equivalents in Impala, perhaps under different names such as
+ <codeph>BIGINT</codeph> instead of <codeph>INT8</codeph>. For any that are unavailable, for example
+ <codeph>MEDIUMINT</codeph>, switch to the smallest Impala integer type that has sufficient range.
+ Remove any precision specifications. See <xref href="impala_datatypes.xml#datatypes"/> for details
+ about ranges, casting, and so on for the various numeric data types.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Remove any <codeph>UNSIGNED</codeph> constraints. All Impala numeric types are signed. See
+ <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on for the
+ various numeric data types.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ For any types holding bitwise values, use an integer type with enough range to hold all the relevant
+ bits within a positive integer. See <xref href="impala_datatypes.xml#datatypes"/> for details about
+ ranges, casting, and so on for the various numeric data types.
+ </p>
+ <p>
+ For example, <codeph>TINYINT</codeph> has a maximum positive value of 127, not 255, so to manipulate
+ 8-bit bitfields as positive numbers, switch to the next larger type, <codeph>SMALLINT</codeph>. 
+ </p>
+<codeblock>[localhost:21000] > select cast(127*2 as tinyint);
++--------------------------+
+| cast(127 * 2 as tinyint) |
++--------------------------+
+| -2                       |
++--------------------------+
+[localhost:21000] > select cast(128 as tinyint);
++----------------------+
+| cast(128 as tinyint) |
++----------------------+
+| -128                 |
++----------------------+
+[localhost:21000] > select cast(127*2 as smallint);
++---------------------------+
+| cast(127 * 2 as smallint) |
++---------------------------+
+| 254                       |
++---------------------------+</codeblock>
+ <p>
+ Impala does not support notation such as <codeph>b'0101'</codeph> for bit literals.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ For large object types, use <codeph>STRING</codeph> to represent <codeph>CLOB</codeph> or
+ <codeph>TEXT</codeph> columns (character-based large objects) up to 32 KB in size. Binary large objects
+ such as <codeph>BLOB</codeph>, <codeph>RAW</codeph>, <codeph>BINARY</codeph>, and
+ <codeph>VARBINARY</codeph> do not currently have an equivalent in Impala.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ For Boolean-like types such as <codeph>BOOL</codeph>, use the Impala <codeph>BOOLEAN</codeph> type.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Because Impala currently does not support composite or nested types, any spatial data types in other
+ database systems do not have direct equivalents in Impala. You could represent spatial values in string
+ format and write UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. Where
+ practical, separate spatial types into separate tables so that Impala can still work with the
+ non-spatial data.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Take out any <codeph>DEFAULT</codeph> clauses. Impala can use data files produced from many different
+ sources, such as Pig, Hive, or MapReduce jobs. 
The fast import mechanisms of <codeph>LOAD DATA</codeph> + and external tables mean that Impala is flexible about the format of data files, and Impala does not + necessarily validate or cleanse data before querying it. When copying data through Impala + <codeph>INSERT</codeph> statements, you can use conditional functions such as <codeph>CASE</codeph> or + <codeph>NVL</codeph> to substitute some other value for <codeph>NULL</codeph> fields; see + <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. + </p> + </li> + + <li> + <p> + Take out any constraints from your <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> + statements, for example <codeph>PRIMARY KEY</codeph>, <codeph>FOREIGN KEY</codeph>, + <codeph>UNIQUE</codeph>, <codeph>NOT NULL</codeph>, <codeph>UNSIGNED</codeph>, or + <codeph>CHECK</codeph> constraints. Impala can use data files produced from many different sources, + such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen + earlier during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries + to test for <codeph>NULL</codeph> values. When copying data through Impala <codeph>INSERT</codeph> + statements, you can use conditional functions such as <codeph>CASE</codeph> or <codeph>NVL</codeph> to + substitute some other value for <codeph>NULL</codeph> fields; see + <xref href="impala_conditional_functions.xml#conditional_functions"/> for details. + </p> + <p> + Do as much verification as practical before loading data into Impala. After data is loaded into Impala, + you can do further verification using SQL queries to check if values have expected ranges, if values + are <codeph>NULL</codeph> or not, and so on. If there is a problem with the data, you will need to + re-run earlier stages of the ETL process, or do an <codeph>INSERT ... 
SELECT</codeph> statement in
+ Impala to copy the faulty data to a new table and transform or filter out the bad values.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Take out any <codeph>CREATE INDEX</codeph>, <codeph>DROP INDEX</codeph>, and <codeph>ALTER
+ INDEX</codeph> statements, and equivalent <codeph>ALTER TABLE</codeph> statements. Remove any
+ <codeph>INDEX</codeph>, <codeph>KEY</codeph>, or <codeph>PRIMARY KEY</codeph> clauses from
+ <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. Impala is optimized for bulk
+ read operations for data warehouse-style queries, and therefore does not support indexes for its
+ tables.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Calls to built-in functions with out-of-range or otherwise incorrect arguments return
+ <codeph>NULL</codeph> in Impala as opposed to raising exceptions. (This rule applies even when the
+ <codeph>ABORT_ON_ERROR=true</codeph> query option is in effect.) Run small-scale queries using
+ representative data to doublecheck that calls to built-in functions are returning expected values
+ rather than <codeph>NULL</codeph>. For example, unsupported <codeph>CAST</codeph> operations do not
+ raise an error in Impala:
+ </p>
+<codeblock>select cast('foo' as int);
++--------------------+
+| cast('foo' as int) |
++--------------------+
+| NULL               |
++--------------------+</codeblock>
+ </li>
+
+ <li>
+ <p>
+ For any other types not supported in Impala, you could represent their values in string format and write
+ UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ To detect the presence of unsupported or unconvertible data types in data files, do initial testing
+ with the <codeph>ABORT_ON_ERROR=true</codeph> query option in effect. This option causes queries to
+ fail immediately if they encounter disallowed type conversions. See
+ <xref href="impala_abort_on_error.xml#abort_on_error"/> for details. 
For example: + </p> +<codeblock>set abort_on_error=true; +select count(*) from (select * from t1); +-- The above query will fail if the data files for T1 contain any +-- values that can't be converted to the expected Impala data types. +-- For example, if T1.C1 is defined as INT but the column contains +-- floating-point values like 1.1, the query will return an error.</codeblock> + </li> + </ul> + </conbody> + </concept> + + <concept id="porting_statements"> + + <title>SQL Statements to Remove or Adapt</title> + + <conbody> + + <p> + Some SQL statements or clauses that you might be familiar with are not currently supported in Impala: + </p> + + <ul> + <li> + <p> + Impala has no <codeph>DELETE</codeph> statement. Impala is intended for data warehouse-style operations + where you do bulk moves and transforms of large quantities of data. Instead of using + <codeph>DELETE</codeph>, use <codeph>INSERT OVERWRITE</codeph> to entirely replace the contents of a + table or partition, or use <codeph>INSERT ... SELECT</codeph> to copy a subset of data (everything but + the rows you intended to delete) from one table to another. See <xref href="impala_dml.xml#dml"/> for + an overview of Impala DML statements. + </p> + </li> + + <li> + <p> + Impala has no <codeph>UPDATE</codeph> statement. Impala is intended for data warehouse-style operations + where you do bulk moves and transforms of large quantities of data. Instead of using + <codeph>UPDATE</codeph>, do all necessary transformations early in the ETL process, such as in the job + that generates the original data, or when copying from one table to another to convert to a particular + file format or partitioning scheme. See <xref href="impala_dml.xml#dml"/> for an overview of Impala DML + statements. + </p> + </li> + + <li> + <p> + Impala has no transactional statements, such as <codeph>COMMIT</codeph> or <codeph>ROLLBACK</codeph>. 
+ Impala effectively works like the <codeph>AUTOCOMMIT</codeph> mode in some database systems, where + changes take effect as soon as they are made. + </p> + </li> + + <li> + <p> + If your database, table, column, or other names conflict with Impala reserved words, use different + names or quote the names with backticks. See <xref href="impala_reserved_words.xml#reserved_words"/> + for the current list of Impala reserved words. + </p> + <p> + Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or + column alias. For example, in <codeph>SELECT * FROM t1 NATURAL JOIN t2</codeph>, Impala does not + recognize the <codeph>NATURAL</codeph> keyword and interprets it as an alias for the table + <codeph>t1</codeph>. If you experience any unexpected behavior with queries, check the list of reserved + words to make sure all keywords in join and <codeph>WHERE</codeph> clauses are recognized. + </p> + </li> + + <li> + <p> + Impala supports subqueries only in the <codeph>FROM</codeph> clause of a query, not within the + <codeph>WHERE</codeph> clauses. Therefore, you cannot use clauses such as <codeph>WHERE + <varname>column</varname> IN (<varname>subquery</varname>)</codeph>. Also, Impala does not allow + <codeph>EXISTS</codeph> or <codeph>NOT EXISTS</codeph> clauses (although <codeph>EXISTS</codeph> is a + reserved keyword). + </p> + </li> + + <li> + <p> + Impala supports <codeph>UNION</codeph> and <codeph>UNION ALL</codeph> set operators, but not + <codeph>INTERSECT</codeph>. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/> + </p> + </li> + + <li> + <p> + Within queries, Impala requires query aliases for any subqueries: + </p> +<codeblock>-- Without the alias 'contents_of_t1' at the end, query gives syntax error. 
+select count(*) from (select * from t1) contents_of_t1;</codeblock> + </li> + + <li> + <p> + When an alias is declared for an expression in a query, that alias cannot be referenced again within + the same query block: + </p> +<codeblock>-- Can't reference AVERAGE twice in the SELECT list where it's defined. +select avg(x) as average, average+1 from t1 group by x; +ERROR: AnalysisException: couldn't resolve column reference: 'average' + +-- Although it can be referenced again later in the same query. +select avg(x) as average from t1 group by x having average > 3;</codeblock> + <p> + For Impala, either repeat the expression again, or abstract the expression into a <codeph>WITH</codeph> + clause, creating named columns that can be referenced multiple times anywhere in the base query: + </p> +<codeblock>-- The following 2 query forms are equivalent. +select avg(x) as average, avg(x)+1 from t1 group by x; +with avg_t as (select avg(x) average from t1 group by x) select average, average+1 from avg_t;</codeblock> +<!-- An alternative bunch of queries to use in the example above. +[localhost:21000] > select x*x as x_squared from t1; + +[localhost:21000] > select x*x as x_squared from t1 where x_squared < 100; +ERROR: AnalysisException: couldn't resolve column reference: 'x_squared' +[localhost:21000] > select x*x as x_squared, x_squared * pi() as pi_x_squared from t1; +ERROR: AnalysisException: couldn't resolve column reference: 'x_squared' +[localhost:21000] > select x*x as x_squared from t1 group by x_squared; + +[localhost:21000] > select x*x as x_squared from t1 group by x_squared having x_squared < 100; +--> + </li> + + <li> + <p> + Impala does not support certain rarely used join types that are less appropriate for high-volume tables + used for data warehousing. In some cases, Impala supports join types but requires explicit syntax to + ensure you do not do inefficient joins of huge tables by accident. 
For example, Impala does not support + natural joins or anti-joins, and requires the <codeph>CROSS JOIN</codeph> operator for Cartesian + products. See <xref href="impala_joins.xml#joins"/> for details on the syntax for Impala join clauses. + </p> + </li> + + <li> + <p> + Impala has a limited choice of partitioning types. Partitions are defined based on each distinct + combination of values for one or more partition key columns. Impala does not redistribute or check data + to create evenly distributed partitions; you must choose partition key columns based on your knowledge + of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning + to use the Impala partition syntax for <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> + statements. Impala partitioning is similar to range partitioning where every range has exactly one + value, or key partitioning where the hash function produces a separate bucket for every combination of + key values. See <xref href="impala_partitioning.xml#partitioning"/> for usage details, and + <xref href="impala_create_table.xml#create_table"/> and + <xref href="impala_alter_table.xml#alter_table"/> for syntax. + </p> + <note> + Because the number of separate partitions is potentially higher than in other database systems, keep a + close eye on the number of partitions and the volume of data in each one; scale back the number of + partition key columns if you end up with too many partitions with a small volume of data in each one. + Remember, to distribute work for a query across a cluster, you need at least one HDFS block per node. + HDFS blocks are typically multiple megabytes, <ph rev="parquet_block_size">especially</ph> for Parquet + files. Therefore, if each partition holds only a few megabytes of data, you are unlikely to see much + parallelism in the query because such a small amount of data is typically processed by a single node. 
+ </note> + </li> + + <li> + <p> + For <q>top-N</q> queries, Impala uses the <codeph>LIMIT</codeph> clause rather than comparing against a + pseudocolumn named <codeph>ROWNUM</codeph> or <codeph>ROW_NUM</codeph>. See + <xref href="impala_limit.xml#limit"/> for details. + </p> + </li> + </ul> + </conbody> + </concept> + + <concept id="porting_antipatterns"> + + <title>SQL Constructs to Doublecheck</title> + + <conbody> + + <p> + Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than + optimal performance. Also, sometimes machine-generated SQL, perhaps issued through JDBC or ODBC + applications, might have inefficiencies or exceed internal Impala limits. As you port SQL code, be alert + and change these things where appropriate: + </p> + + <ul> + <li> + <p> + A <codeph>CREATE TABLE</codeph> statement with no <codeph>STORED AS</codeph> clause creates data files + in plain text format, which is convenient for data interchange but not a good choice for high-volume + data with high-performance queries. See <xref href="impala_file_formats.xml#file_formats"/> for why and + how to use specific file formats for compact data and high-performance queries. Especially see + <xref href="impala_parquet.xml#parquet"/>, for details about the file format most heavily optimized for + large-scale data warehouse queries. + </p> + </li> + + <li> + <p> + A <codeph>CREATE TABLE</codeph> statement with no <codeph>PARTITIONED BY</codeph> clause stores all the + data files in the same physical location, which can lead to scalability problems when the data volume + becomes large. + </p> + <p> + On the other hand, adapting tables that were already partitioned in a different database system could + produce an Impala table with a high number of partitions and not enough data in each one, leading to + underutilization of Impala's parallel query features. 
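+ </p>
+ <p>
+ For example, a table that was range-partitioned by date in another database system might be
+ adapted like this (the table and column names are illustrative); partitioning by year and
+ month, rather than by individual day, helps keep a reasonable volume of data in each partition:
+ </p>
+<codeblock>-- Each distinct (year, month) combination becomes one partition.
+CREATE TABLE sales (id BIGINT, amount DOUBLE)
+  PARTITIONED BY (year INT, month INT);</codeblock>
+ <p>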
+ </p> + <p> + See <xref href="impala_partitioning.xml#partitioning"/> for details about setting up partitioning and + tuning the performance of queries on partitioned tables. + </p> + </li> + + <li> + <p> + The <codeph>INSERT ... VALUES</codeph> syntax is suitable for setting up toy tables with a few rows for + functional testing, but because each such statement creates a separate tiny file in HDFS, it is not a + scalable technique for loading megabytes or gigabytes (let alone petabytes) of data. Consider revising + your data load process to produce raw data files outside of Impala, then setting up Impala external + tables or using the <codeph>LOAD DATA</codeph> statement to use those data files instantly in Impala + tables, with no conversion or indexing stage. See <xref href="impala_tables.xml#external_tables"/> and + <xref href="impala_load_data.xml#load_data"/> for details about the Impala techniques for working with + data files produced outside of Impala; see <xref href="impala_tutorial.xml#tutorial_etl"/> for examples + of ETL workflow for Impala. + </p> + </li> + + <li> + <p> + If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data + files, or a single giant data file that cannot take advantage of distributed parallel queries or + partitioning. In this case, use an <codeph>INSERT ... SELECT</codeph> statement to copy the data into a + new table and reorganize into a more efficient layout in the same operation. See + <xref href="impala_insert.xml#insert"/> for details about the <codeph>INSERT</codeph> statement. + </p> + <p> + You can do <codeph>INSERT ... SELECT</codeph> into a table with a more efficient file format (see + <xref href="impala_file_formats.xml#file_formats"/>) or from an unpartitioned table into a partitioned + one (see <xref href="impala_partitioning.xml#partitioning"/>). 
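+ </p>
+ <p>
+ For example, the following statements (with illustrative table and column names) copy data
+ from a text-format table into a partitioned Parquet table in a single operation:
+ </p>
+<codeblock>CREATE TABLE sales_parquet (id BIGINT, amount DOUBLE)
+  PARTITIONED BY (year INT) STORED AS PARQUET;
+-- The partition key column goes last in the SELECT list.
+INSERT INTO sales_parquet PARTITION (year)
+  SELECT id, amount, year FROM sales_text;</codeblock>
+ <p>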
+ </p> + </li> + + <li> + <p> + The number of expressions allowed in an Impala query might be smaller than for some other database + systems, causing failures for very complicated queries (typically produced by automated SQL + generators). Where practical, keep the number of expressions in the <codeph>WHERE</codeph> clauses to + approximately 2000 or fewer. As a workaround, set the query option + <codeph>DISABLE_CODEGEN=true</codeph> if queries fail for this reason. See + <xref href="impala_disable_codegen.xml#disable_codegen"/> for details. + </p> + </li> + + <li> + <p> + If practical, rewrite <codeph>UNION</codeph> queries to use the <codeph>UNION ALL</codeph> operator + instead. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/> + </p> + </li> + </ul> + </conbody> + </concept> + + <concept id="porting_next"> + + <title>Next Porting Steps after Verifying Syntax and Semantics</title> + + <conbody> + + <p> + Throughout this section, some of the decisions you make during the porting process also have a substantial + impact on performance. After your SQL code is ported and working correctly, doublecheck the + performance-related aspects of your schema design, physical layout, and queries to make sure that the + ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and + integration with Hadoop components. + </p> + + <ul> + <li> + Have you run the <codeph>COMPUTE STATS</codeph> statement on each table involved in join queries? Have + you also run <codeph>COMPUTE STATS</codeph> for each table used as the source table in an <codeph>INSERT + ... SELECT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> statement? + </li> + + <li> + Are you using the most efficient file format for your data volumes, table structure, and query + characteristics? + </li> + + <li> + Are you using partitioning effectively? 
That is, have you partitioned on columns that are often used for + filtering in <codeph>WHERE</codeph> clauses? Have you partitioned at the right granularity so that there + is enough data in each partition to parallelize the work for each query? + </li> + + <li> + Does your ETL process produce a relatively small number of multi-megabyte data files (good) rather than a + huge number of small files (bad)? + </li> + </ul> + + <p> + See <xref href="impala_performance.xml#performance"/> for details about the whole performance tuning + process. + </p> + </conbody> + </concept> +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_query_options.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_query_options.xml b/docs/topics/impala_query_options.xml new file mode 100644 index 0000000..1011746 --- /dev/null +++ b/docs/topics/impala_query_options.xml @@ -0,0 +1,75 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="query_options"> + + <title>Query Options for the SET Statement</title> + <prolog> + <metadata> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Impala"/> + <data name="Category" value="impala-shell"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Configuring"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + </metadata> + </prolog> + + <conbody> + + <p> + You can specify the following options using the <codeph>SET</codeph> statement, and those settings affect all + queries issued from that session. + </p> + + <p> + Some query options are useful in day-to-day operations for improving usability, performance, or flexibility. + </p> + + <p> + Other query options control special-purpose aspects of Impala operation and are intended primarily for + advanced debugging or troubleshooting. + </p> + + <p> + Options with Boolean parameters can be set to 1 or <codeph>true</codeph> to enable, or 0 or <codeph>false</codeph> + to turn off. + </p> + + <note rev="2.0.0"> + In Impala 2.0 and later, you can set query options directly through the JDBC and ODBC interfaces by using the + <codeph>SET</codeph> statement. Formerly, <codeph>SET</codeph> was only available as a command within the + <cmdname>impala-shell</cmdname> interpreter. 
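+ For example, either form of a Boolean option works through any of these interfaces:
+<codeblock>SET ABORT_ON_ERROR=true;
+SET ABORT_ON_ERROR=1;</codeblock>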
+ </note> + +<!-- This is the list including defaults from the pre-release 1.2 impala-shell: + ABORT_ON_DEFAULT_LIMIT_EXCEEDED: 0 + ABORT_ON_ERROR: 0 + ALLOW_UNSUPPORTED_FORMATS: 0 + BATCH_SIZE: 0 + DEBUG_ACTION: + DEFAULT_ORDER_BY_LIMIT: -1 + DISABLE_CODEGEN: 0 + HBASE_CACHE_BLOCKS: 0 + HBASE_CACHING: 0 + MAX_ERRORS: 0 + MAX_IO_BUFFERS: 0 + MAX_SCAN_RANGE_LENGTH: 0 + MEM_LIMIT: 0 + NUM_NODES: 0 + NUM_SCANNER_THREADS: 0 + PARQUET_COMPRESSION_CODEC: SNAPPY + PARQUET_FILE_SIZE: 0 + SUPPORT_START_OVER: false +--> + + <p outputclass="toc"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_set.xml#set"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_query_timeout_s.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_query_timeout_s.xml b/docs/topics/impala_query_timeout_s.xml new file mode 100644 index 0000000..41f2918 --- /dev/null +++ b/docs/topics/impala_query_timeout_s.xml @@ -0,0 +1,51 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.0.0" id="query_timeout_s"> + + <title>QUERY_TIMEOUT_S Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">QUERY_TIMEOUT_S query option</indexterm> + Sets the idle query timeout value for the session, in seconds. Queries that sit idle for longer than the + timeout value are automatically cancelled. If the system administrator specified the + <codeph>--idle_query_timeout</codeph> startup option, <codeph>QUERY_TIMEOUT_S</codeph> must be smaller than + or equal to the <codeph>--idle_query_timeout</codeph> value. 
+ </p> + + <note conref="../shared/impala_common.xml#common/timeout_clock_blurb"/> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>SET QUERY_TIMEOUT_S=<varname>seconds</varname>;</codeblock> + +<!-- Don't have a compelling example to show at this time because the 'idle' aspect only applies + when the client is careless and leaves the query open. Can't easily demonstrate in impala-shell. + + <p conref="/Content/impala_common_xi44078.xml#common/example_blurb"/> +--> + + <p> + <b>Type:</b> numeric + </p> + + <p> + <b>Default:</b> 0 (no timeout if <codeph>--idle_query_timeout</codeph> not in effect; otherwise, use + <codeph>--idle_query_timeout</codeph> value) + </p> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_timeouts.xml#timeouts"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_real.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_real.xml b/docs/topics/impala_real.xml new file mode 100644 index 0000000..e6430e3 --- /dev/null +++ b/docs/topics/impala_real.xml @@ -0,0 +1,46 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="real"> + + <title>REAL Data Type</title> + <titlealts><navtitle>REAL</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Data Types"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Schemas"/> + </metadata> + </prolog> + + <conbody> + + <p> + An alias for the <codeph>DOUBLE</codeph> data type. See <xref href="impala_double.xml#double"/> for details. 
+ </p> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + These examples show how you can use the type names <codeph>REAL</codeph> and <codeph>DOUBLE</codeph> + interchangeably, and behind the scenes Impala treats them always as <codeph>DOUBLE</codeph>. + </p> + +<codeblock>[localhost:21000] > create table r1 (x real); +[localhost:21000] > describe r1; ++------+--------+---------+ +| name | type | comment | ++------+--------+---------+ +| x | double | | ++------+--------+---------+ +[localhost:21000] > insert into r1 values (1.5), (cast (2.2 as double)); +[localhost:21000] > select cast (1e6 as real); ++---------------------------+ +| cast(1000000.0 as double) | ++---------------------------+ +| 1000000 | ++---------------------------+</codeblock> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_refresh.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_refresh.xml b/docs/topics/impala_refresh.xml new file mode 100644 index 0000000..ee022d5 --- /dev/null +++ b/docs/topics/impala_refresh.xml @@ -0,0 +1,234 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="refresh"> + + <title>REFRESH Statement</title> + <titlealts><navtitle>REFRESH</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="DDL"/> + <data name="Category" value="Tables"/> + <data name="Category" value="Hive"/> + <data name="Category" value="Metastore"/> + <data name="Category" value="ETL"/> + <data name="Category" value="Ingest"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">REFRESH statement</indexterm> + To accurately respond to queries, the Impala node that acts as the coordinator (the node to which you are + connected through 
<cmdname>impala-shell</cmdname>, JDBC, or ODBC) must have current metadata about those + databases and tables that are referenced in Impala queries. If you are not familiar with the way Impala uses + metadata and how it shares the same metastore database as Hive, see + <xref href="impala_hadoop.xml#intro_metastore"/> for background information. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>REFRESH [<varname>db_name</varname>.]<varname>table_name</varname></codeblock> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Use the <codeph>REFRESH</codeph> statement to load the latest metastore metadata and block location data for + a particular table in these scenarios: + </p> + + <ul> + <li> + After loading new data files into the HDFS data directory for the table. (Once you have set up an ETL + pipeline to bring data into Impala on a regular basis, this is typically the most frequent reason why + metadata needs to be refreshed.) + </li> + + <li> + After issuing <codeph>ALTER TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, or other + table-modifying SQL statement in Hive. + </li> + </ul> + + <p> + You only need to issue the <codeph>REFRESH</codeph> statement on the node to which you connect to issue + queries. The coordinator node divides the work among all the Impala nodes in a cluster, and sends read + requests for the correct HDFS blocks without relying on the metadata on the other nodes. + </p> + + <p> + <codeph>REFRESH</codeph> reloads the metadata for the table from the metastore database, and does an + incremental reload of the low-level block location data to account for any new data files added to the HDFS + data directory for the table. It is a low-overhead, single-table operation, specifically tuned for the common + scenario where new data files are added to HDFS. + </p> + + <p> + Only the metadata for the specified table is flushed. 
The table must already exist and be known to Impala, + either because the <codeph>CREATE TABLE</codeph> statement was run in Impala rather than Hive, or because a + previous <codeph>INVALIDATE METADATA</codeph> statement caused Impala to reload its entire metadata catalog. + </p> + + <note> + <p rev="1.2"> + In Impala 1.2 and higher, the catalog service broadcasts any changed metadata as a result of Impala + <codeph>ALTER TABLE</codeph>, <codeph>INSERT</codeph> and <codeph>LOAD DATA</codeph> statements to all + Impala nodes. Thus, the <codeph>REFRESH</codeph> statement is only required if you load data through Hive + or by manipulating data files in HDFS directly. See <xref href="impala_components.xml#intro_catalogd"/> for + more information on the catalog service. + </p> + <p rev="1.2.1"> + In Impala 1.2.1 and higher, another way to avoid inconsistency across nodes is to enable the + <codeph>SYNC_DDL</codeph> query option before performing a DDL statement or an <codeph>INSERT</codeph> or + <codeph>LOAD DATA</codeph>. + </p> + <p> + The functionality of the <codeph>REFRESH</codeph> statement has changed in Impala 1.1 and higher. Now the + table name is a required parameter. To flush the metadata for all tables, use the + <codeph><xref href="impala_invalidate_metadata.xml#invalidate_metadata">INVALIDATE METADATA</xref></codeph> + command. + </p> + <draft-comment translate="no"> Almost-identical wording here, under INVALIDATE METADATA, and in Release Notes :: New Features. Makes sense to conref. </draft-comment> + <p> + Because <codeph>REFRESH <varname>table_name</varname></codeph> only works for tables that Impala is already + aware of, when you create a new table in the Hive shell, you must enter <codeph>INVALIDATE + METADATA</codeph> with no table parameter before you can see the new table in + <cmdname>impala-shell</cmdname>. 
Once the table is known to Impala, you can issue <codeph>REFRESH + <varname>table_name</varname></codeph> as needed after you add more data files for that table. + </p> + </note> + + <p conref="../shared/impala_common.xml#common/refresh_vs_invalidate"/> + + <p> + A metadata update for an <codeph>impalad</codeph> instance <b>is</b> required if: + </p> + + <ul> + <li> + A metadata change occurs. + </li> + + <li> + <b>and</b> the change is made through Hive. + </li> + + <li> + <b>and</b> the change is made to a database to which clients such as the Impala shell or ODBC directly + connect. + </li> + </ul> + + <p rev="1.2"> + A metadata update for an Impala node is <b>not</b> required after you run <codeph>ALTER TABLE</codeph>, + <codeph>INSERT</codeph>, or other table-modifying statement in Impala rather than Hive. Impala handles the + metadata synchronization automatically through the catalog service. + </p> + + <p> + Database and table metadata is typically modified by: + </p> + + <ul> + <li> + Hive - through <codeph>ALTER</codeph>, <codeph>CREATE</codeph>, <codeph>DROP</codeph> or + <codeph>INSERT</codeph> operations. + </li> + + <li> + Impalad - through <codeph>CREATE TABLE</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>INSERT</codeph> + operations. <ph rev="1.2">In Impala 1.2 and higher, such changes are propagated to all Impala nodes by the + Impala catalog service.</ph> + </li> + </ul> + + <p> + <codeph>REFRESH</codeph> causes the metadata for that table to be immediately reloaded. For a huge table, + that process could take a noticeable amount of time; but doing the refresh up front avoids an unpredictable + delay later, for example if the next reference to the table is during a benchmark test. 
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+    <p>
+      The following example shows how you might use the <codeph>REFRESH</codeph> statement after manually adding
+      new HDFS data files to the Impala data directory for a table:
+    </p>
+
+<codeblock>[impalad-host:21000] > refresh t1;
+[impalad-host:21000] > refresh t2;
+[impalad-host:21000] > select * from t1;
+...
+[impalad-host:21000] > select * from t2;
+... </codeblock>
+
+    <p>
+      For more examples of using <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> with a
+      combination of Impala and Hive operations, see <xref href="impala_tutorial.xml#tutorial_impala_hive"/>.
+    </p>
+
+    <p>
+      <b>Related impalad options:</b>
+    </p>
+
+    <p>
+      In Impala 1.0, the <codeph>-r</codeph> option of <cmdname>impala-shell</cmdname> issued
+      <codeph>REFRESH</codeph> to reload metadata for all tables.
+    </p>
+
+    <p>
+      In Impala 1.1 and higher, this option issues <codeph>INVALIDATE METADATA</codeph> because
+      <codeph>REFRESH</codeph> now requires a table name parameter. Due to the expense of reloading the metadata
+      for all tables, the <cmdname>impala-shell</cmdname> <codeph>-r</codeph> option is not recommended for
+      day-to-day use in a production environment.
+    </p>
+
+    <p rev="1.2">
+      In Impala 1.2 and higher, the <codeph>-r</codeph> option is needed even less frequently, because metadata
+      changes caused by SQL statements in Impala are automatically broadcast to all Impala nodes.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
+    <p rev="CDH-19187">
+      The user ID that the <cmdname>impalad</cmdname> daemon runs under,
+      typically the <codeph>impala</codeph> user, must have execute
+      permissions for all the relevant directories holding table data.
+ (A table could have data spread across multiple directories, + or in unexpected paths, if it uses partitioning or + specifies a <codeph>LOCATION</codeph> attribute for + individual partitions or the entire table.) + Issues with permissions might not cause an immediate error for this statement, + but subsequent statements such as <codeph>SELECT</codeph> + or <codeph>SHOW TABLE STATS</codeph> could fail. + </p> + + <p conref="../shared/impala_common.xml#common/hdfs_blurb"/> + + <p> + The <codeph>REFRESH</codeph> command checks HDFS permissions of the underlying data files and directories, + caching this information so that a statement can be cancelled immediately if for example the + <codeph>impala</codeph> user does not have permission to write to the data directory for the table. Impala + reports any lack of write permissions as an <codeph>INFO</codeph> message in the log file, in case that + represents an oversight. If you change HDFS permissions to make data readable or writeable by the Impala + user, issue another <codeph>REFRESH</codeph> to make Impala aware of the change. 
+ </p> + + <note conref="../shared/impala_common.xml#common/compute_stats_next"/> + + <p conref="../shared/impala_common.xml#common/s3_blurb"/> + <p conref="../shared/impala_common.xml#common/s3_metadata"/> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_hadoop.xml#intro_metastore"/>, + <xref href="impala_invalidate_metadata.xml#invalidate_metadata"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_request_pool.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_request_pool.xml b/docs/topics/impala_request_pool.xml new file mode 100644 index 0000000..cf2a811 --- /dev/null +++ b/docs/topics/impala_request_pool.xml @@ -0,0 +1,45 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.3.0" id="request_pool"> + + <title>REQUEST_POOL Query Option</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Resource Management"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Admission Control"/> + <data name="Category" value="YARN"/> + <data name="Category" value="Llama"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">REQUEST_POOL query option</indexterm> + The pool or queue name that queries should be submitted to. Only applies when you enable the Impala admission + control feature (CDH 4 or CDH 5; see <xref href="impala_admission.xml#admission_control"/>), or the YARN + resource management feature (CDH 5 only; see + <xref href="impala_resource_management.xml#resource_management"/>). Specifies the name of the pool used by + requests from Impala to the resource manager. 
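+    </p>
+
+    <p>
+      For example, to route subsequent queries in the current session to a hypothetical pool named
+      <codeph>production_pool</codeph>:
+    </p>
+
+<codeblock>SET REQUEST_POOL=production_pool;</codeblock>
+
+    <p>
+      The setting stays in effect for the rest of the session, until you change it with another
+      <codeph>SET</codeph> statement.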
+ </p> + + <p> + Formerly known as <codeph>YARN_POOL</codeph> during the CDH 5 beta period. Renamed to reflect that it can be + used both with YARN and with the lightweight admission control feature introduced in Impala 1.3. + </p> + + <p> + <b>Type:</b> <codeph>STRING</codeph> + </p> + + <p> + <b>Default:</b> empty (use the user-to-pool mapping defined by an <cmdname>impalad</cmdname> startup option + in the Impala configuration file) + </p> + +<!-- Worth adding a couple of related info links here. --> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_reservation_request_timeout.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_reservation_request_timeout.xml b/docs/topics/impala_reservation_request_timeout.xml new file mode 100644 index 0000000..0316e44 --- /dev/null +++ b/docs/topics/impala_reservation_request_timeout.xml @@ -0,0 +1,35 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="1.2" id="reservation_request_timeout"> + + <title>RESERVATION_REQUEST_TIMEOUT Query Option (CDH 5 only)</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Resource Management"/> + <data name="Category" value="YARN"/> + <data name="Category" value="Llama"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">RESERVATION_REQUEST_TIMEOUT query option</indexterm> + Maximum number of milliseconds Impala will wait for a reservation to be completely granted or denied. Used in + conjunction with the Impala resource management feature in Impala 1.2 and higher with CDH 5. + </p> + + <p> + <b>Type:</b> numeric + </p> + + <p> + <b>Default:</b> 300000 (5 minutes) + </p> + +<!-- Worth adding a couple of related info links here. 
--> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_revoke.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_revoke.xml b/docs/topics/impala_revoke.xml new file mode 100644 index 0000000..88fbbf9 --- /dev/null +++ b/docs/topics/impala_revoke.xml @@ -0,0 +1,96 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.0.0" id="revoke"> + + <title>REVOKE Statement (CDH 5.2 or higher only)</title> + <titlealts><navtitle>REVOKE (CDH 5.2 or higher only)</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="DDL"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Sentry"/> + <data name="Category" value="Roles"/> + <!-- Consider whether to go deeper into categories like Security for the Sentry-related statements. --> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">REVOKE statement</indexterm> +<!-- Copied from Sentry docs. Turn into conref. I did some rewording for clarity. --> + The <codeph>REVOKE</codeph> statement revokes roles or privileges on a specified object from groups. Only + Sentry administrative users can revoke the role from a group. The revocation has a cascading effect. For + example, revoking the <codeph>ALL</codeph> privilege on a database also revokes the same privilege for all + the tables in that database. 
+ </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock rev="2.3.0 collevelauth">REVOKE ROLE <varname>role_name</varname> FROM GROUP <varname>group_name</varname> + +REVOKE <varname>privilege</varname> ON <varname>object_type</varname> <varname>object_name</varname> + FROM [ROLE] <varname>role_name</varname> + +<ph rev="2.3.0">privilege ::= SELECT | SELECT(<varname>column_name</varname>) | INSERT | ALL</ph> +object_type ::= TABLE | DATABASE | SERVER | URI +</codeblock> + + <p> + Typically, the object name is an identifier. For URIs, it is a string literal. + </p> + + <p rev="2.3.0 collevelauth"> + The ability to grant or revoke <codeph>SELECT</codeph> privilege on specific columns is available + in CDH 5.5 / Impala 2.3 and higher. See <xref href="sg_hive_sql.xml#concept_c2q_4qx_p4/col_level_auth_sentry"/> + for details. + </p> + + <p conref="../shared/impala_common.xml#common/privileges_blurb"/> + + <p> + Only administrative users (those with <codeph>ALL</codeph> privileges on the server, defined in the Sentry + policy file) can use this statement. + </p> + +<!-- Turn compatibility info into a conref or series of conrefs. (In both GRANT and REVOKE.) --> + + <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> + + <p> + <ul> + <li> + The Impala <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements are available in CDH 5.2 and + higher. + </li> + + <li> + In CDH 5.1 and higher, Impala makes use of any roles and privileges specified by the + <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements in Hive, when your system is configured to + use the Sentry service instead of the file-based policy mechanism. + </li> + + <li> + The Impala <codeph>GRANT</codeph> and <codeph>REVOKE</codeph> statements do not require the + <codeph>ROLE</codeph> keyword to be repeated before each role name, unlike the equivalent Hive + statements. 
+ </li> + + <li conref="../shared/impala_common.xml#common/grant_revoke_single"/> + </ul> + </p> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + <xref href="impala_authorization.xml#authorization"/>, <xref href="impala_grant.xml#grant"/> + <xref href="impala_create_role.xml#create_role"/>, <xref href="impala_drop_role.xml#drop_role"/>, + <xref href="impala_show.xml#show"/> + </p> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_schema_objects.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_schema_objects.xml b/docs/topics/impala_schema_objects.xml new file mode 100644 index 0000000..d8abe12 --- /dev/null +++ b/docs/topics/impala_schema_objects.xml @@ -0,0 +1,57 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="schema_objects"> + + <title>Impala Schema Objects and Object Names</title> + <titlealts><navtitle>Schema Objects and Object Names</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Schemas"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">schema objects</indexterm> + With Impala, you work with schema objects that are familiar to database users: primarily databases, tables, views, + and functions. The SQL syntax to work with these objects is explained in + <xref href="impala_langref_sql.xml#langref_sql"/>. 
This section explains the conceptual knowledge you need to + work with these objects and the various ways to specify their names. + </p> + + <p> + Within a table, partitions can also be considered a kind of object. Partitioning is an important subject for + Impala, with its own documentation section covering use cases and performance considerations. See + <xref href="impala_partitioning.xml#partitioning"/> for details. + </p> + + <p> + Impala does not have a counterpart of the <q>tablespace</q> notion from some database systems. By default, + all the data files for a database, table, or partition are located within nested folders within the HDFS file + system. You can also specify a particular HDFS location for a given Impala table or partition. The raw data + for these objects is represented as a collection of data files, providing the flexibility to load data by + simply moving files into the expected HDFS location. + </p> + + <p> + Information about the schema objects is held in the + <xref href="impala_hadoop.xml#intro_metastore">metastore</xref> database. This database is shared between + Impala and Hive, allowing each to create, drop, and query each other's databases, tables, and so on. When + Impala makes a change to schema objects through a <codeph>CREATE</codeph>, <codeph>ALTER</codeph>, + <codeph>DROP</codeph>, <codeph>INSERT</codeph>, or <codeph>LOAD DATA</codeph> statement, it broadcasts those + changes to all nodes in the cluster through the <xref href="impala_components.xml#intro_catalogd">catalog + service</xref>. When you make such changes through Hive or directly through manipulating HDFS files, you use + the <xref href="impala_refresh.xml#refresh">REFRESH</xref> or + <xref href="impala_invalidate_metadata.xml#invalidate_metadata">INVALIDATE METADATA</xref> statements on the + Impala side to recognize the newly loaded data, new tables, and so on. 
+ </p> + + <p outputclass="toc"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_select.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_select.xml b/docs/topics/impala_select.xml new file mode 100644 index 0000000..db63f71 --- /dev/null +++ b/docs/topics/impala_select.xml @@ -0,0 +1,203 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="select"> + + <title>SELECT Statement</title> + <titlealts><navtitle>SELECT</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Reports"/> + <data name="Category" value="Tables"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <!-- This is such an important statement, think if there are more applicable categories. --> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">SELECT statement</indexterm> + The <codeph>SELECT</codeph> statement performs queries, retrieving data from one or more tables and producing + result sets consisting of rows and columns. + </p> + + <p> + The Impala <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph> statement also typically ends + with a <codeph>SELECT</codeph> statement, to define data to copy from one table to another. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>[WITH <i>name</i> AS (<i>select_expression</i>) [, ...] ] +SELECT + [ALL | DISTINCT] + [STRAIGHT_JOIN] + <i>expression</i> [, <i>expression</i> ...] +FROM <i>table_reference</i> [, <i>table_reference</i> ...] 
+[[FULL | [LEFT | RIGHT] INNER | [LEFT | RIGHT] OUTER | [LEFT | RIGHT] SEMI | [LEFT | RIGHT] ANTI | CROSS]
+  JOIN <i>table_reference</i>
+  [ON <i>join_equality_clauses</i> | USING (<varname>col1</varname>[, <varname>col2</varname> ...])]] ...
+WHERE <i>conditions</i>
+GROUP BY { <i>column</i> | <i>expression</i> [, ...] }
+HAVING <i>conditions</i>
+ORDER BY { <i>column</i> | <i>expression</i> [ASC | DESC] [NULLS FIRST | NULLS LAST] [, ...] }
+LIMIT <i>expression</i> [OFFSET <i>expression</i>]
+[UNION [ALL] <i>select_statement</i>] ...
+</codeblock>
+
+    <p>
+      Impala <codeph>SELECT</codeph> queries support:
+    </p>
+
+    <ul>
+      <li>
+        SQL scalar data types: <codeph><xref href="impala_boolean.xml#boolean">BOOLEAN</xref></codeph>,
+        <codeph><xref href="impala_tinyint.xml#tinyint">TINYINT</xref></codeph>,
+        <codeph><xref href="impala_smallint.xml#smallint">SMALLINT</xref></codeph>,
+        <codeph><xref href="impala_int.xml#int">INT</xref></codeph>,
+        <codeph><xref href="impala_bigint.xml#bigint">BIGINT</xref></codeph>,
+        <codeph><xref href="impala_decimal.xml#decimal">DECIMAL</xref></codeph>,
+        <codeph><xref href="impala_float.xml#float">FLOAT</xref></codeph>,
+        <codeph><xref href="impala_double.xml#double">DOUBLE</xref></codeph>,
+        <codeph><xref href="impala_timestamp.xml#timestamp">TIMESTAMP</xref></codeph>,
+        <codeph><xref href="impala_string.xml#string">STRING</xref></codeph>,
+        <codeph><xref href="impala_varchar.xml#varchar">VARCHAR</xref></codeph>,
+        <codeph><xref href="impala_char.xml#char">CHAR</xref></codeph>.
+      </li>
+
+<!-- To do: Consider promoting 'querying complex types' to its own subtopic or pseudo-heading. -->
+      <li rev="2.3.0">
+        The complex data types <codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>
+        are available in CDH 5.5 / Impala 2.3 and higher.
+        Queries involving these types typically use special qualified names
+        with dot notation to refer to the complex column fields,
+        and join clauses to bring the complex columns into the result set.
+        See <xref href="impala_complex_types.xml#complex_types"/> for details.
+      </li>
+
+      <li rev="1.1">
+        An optional <xref href="impala_with.xml#with"><codeph>WITH</codeph> clause</xref> before the
+        <codeph>SELECT</codeph> keyword, to define a subquery whose name or column names can be referenced
+        later in the main query. This clause lets you abstract repeated clauses, such as aggregation functions,
+        that are referenced multiple times in the same query.
+      </li>
+
+      <li>
+        By default, one <codeph>DISTINCT</codeph> clause per query. See <xref href="impala_distinct.xml#distinct"/>
+        for details. See <xref href="impala_appx_count_distinct.xml#appx_count_distinct"/> for a query option to
+        allow multiple <codeph>COUNT(DISTINCT)</codeph> expressions in the same query.
+      </li>
+
+      <li>
+        Subqueries in a <codeph>FROM</codeph> clause. In CDH 5.2 / Impala 2.0 and higher,
+        subqueries can also go in the <codeph>WHERE</codeph> clause, for example with the
+        <codeph>IN()</codeph>, <codeph>EXISTS</codeph>, and <codeph>NOT EXISTS</codeph> operators.
+      </li>
+
+      <li>
+        <codeph>WHERE</codeph>, <codeph>GROUP BY</codeph>, and <codeph>HAVING</codeph> clauses.
+      </li>
+
+      <li rev="obwl">
+        <codeph><xref href="impala_order_by.xml#order_by">ORDER BY</xref></codeph>. Prior to Impala 1.4.0, Impala
+        required that queries using an <codeph>ORDER BY</codeph> clause also include a
+        <codeph><xref href="impala_limit.xml#limit">LIMIT</xref></codeph> clause. In Impala 1.4.0 and higher, this
+        restriction is lifted; sort operations that would exceed the Impala memory limit automatically use a
+        temporary disk work area to perform the sort.
+ </li> + + <li> + <p conref="../shared/impala_common.xml#common/join_types"/> + <p> + See <xref href="impala_joins.xml#joins"/> for details and examples of join queries. + </p> + </li> + + <li> + <codeph>UNION ALL</codeph>. + </li> + + <li> + <codeph>LIMIT</codeph>. + </li> + + <li> + External tables. + </li> + + <li> + Relational operators such as greater than, less than, or equal to. + </li> + + <li> + Arithmetic operators such as addition or subtraction. + </li> + + <li> + Logical/Boolean operators <codeph>AND</codeph>, <codeph>OR</codeph>, and <codeph>NOT</codeph>. Impala does + not support the corresponding symbols <codeph>&&</codeph>, <codeph>||</codeph>, and + <codeph>!</codeph>. + </li> + + <li> + Common SQL built-in functions such as <codeph>COUNT</codeph>, <codeph>SUM</codeph>, <codeph>CAST</codeph>, + <codeph>LIKE</codeph>, <codeph>IN</codeph>, <codeph>BETWEEN</codeph>, and <codeph>COALESCE</codeph>. Impala + specifically supports built-ins described in <xref href="impala_functions.xml#builtins"/>. + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/ignore_file_extensions"/> + + <p conref="../shared/impala_common.xml#common/security_blurb"/> + <p conref="../shared/impala_common.xml#common/redaction_yes"/> + + <p conref="../shared/impala_common.xml#common/cancel_blurb_yes"/> + + <p conref="../shared/impala_common.xml#common/permissions_blurb"/> + <p rev="CDH-19187"> + The user ID that the <cmdname>impalad</cmdname> daemon runs under, + typically the <codeph>impala</codeph> user, must have read + permissions for the files in all applicable directories in all source tables, + and read and execute permissions for the relevant data directories. + (A <codeph>SELECT</codeph> operation could read files from multiple different HDFS directories + if the source table is partitioned.) + If a query attempts to read a data file and is unable to because of an HDFS permission error, + the query halts and does not return any further results. 
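+    </p>
+
+    <p>
+      As a simple illustration of several supported clauses used together (the table and column names are
+      hypothetical):
+    </p>
+
+<codeblock>SELECT c1, COUNT(c2) AS cnt
+FROM t1 JOIN t2 USING (id)
+WHERE c3 BETWEEN 1 AND 100
+GROUP BY c1
+HAVING COUNT(c2) > 10
+ORDER BY cnt DESC
+LIMIT 5;</codeblock>
+
+    <p>
+      Each clause shown here is described in more detail in the related statements and topics listed below.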
+ </p> + + <p outputclass="toc"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + + <p> + The <codeph>SELECT</codeph> syntax is so extensive that it forms its own category of statements: queries. The + other major classifications of SQL statements are data definition language (see + <xref href="impala_ddl.xml#ddl"/>) and data manipulation language (see <xref href="impala_dml.xml#dml"/>). + </p> + + <p> + Because the focus of Impala is on fast queries with interactive response times over huge data sets, query + performance and scalability are important considerations. See + <xref href="impala_performance.xml#performance"/> and <xref href="impala_scalability.xml#scalability"/> for + details. + </p> + </conbody> + + <concept id="where" audience="Cloudera"> + +<!-- WHERE hidden for the moment until there's the chance to add some reasonably comprehensive content + + and make it its own file. --> + + <title>WHERE Clause</title> + + <conbody> + + <p/> + </conbody> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/463ddf92/docs/topics/impala_set.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_set.xml b/docs/topics/impala_set.xml new file mode 100644 index 0000000..afa6777 --- /dev/null +++ b/docs/topics/impala_set.xml @@ -0,0 +1,90 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.0.0" id="set"> + + <title>SET Statement</title> + <titlealts><navtitle>SET</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="SQL"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Configuring"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">SET statement</indexterm> + Specifies values for query options that control the runtime behavior of other statements within the same + 
session. + </p> + + <p conref="../shared/impala_common.xml#common/syntax_blurb"/> + +<codeblock>SET [<varname>query_option</varname>=<varname>option_value</varname>] +</codeblock> + + <p> + <codeph>SET</codeph> with no arguments returns a result set consisting of all available query options and + their current values. + </p> + + <p> + The query option name and any string argument values are case-insensitive. + </p> + + <p> + Each query option has a specific allowed notation for its arguments. Boolean options can be enabled and + disabled by assigning values of either <codeph>true</codeph> and <codeph>false</codeph>, or + <codeph>1</codeph> and <codeph>0</codeph>. Some numeric options accept a final character signifying the unit, + such as <codeph>2g</codeph> for 2 gigabytes or <codeph>100m</codeph> for 100 megabytes. See + <xref href="impala_query_options.xml#query_options"/> for the details of each query option. + </p> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + <codeph>MEM_LIMIT</codeph> is probably the most commonly used query option. You can specify a high value to + allow a resource-intensive query to complete. For testing how queries would work on memory-constrained + systems, you might specify an artificially low value. 
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+    <p>
+      The following example sets some numeric and some Boolean query options to control usage of memory, disk
+      space, and timeout periods, then runs a query whose success could depend on the options in effect:
+    </p>
+
+<codeblock>set mem_limit=64g;
+set DISABLE_UNSAFE_SPILLS=true;
+set parquet_file_size=400m;
+set RESERVATION_REQUEST_TIMEOUT=900000;
+insert overwrite parquet_table select c1, c2, count(c3) from text_table group by c1, c2, c3;
+</codeblock>
+
+    <p conref="../shared/impala_common.xml#common/added_in_20"/>
+
+    <p>
+      <codeph>SET</codeph> has always been available as an <cmdname>impala-shell</cmdname> command. Promoting it to
+      a SQL statement lets you use this feature in client applications through the JDBC and ODBC APIs.
+    </p>
+
+<!-- <p conref="/Content/impala_common_xi44078.xml#common/jdbc_blurb"/> -->
+
+    <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/permissions_blurb_no"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+
+    <p>
+      See <xref href="impala_query_options.xml#query_options"/> for the query options you can adjust using this
+      statement.
+    </p>
+  </conbody>
+</concept>
