http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_known_issues.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_known_issues.xml b/docs/topics/impala_known_issues.xml new file mode 100644 index 0000000..7b9ec2b --- /dev/null +++ b/docs/topics/impala_known_issues.xml @@ -0,0 +1,1812 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="ver" id="known_issues"> + + <title><ph audience="standalone">Known Issues and Workarounds in Impala</ph><ph audience="integrated">Apache Impala (incubating) Known Issues</ph></title> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Release Notes"/> + <data name="Category" value="Known Issues"/> + <data name="Category" value="Troubleshooting"/> + <data name="Category" value="Upgrading"/> + <data name="Category" value="Administrators"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + The following sections describe known issues and workarounds in Impala, as of the current production release. This page summarizes the + most serious or frequently encountered issues in the current release, to help you make planning decisions about installing and + upgrading. Any workarounds are listed here. The bug links take you to the Impala issues site, where you can see the diagnosis and + whether a fix is in the pipeline. + </p> + + <note> + The online issue tracking system for Impala contains comprehensive information and is updated in real time. To verify whether an issue + you are experiencing has already been reported, or which release an issue is fixed in, search on the + <xref href="https://issues.cloudera.org/" scope="external" format="html">issues.cloudera.org JIRA tracker</xref>. 
+ </note> + + <p outputclass="toc inpage"/> + + <p> + For issues fixed in various Impala releases, see <xref href="impala_fixed_issues.xml#fixed_issues"/>. + </p> + +<!-- Use as a template for new issues. + <concept id=""> + <title></title> + <conbody> + <p> + </p> + <p><b>Bug:</b> <xref href="https://issues.cloudera.org/browse/" scope="external" format="html"></xref></p> + <p><b>Severity:</b> High</p> + <p><b>Resolution:</b> </p> + <p><b>Workaround:</b> </p> + </conbody> + </concept> + +--> + + </conbody> + +<!-- New known issues for CDH 5.5 / Impala 2.3. + +Title: Server-to-server SSL and Kerberos do not work together +Description: If server<->server SSL is enabled (with ssl_client_ca_certificate), and Kerberos auth is used between servers, the cluster will fail to start. +Upstream & Internal JIRAs: https://issues.cloudera.org/browse/IMPALA-2598 +Severity: Medium. Server-to-server SSL is practically unusable but this is a new feature. +Workaround: No known workaround. + +Title: Queries may hang on server-to-server exchange errors +Description: The DataStreamSender::Channel::CloseInternal() does not close the channel on an error. This will cause the node on the other side of the channel to wait indefinitely causing a hang. +Upstream & Internal JIRAs: https://issues.cloudera.org/browse/IMPALA-2592 +Severity: Low. This does not occur frequently. +Workaround: No known workaround. + +Title: Catalogd may crash when loading metadata for tables with many partitions, many columns and with incremental stats +Description: Incremental stats use up about 400 bytes per partition X column. So for a table with 20K partitions and 100 columns this is about 800 MB. When serialized this goes past the 2 GB Java array size limit and leads to a catalog crash. +Upstream & Internal JIRAs: https://issues.cloudera.org/browse/IMPALA-2648, IMPALA-2647, IMPALA-2649. +Severity: Low. This does not occur frequently. +Workaround: Reduce the number of partitions. 
+ +More from: https://issues.cloudera.org/browse/IMPALA-2093?filter=11278&jql=project%20%3D%20IMPALA%20AND%20priority%20in%20(blocker%2C%20critical)%20AND%20status%20in%20(open%2C%20Reopened)%20AND%20labels%20%3D%20correctness%20ORDER%20BY%20priority%20DESC + +IMPALA-2093 +Wrong plan of NOT IN aggregate subquery when a constant is used in subquery predicate +IMPALA-1652 +Incorrect results with basic predicate on CHAR typed column. +IMPALA-1459 +Incorrect assignment of predicates through an outer join in an inline view. +IMPALA-2665 +Incorrect assignment of On-clause predicate inside inline view with an outer join. +IMPALA-2603 +Crash: impala::Coordinator::ValidateCollectionSlots +IMPALA-2375 +Fix issues with the legacy join and agg nodes using enable_partitioned_hash_join=false and enable_partitioned_aggregation=false +IMPALA-1862 +Invalid bool value not reported as a scanner error +IMPALA-1792 +ImpalaODBC: Can not get the value in the SQLGetData(m-x th column) after the SQLBindCol(m th column) +IMPALA-1578 +Impala incorrectly handles text data when the new line character \n\r is split between different HDFS block +IMPALA-2643 +Duplicated column in inline view causes dropping null slots during scan +IMPALA-2005 +A failed CTAS does not drop the table if the insert fails. +IMPALA-1821 +Casting scenarios with invalid/inconsistent results + +Another list from Alex, of correctness problems with predicates; might overlap with ones I already have: + +https://issues.cloudera.org/browse/IMPALA-2665 - Already have +https://issues.cloudera.org/browse/IMPALA-2643 - Already have +https://issues.cloudera.org/browse/IMPALA-1459 - Already have +https://issues.cloudera.org/browse/IMPALA-2144 - Don't have + +--> + + <concept id="known_issues_crash"> + + <title>Impala Known Issues: Crashes and Hangs</title> + + <conbody> + + <p> + These issues can cause Impala to quit or become unresponsive. 
</p> + + </conbody> + + </concept> + + <concept id="IMPALA-3069" rev="IMPALA-3069"> + + <title>Setting BATCH_SIZE query option too large can cause a crash</title> + + <conbody> + + <p> + Using a value in the millions for the <codeph>BATCH_SIZE</codeph> query option, together with wide rows or large string values in + columns, could cause a memory allocation of more than 2 GB, resulting in a crash. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-3069" scope="external" format="html">IMPALA-3069</xref> + </p> + + <p> + <b>Severity:</b> High + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.9.0 / Impala 2.7.0.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-3441" rev="IMPALA-3441"> + + <title>Queries on malformed Avro data can cause a crash</title> + + <conbody> + + <p> + Malformed Avro data, such as out-of-bounds integers or values in the wrong format, could cause a crash when queried. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-3441" scope="external" format="html">IMPALA-3441</xref> + </p> + + <p> + <b>Severity:</b> High + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.9.0 / Impala 2.7.0 and CDH 5.8.2 / Impala 2.6.2.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-2592" rev="IMPALA-2592"> + + <title>Queries may hang on server-to-server exchange errors</title> + + <conbody> + + <p> + The <codeph>DataStreamSender::Channel::CloseInternal()</codeph> method does not close the channel on an error. This causes the node on + the other side of the channel to wait indefinitely, resulting in a hang. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2592" scope="external" format="html">IMPALA-2592</xref> + </p> + + <p> + <b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0.
</p> + + </conbody> + + </concept> + + <concept id="IMPALA-2365" rev="IMPALA-2365"> + + <title>impalad crashes if the UDF JAR is not available in its HDFS location</title> + + <conbody> + + <p> + If the JAR file corresponding to a Java UDF is removed from HDFS after the Impala <codeph>CREATE FUNCTION</codeph> statement is + issued, the <cmdname>impalad</cmdname> daemon crashes. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2365" scope="external" format="html">IMPALA-2365</xref> + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0.</p> + + </conbody> + + </concept> + + </concept> + + <concept id="known_issues_performance"> + + <title id="ki_performance">Impala Known Issues: Performance</title> + + <conbody> + + <p> + These issues involve the performance of operations such as queries or DDL statements. + </p> + + </conbody> + + <concept id="IMPALA-1480" rev="IMPALA-1480"> + +<!-- Not part of Alex's spreadsheet. Spreadsheet has IMPALA-1423 which mentions it's similar to this one but not a duplicate. --> + + <title>Slow DDL statements for tables with a large number of partitions</title> + + <conbody> + + <p> + DDL statements for tables with a large number of partitions might be slow. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1480" scope="external" format="html">IMPALA-1480</xref> + </p> + + <p> + <b>Workaround:</b> Run the DDL statement in Hive if the slowness is an issue. + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0.</p> + + </conbody> + + </concept> + + </concept> + + <concept id="known_issues_usability"> + + <title id="ki_usability">Impala Known Issues: Usability</title> + + <conbody> + + <p> + These issues affect the convenience of interacting directly with Impala, typically through the Impala shell or Hue.
</p> + + </conbody> + + <concept id="IMPALA-3133" rev="IMPALA-3133"> + + <title>Unexpected privileges in SHOW output</title> + + <conbody> + + <p> + Due to a timing condition in updating cached policy data from Sentry, the <codeph>SHOW</codeph> statements for Sentry roles could + sometimes display out-of-date role settings. Because Impala rechecks authorization for each SQL statement, this discrepancy does + not represent a security issue for other statements. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-3133" scope="external" format="html">IMPALA-3133</xref> + </p> + + <p> + <b>Severity:</b> High + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.8.0 / Impala 2.6.0 and CDH 5.7.1 / Impala 2.5.1.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-1776" rev="IMPALA-1776"> + + <title>Less than 100% progress on completed simple SELECT queries</title> + + <conbody> + + <p> + Simple <codeph>SELECT</codeph> queries show less than 100% progress even though they are already completed.
+ </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1776" scope="external" format="html">IMPALA-1776</xref> + </p> + + </conbody> + + </concept> + + <concept id="concept_lmx_dk5_lx"> + + <title>Unexpected column overflow behavior with INT datatypes</title> + + <conbody> + + <p conref="../shared/impala_common.xml#common/int_overflow_behavior" /> + + <p> + <b>Bug:</b> + <xref href="https://issues.cloudera.org/browse/IMPALA-3123" + scope="external" format="html">IMPALA-3123</xref> + </p> + + </conbody> + + </concept> + + </concept> + + <concept id="known_issues_drivers"> + + <title id="ki_drivers">Impala Known Issues: JDBC and ODBC Drivers</title> + + <conbody> + + <p> + These issues affect applications that use the JDBC or ODBC APIs, such as business intelligence tools or custom-written applications + in languages such as Java or C++. + </p> + + </conbody> + + <concept id="IMPALA-1792" rev="IMPALA-1792"> + +<!-- Not part of Alex's spreadsheet --> + + <title>ImpalaODBC: Can not get the value in the SQLGetData(m-x th column) after the SQLBindCol(m th column)</title> + + <conbody> + + <p> + If the ODBC <codeph>SQLGetData</codeph> is called on a series of columns, the function calls must follow the same order as the + columns. For example, if data is fetched from column 2 then column 1, the <codeph>SQLGetData</codeph> call for column 1 returns + <codeph>NULL</codeph>. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1792" scope="external" format="html">IMPALA-1792</xref> + </p> + + <p> + <b>Workaround:</b> Fetch columns in the same order they are defined in the table. + </p> + + </conbody> + + </concept> + + </concept> + + <concept id="known_issues_security"> + + <title id="ki_security">Impala Known Issues: Security</title> + + <conbody> + + <p> + These issues relate to security features, such as Kerberos authentication, Sentry authorization, encryption, auditing, and + redaction. 
</p> + + </conbody> + +<!-- To do: Hiding for the moment. https://jira.cloudera.com/browse/CDH-38736 reports the issue is fixed. --> + + <concept id="impala-shell_ssl_dependency" audience="Cloudera" rev="impala-shell_ssl_dependency"> + + <title>impala-shell requires Python with ssl module</title> + + <conbody> + + <p> + On CentOS 5.10 and Oracle Linux 5.11 using the built-in Python 2.4, invoking the <cmdname>impala-shell</cmdname> with the + <codeph>--ssl</codeph> option might fail with the following error: + </p> + +<codeblock> +Unable to import the python 'ssl' module. It is required for an SSL-secured connection. +</codeblock> + +<!-- No associated IMPALA-* JIRA... It is the internal JIRA CDH-38736. --> + + <p> + <b>Severity:</b> Low, workaround available + </p> + + <p> + <b>Resolution:</b> Customers are less likely to experience this issue over time, because the <codeph>ssl</codeph> module is included + in newer Python releases packaged with recent Linux releases. + </p> + + <p> + <b>Workaround:</b> To use SSL with <cmdname>impala-shell</cmdname> on these platform versions, install the <codeph>ssl</codeph> + Python module: + </p> + +<codeblock> +yum install python-ssl +</codeblock> + + <p> + Then <cmdname>impala-shell</cmdname> can run when using SSL. For example: + </p> + +<codeblock> +impala-shell -s impala --ssl --ca_cert /path_to_truststore/truststore.pem +</codeblock> + + </conbody> + + </concept> + + <concept id="renewable_kerberos_tickets"> + +<!-- Not part of Alex's spreadsheet. Not associated with a JIRA number AFAIK. --> + + <title>Kerberos tickets must be renewable</title> + + <conbody> + + <p> + In a Kerberos environment, the <cmdname>impalad</cmdname> daemon might not start if Kerberos tickets are not renewable. + </p> + + <p> + <b>Workaround:</b> Configure your KDC to allow tickets to be renewed, and configure <filepath>krb5.conf</filepath> to request + renewable tickets. + </p> + + </conbody> + + </concept> + +<!-- To do: Fixed in 2.5.0, 2.3.2.
Commenting out until I see how it can fix into "known issues now fixed" convention. + That set of fix releases looks incomplete so probably have to do some detective work with the JIRA. + https://issues.cloudera.org/browse/IMPALA-2598 + <concept id="IMPALA-2598" rev="IMPALA-2598"> + + <title>Server-to-server SSL and Kerberos do not work together</title> + + <conbody> + + <p> + If SSL is enabled between internal Impala components (with <codeph>ssl_client_ca_certificate</codeph>), and Kerberos + authentication is used between servers, the cluster fails to start. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2598" scope="external" format="html">IMPALA-2598</xref> + </p> + + <p> + <b>Workaround:</b> Do not use the new <codeph>ssl_client_ca_certificate</codeph> setting on Kerberos-enabled clusters until this + issue is resolved. + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0 and CDH 5.5.2 / Impala 2.3.2.</p> + + </conbody> + + </concept> +--> + + </concept> + +<!-- + <concept id="known_issues_supportability"> + + <title id="ki_supportability">Impala Known Issues: Supportability</title> + + <conbody> + + <p> + These issues affect the ability to debug and troubleshoot Impala, such as incorrect output in query profiles or the query state + shown in monitoring applications. + </p> + + </conbody> + + </concept> +--> + + <concept id="known_issues_resources"> + + <title id="ki_resources">Impala Known Issues: Resources</title> + + <conbody> + + <p> + These issues involve memory or disk usage, including out-of-memory conditions, the spill-to-disk feature, and resource management + features. 
</p> + + </conbody> + + <concept id="TSB-168"> + + <title>Impala catalogd heap issues when upgrading to 5.7</title> + + <conbody> + + <p> + The default heap size for Impala <cmdname>catalogd</cmdname> has changed in CDH 5.7 / Impala 2.5 and higher: + </p> + + <ul> + <li> + <p> + Before 5.7, <cmdname>catalogd</cmdname> used the JVM's default heap size by default, which is the smaller of 1/4th of the + physical memory or 32 GB. + </p> + </li> + + <li> + <p> + Starting with CDH 5.7.0, the default <cmdname>catalogd</cmdname> heap size is 4 GB. + </p> + </li> + </ul> + + <p> + For example, on a host with 128 GB of physical memory, this change decreases the <cmdname>catalogd</cmdname> heap from 32 GB to 4 GB, which can result + in out-of-memory errors in <cmdname>catalogd</cmdname> and lead to query failures. + </p> + + <p audience="Cloudera"> + <b>Bug:</b> <xref href="https://jira.cloudera.com/browse/TSB-168" scope="external" format="html">TSB-168</xref> + </p> + + <p> + <b>Severity:</b> High + </p> + + <p> + <b>Workaround:</b> Increase the <cmdname>catalogd</cmdname> memory limit as follows. +<!-- See <xref href="impala_scalability.xml#scalability_catalog"/> for the procedure. --> +<!-- Including full details here via conref, for benefit of PDF readers or anyone else + who might have trouble seeing or following the link. --> + </p> + + <p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"/> + + </conbody> + + </concept> + + <concept id="IMPALA-3509" rev="IMPALA-3509"> + + <title>Breakpad minidumps can be very large when the thread count is high</title> + + <conbody> + + <p> + The size of the Breakpad minidump files grows linearly with the number of threads. By default, each thread adds 8 KB to the + minidump size. Minidump files could consume significant disk space when the daemons have a high number of threads.
+ </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-3509" scope="external" format="html">IMPALA-3509</xref> + </p> + + <p> + <b>Severity:</b> High + </p> + + <p> + <b>Workaround:</b> Add <codeph>--minidump_size_limit_hint_kb=<varname>size</varname></codeph> to set a soft upper limit on the + size of each minidump file. If the minidump file would exceed that limit, Impala reduces the amount of information for each thread + from 8 KB to 2 KB. (Full thread information is captured for the first 20 threads, then 2 KB per thread after that.) The minidump + file can still grow larger than the <q>hinted</q> size. For example, if you have 10,000 threads, the minidump file can be more + than 20 MB. + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-3662" rev="IMPALA-3662"> + + <title>Parquet scanner memory increase after IMPALA-2736</title> + + <conbody> + + <p> + The initial release of CDH 5.8 / Impala 2.6 sometimes has a higher peak memory usage than in previous releases while reading + Parquet files. + </p> + + <p> + CDH 5.8 / Impala 2.6 addresses the issue IMPALA-2736, which improves the efficiency of Parquet scans by up to 2x. The faster scans + may result in a higher peak memory consumption compared to earlier versions of Impala due to the new column-wise row + materialization strategy. You are likely to experience higher memory consumption in any of the following scenarios: + <ul> + <li> + <p> + Very wide rows due to projecting many columns in a scan. + </p> + </li> + + <li> + <p> + Very large rows due to big column values, for example, long strings or nested collections with many items. + </p> + </li> + + <li> + <p> + Producer/consumer speed imbalances, leading to more rows being buffered between a scan (producer) and downstream (consumer) + plan nodes. 
+ </p> + </li> + </ul> + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-3662" scope="external" format="html">IMPALA-3662</xref> + </p> + + <p> + <b>Severity:</b> High + </p> + + <p> + <b>Workaround:</b> The following query options might help to reduce memory consumption in the Parquet scanner: + <ul> + <li> + Reduce the number of scanner threads, for example: <codeph>set num_scanner_threads=30</codeph> + </li> + + <li> + Reduce the batch size, for example: <codeph>set batch_size=512</codeph> + </li> + + <li> + Increase the memory limit, for example: <codeph>set mem_limit=64g</codeph> + </li> + </ul> + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-691" rev="IMPALA-691"> + + <title>Process mem limit does not account for the JVM's memory usage</title> + +<!-- Supposed to be resolved for Impala 2.3.0. --> + + <conbody> + + <p> + Some memory allocated by the JVM used internally by Impala is not counted against the memory limit for the + <cmdname>impalad</cmdname> daemon. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-691" scope="external" format="html">IMPALA-691</xref> + </p> + + <p> + <b>Workaround:</b> To monitor overall memory usage, use the <cmdname>top</cmdname> command, or add the memory figures in the + Impala web UI <uicontrol>/memz</uicontrol> tab to JVM memory usage shown on the <uicontrol>/metrics</uicontrol> tab. + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-2375" rev="IMPALA-2375"> + +<!-- Not part of Alex's spreadsheet --> + + <title>Fix issues with the legacy join and agg nodes using --enable_partitioned_hash_join=false and --enable_partitioned_aggregation=false</title> + + <conbody> + + <p></p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2375" scope="external" format="html">IMPALA-2375</xref> + </p> + + <p> + <b>Workaround:</b> Transition away from the <q>old-style</q> join and aggregation mechanism if practical. 
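</p> + + <p> + For example (an illustrative command line, not from the original report), ensure that the <cmdname>impalad</cmdname> startup options do not set the legacy flags to <codeph>false</codeph>; the partitioned code paths are the defaults, so omitting both flags has the same effect: + </p> + +<codeblock> +impalad --enable_partitioned_hash_join=true --enable_partitioned_aggregation=true ... +</codeblock> + + <p>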
+ </p> + + <p><b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0.</p> + + </conbody> + + </concept> + + </concept> + + <concept id="known_issues_correctness"> + + <title id="ki_correctness">Impala Known Issues: Correctness</title> + + <conbody> + + <p> + These issues can cause incorrect or unexpected results from queries. They typically only arise in very specific circumstances. + </p> + + </conbody> + + <concept id="IMPALA-3084" rev="IMPALA-3084"> + + <title>Incorrect assignment of NULL checking predicate through an outer join of a nested collection.</title> + + <conbody> + + <p> + A query could return wrong results (too many or too few <codeph>NULL</codeph> values) if it referenced an outer-joined nested + collection and also contained a null-checking predicate (<codeph>IS NULL</codeph>, <codeph>IS NOT NULL</codeph>, or the + <codeph><=></codeph> operator) in the <codeph>WHERE</codeph> clause. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-3084" scope="external" format="html">IMPALA-3084</xref> + </p> + + <p> + <b>Severity:</b> High + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.9.0 / Impala 2.7.0.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-3094" rev="IMPALA-3094"> + + <title>Incorrect result due to constant evaluation in query with outer join</title> + + <conbody> + + <p> + An <codeph>OUTER JOIN</codeph> query could omit some expected result rows due to a constant such as <codeph>FALSE</codeph> in + another join clause. 
For example: + </p> + +<codeblock><![CDATA[ +explain SELECT 1 FROM alltypestiny a1 + INNER JOIN alltypesagg a2 ON a1.smallint_col = a2.year AND false + RIGHT JOIN alltypes a3 ON a1.year = a1.bigint_col; ++---------------------------------------------------------+ +| Explain String | ++---------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=1.00KB VCores=1 | +| | +| 00:EMPTYSET | ++---------------------------------------------------------+ +]]> +</codeblock> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-3094" scope="external" format="html">IMPALA-3094</xref> + </p> + + <p> + <b>Severity:</b> High + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-3126" rev="IMPALA-3126"> + + <title>Incorrect assignment of an inner join On-clause predicate through an outer join.</title> + + <conbody> + + <p> + Impala may return incorrect results for queries that have the following properties: + </p> + + <ul> + <li> + <p> + There is an INNER JOIN following a series of OUTER JOINs. + </p> + </li> + + <li> + <p> + The INNER JOIN has an On-clause with a predicate that references at least two tables that are on the nullable side of the + preceding OUTER JOINs.
+ </p> + </li> + </ul> + + <p> + The following query demonstrates the issue: + </p> + +<codeblock> +select 1 from functional.alltypes a left outer join + functional.alltypes b on a.id = b.id left outer join + functional.alltypes c on b.id = c.id right outer join + functional.alltypes d on c.id = d.id inner join functional.alltypes e +on b.int_col = c.int_col; +</codeblock> + + <p> + The following listing shows the incorrect <codeph>EXPLAIN</codeph> plan: + </p> + +<codeblock><![CDATA[ ++-----------------------------------------------------------+ +| Explain String | ++-----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=480.04MB VCores=4 | +| | +| 14:EXCHANGE [UNPARTITIONED] | +| | | +| 08:NESTED LOOP JOIN [CROSS JOIN, BROADCAST] | +| | | +| |--13:EXCHANGE [BROADCAST] | +| | | | +| | 04:SCAN HDFS [functional.alltypes e] | +| | partitions=24/24 files=24 size=478.45KB | +| | | +| 07:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] | +| | hash predicates: c.id = d.id | +| | runtime filters: RF000 <- d.id | +| | | +| |--12:EXCHANGE [HASH(d.id)] | +| | | | +| | 03:SCAN HDFS [functional.alltypes d] | +| | partitions=24/24 files=24 size=478.45KB | +| | | +| 06:HASH JOIN [LEFT OUTER JOIN, PARTITIONED] | +| | hash predicates: b.id = c.id | +| | other predicates: b.int_col = c.int_col <--- incorrect placement; should be at node 07 or 08 +| | runtime filters: RF001 <- c.int_col | +| | | +| |--11:EXCHANGE [HASH(c.id)] | +| | | | +| | 02:SCAN HDFS [functional.alltypes c] | +| | partitions=24/24 files=24 size=478.45KB | +| | runtime filters: RF000 -> c.id | +| | | +| 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] | +| | hash predicates: b.id = a.id | +| | runtime filters: RF002 <- a.id | +| | | +| |--10:EXCHANGE [HASH(a.id)] | +| | | | +| | 00:SCAN HDFS [functional.alltypes a] | +| | partitions=24/24 files=24 size=478.45KB | +| | | +| 09:EXCHANGE [HASH(b.id)] | +| | | +| 01:SCAN HDFS [functional.alltypes b] | +| partitions=24/24 files=24 
size=478.45KB | +| runtime filters: RF001 -> b.int_col, RF002 -> b.id | ++-----------------------------------------------------------+ +]]> +</codeblock> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-3126" scope="external" format="html">IMPALA-3126</xref> + </p> + + <p> + <b>Severity:</b> High + </p> + + <p> + <b>Workaround:</b> For some queries, this problem can be worked around by placing the problematic <codeph>ON</codeph> clause predicate in the + <codeph>WHERE</codeph> clause instead, or changing the preceding <codeph>OUTER JOIN</codeph>s to <codeph>INNER JOIN</codeph>s (if + the <codeph>ON</codeph> clause predicate would discard <codeph>NULL</codeph>s). For example, to fix the problematic query above: + </p> + +<codeblock><![CDATA[ +select 1 from functional.alltypes a + left outer join functional.alltypes b + on a.id = b.id + left outer join functional.alltypes c + on b.id = c.id + right outer join functional.alltypes d + on c.id = d.id + inner join functional.alltypes e +where b.int_col = c.int_col + ++-----------------------------------------------------------+ +| Explain String | ++-----------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=480.04MB VCores=4 | +| | +| 14:EXCHANGE [UNPARTITIONED] | +| | | +| 08:NESTED LOOP JOIN [CROSS JOIN, BROADCAST] | +| | | +| |--13:EXCHANGE [BROADCAST] | +| | | | +| | 04:SCAN HDFS [functional.alltypes e] | +| | partitions=24/24 files=24 size=478.45KB | +| | | +| 07:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] | +| | hash predicates: c.id = d.id | +| | other predicates: b.int_col = c.int_col <-- correct assignment +| | runtime filters: RF000 <- d.id | +| | | +| |--12:EXCHANGE [HASH(d.id)] | +| | | | +| | 03:SCAN HDFS [functional.alltypes d] | +| | partitions=24/24 files=24 size=478.45KB | +| | | +| 06:HASH JOIN [LEFT OUTER JOIN, PARTITIONED] | +| | hash predicates: b.id = c.id | +| | | +| |--11:EXCHANGE [HASH(c.id)] | +| | | | +|
| 02:SCAN HDFS [functional.alltypes c] | +| | partitions=24/24 files=24 size=478.45KB | +| | runtime filters: RF000 -> c.id | +| | | +| 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] | +| | hash predicates: b.id = a.id | +| | runtime filters: RF001 <- a.id | +| | | +| |--10:EXCHANGE [HASH(a.id)] | +| | | | +| | 00:SCAN HDFS [functional.alltypes a] | +| | partitions=24/24 files=24 size=478.45KB | +| | | +| 09:EXCHANGE [HASH(b.id)] | +| | | +| 01:SCAN HDFS [functional.alltypes b] | +| partitions=24/24 files=24 size=478.45KB | +| runtime filters: RF001 -> b.id | ++-----------------------------------------------------------+ +]]> +</codeblock> + + </conbody> + + </concept> + + <concept id="IMPALA-3006" rev="IMPALA-3006"> + + <title>Impala may use incorrect bit order with BIT_PACKED encoding</title> + + <conbody> + + <p> + Parquet <codeph>BIT_PACKED</codeph> encoding as implemented by Impala is LSB first. The parquet standard says it is MSB first. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-3006" scope="external" format="html">IMPALA-3006</xref> + </p> + + <p> + <b>Severity:</b> High, but rare in practice because BIT_PACKED is infrequently used, is not written by Impala, and is deprecated + in Parquet 2.0. + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-3082" rev="IMPALA-3082"> + + <title>BST between 1972 and 1995</title> + + <conbody> + + <p> + The calculation of start and end times for the BST (British Summer Time) time zone could be incorrect between 1972 and 1995. + Between 1972 and 1995, BST began and ended at 02:00 GMT on the third Sunday in March (or second Sunday when Easter fell on the + third) and fourth Sunday in October. 
For example, both function calls should return 13, but actually return 12, in a query such + as: + </p> + +<codeblock> +select + extract(from_utc_timestamp(cast('1970-01-01 12:00:00' as timestamp), 'Europe/London'), "hour") summer70start, + extract(from_utc_timestamp(cast('1970-12-31 12:00:00' as timestamp), 'Europe/London'), "hour") summer70end; +</codeblock> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-3082" scope="external" format="html">IMPALA-3082</xref> + </p> + + <p> + <b>Severity:</b> High + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-1170" rev="IMPALA-1170"> + + <title>parse_url() returns incorrect result if @ character in URL</title> + + <conbody> + + <p> + If a URL contains an <codeph>@</codeph> character, the <codeph>parse_url()</codeph> function could return an incorrect value for + the hostname field. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1170" scope="external" format="html">IMPALA-1170</xref> + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0 and CDH 5.5.4 / Impala 2.3.4.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-2422" rev="IMPALA-2422"> + + <title>% escaping does not work correctly when it occurs at the end of a LIKE clause</title> + + <conbody> + + <p> + If the final character in the RHS argument of a <codeph>LIKE</codeph> operator is an escaped <codeph>\%</codeph> character, it + does not match a <codeph>%</codeph> final character of the LHS argument.
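</p> + + <p> + As a minimal illustration (hypothetical values, not from the original report), the following comparison is affected because the escaped <codeph>\%</codeph> is the final pattern character; it is expected to evaluate to <codeph>true</codeph>, with <codeph>\%</codeph> matching the literal trailing <codeph>%</codeph>: + </p> + +<codeblock> +select 'ab%' like 'ab\%'; +</codeblock> + + <p>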
+ </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2422" scope="external" format="html">IMPALA-2422</xref> + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-397" rev="IMPALA-397"> + + <title>ORDER BY rand() does not work.</title> + + <conbody> + + <p> + Because the value for <codeph>rand()</codeph> is computed early in a query, using an <codeph>ORDER BY</codeph> expression + involving a call to <codeph>rand()</codeph> does not actually randomize the results. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-397" scope="external" format="html">IMPALA-397</xref> + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-2643" rev="IMPALA-2643"> + + <title>Duplicated column in inline view causes dropping null slots during scan</title> + + <conbody> + + <p> + If the same column is queried twice within a view, <codeph>NULL</codeph> values for that column are omitted. For example, the + result of <codeph>COUNT(*)</codeph> on the view could be less than expected. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2643" scope="external" format="html">IMPALA-2643</xref> + </p> + + <p> + <b>Workaround:</b> Avoid selecting the same column twice within an inline view. + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.10 / Impala 2.2.10.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-1459" rev="IMPALA-1459"> + +<!-- Not part of Alex's spreadsheet --> + + <title>Incorrect assignment of predicates through an outer join in an inline view.</title> + + <conbody> + + <p> + A query involving an <codeph>OUTER JOIN</codeph> clause where one of the table references is an inline view might apply predicates + from the <codeph>ON</codeph> clause incorrectly. 
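</p> + + <p> + A query of the following shape (hypothetical tables <codeph>t1</codeph> and <codeph>t2</codeph>, shown only to illustrate the pattern) could be affected, because the <codeph>ON</codeph> clause predicate <codeph>v.x = 10</codeph> refers to a column of the inline view on the nullable side of the outer join: + </p> + +<codeblock> +select t1.id + from t1 left outer join + (select id, x from t2) v + on t1.id = v.id and v.x = 10; +</codeblock> + + <p>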
+ </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1459" scope="external" format="html">IMPALA-1459</xref> + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.9 / Impala 2.2.9.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-2603" rev="IMPALA-2603"> + + <title>Crash: impala::Coordinator::ValidateCollectionSlots</title> + + <conbody> + + <p> + A query could encounter a serious error if it includes multiple nested levels of <codeph>INNER JOIN</codeph> clauses involving + subqueries. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2603" scope="external" format="html">IMPALA-2603</xref> + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-2665" rev="IMPALA-2665"> + + <title>Incorrect assignment of On-clause predicate inside inline view with an outer join.</title> + + <conbody> + + <p> + A query might return incorrect results due to wrong predicate assignment in the following scenario: + </p> + + <ol> + <li> + There is an inline view that contains an outer join + </li> + + <li> + That inline view is joined with another table in the enclosing query block + </li> + + <li> + That join has an On-clause containing a predicate that only references columns originating from the outer-joined tables inside + the inline view + </li> + </ol> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2665" scope="external" format="html">IMPALA-2665</xref> + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.9 / Impala 2.2.9.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-2144" rev="IMPALA-2144"> + + <title>Wrong assignment of having clause predicate across outer join</title> + + <conbody> + + <p> + In an <codeph>OUTER JOIN</codeph> query with a <codeph>HAVING</codeph> clause, the comparison from the <codeph>HAVING</codeph> + clause might be applied at the wrong stage of 
query processing, leading to incorrect results. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2144" scope="external" format="html">IMPALA-2144</xref> + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-2093" rev="IMPALA-2093"> + + <title>Wrong plan of NOT IN aggregate subquery when a constant is used in subquery predicate</title> + + <conbody> + + <p> + A <codeph>NOT IN</codeph> operator with a subquery that calls an aggregate function, such as <codeph>NOT IN (SELECT + SUM(...))</codeph>, could return incorrect results. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2093" scope="external" format="html">IMPALA-2093</xref> + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0 and CDH 5.5.4 / Impala 2.3.4.</p> + + </conbody> + + </concept> + + </concept> + + <concept id="known_issues_metadata"> + + <title id="ki_metadata">Impala Known Issues: Metadata</title> + + <conbody> + + <p> + These issues affect how Impala interacts with metadata. They cover areas such as the metastore database, the <codeph>COMPUTE + STATS</codeph> statement, and the Impala <cmdname>catalogd</cmdname> daemon. + </p> + + </conbody> + + <concept id="IMPALA-2648" rev="IMPALA-2648"> + + <title>Catalogd may crash when loading metadata for tables with many partitions, many columns and with incremental stats</title> + + <conbody> + + <p> + Incremental stats use up about 400 bytes per partition for each column. For example, for a table with 20K partitions and 100 + columns, the memory overhead from incremental statistics is about 800 MB. When serialized for transmission across the network, + this metadata exceeds the 2 GB Java array size limit and leads to a <codeph>catalogd</codeph> crash. 
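+ </p>
+
+ <p>
+ For example, computing full statistics rather than incremental statistics for such a
+ table might look like the following (the table name is illustrative):
+ </p>
+
+<codeblock>compute stats wide_partitioned_table;</codeblock>
+
+ <p>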
+ </p> + + <p> + <b>Bugs:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2647" scope="external" format="html">IMPALA-2647</xref>, + <xref href="https://issues.cloudera.org/browse/IMPALA-2648" scope="external" format="html">IMPALA-2648</xref>, + <xref href="https://issues.cloudera.org/browse/IMPALA-2649" scope="external" format="html">IMPALA-2649</xref> + </p> + + <p> + <b>Workaround:</b> If feasible, compute full stats periodically and avoid computing incremental stats for that table. The + scalability of incremental stats computation is a continuing work item. + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-1420" rev="IMPALA-1420 2.0.0"> + +<!-- Not part of Alex's spreadsheet --> + + <title>Can't update stats manually via alter table after upgrading to CDH 5.2</title> + + <conbody> + + <p> + After upgrading to CDH 5.2, an <codeph>ALTER TABLE</codeph> statement that sets only the <codeph>numRows</codeph> + property does not update the table statistics used by Impala. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1420" scope="external" format="html">IMPALA-1420</xref> + </p> + + <p> + <b>Workaround:</b> On CDH 5.2, when adjusting table statistics manually by setting the <codeph>numRows</codeph> property, you must also + enable the Boolean property <codeph>STATS_GENERATED_VIA_STATS_TASK</codeph>. For example, use a statement like the following to + set both properties with a single <codeph>ALTER TABLE</codeph> statement: + </p> + +<codeblock>ALTER TABLE <varname>table_name</varname> SET TBLPROPERTIES('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK' = 'true');</codeblock> + + <p> + <b>Resolution:</b> The underlying cause is the issue + <xref href="https://issues.apache.org/jira/browse/HIVE-8648" scope="external" format="html">HIVE-8648</xref> that affects the + metastore in Hive 0.13. The workaround is only needed until the fix for this issue is incorporated into a CDH release. 
+ </p> + + </conbody> + + </concept> + + </concept> + + <concept id="known_issues_interop"> + + <title id="ki_interop">Impala Known Issues: Interoperability</title> + + <conbody> + + <p> + These issues affect the ability to interchange data between Impala and other database systems. They cover areas such as data types + and file formats. + </p> + + </conbody> + +<!-- Opened based on CDH-41605. Not part of Alex's spreadsheet AFAIK. --> + + <concept id="CDH-41605"> + + <title>DESCRIBE FORMATTED gives error on Avro table</title> + + <conbody> + + <p> + This issue can occur either on old Avro tables (created prior to Hive 1.1 / CDH 5.4) or when changing the Avro schema file by + adding or removing columns. Columns added to the schema file will not show up in the output of the <codeph>DESCRIBE + FORMATTED</codeph> command. Removing columns from the schema file will trigger a <codeph>NullPointerException</codeph>. + </p> + + <p> + As a workaround, you can use the output of <codeph>SHOW CREATE TABLE</codeph> to drop and recreate the table. This will populate + the Hive metastore database with the correct column definitions. + </p> + + <note type="warning"> + Only use this for external tables, or Impala will remove the data files. In case of an internal table, set it to external first: +<codeblock> +ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE'); +</codeblock> + (The part in parentheses is case sensitive.) Make sure to pick the right choice between internal and external when recreating the + table. See <xref href="impala_tables.xml#tables"/> for the differences between internal and external tables. + </note> + + <p audience="Cloudera"> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/CDH-41605" scope="external" format="html">CDH-41605</xref> + </p> + + <p> + <b>Severity:</b> High + </p> + + </conbody> + + </concept> + + <concept id="IMP-469"> + +<!-- Not part of Alex's spreadsheet. 
Perhaps it really is a permanent limitation and nobody is tracking it? --> + + <title>Deviation from Hive behavior: Impala does not implicitly cast between string and numeric or Boolean types.</title> + + <conbody> + + <p audience="Cloudera"> + <b>Cloudera Bug:</b> <xref href="https://jira.cloudera.com/browse/IMP-469" scope="external" format="html">IMP-469</xref>; KI added 0.1 + <i>Cloudera internal only</i> + </p> + + <p> + <b>Anticipated Resolution:</b> None + </p> + + <p> + <b>Workaround:</b> Use explicit casts. + </p> + + </conbody> + + </concept> + + <concept id="IMP-175"> + +<!-- Not part of Alex's spreadsheet. Perhaps it really is a permanent limitation and nobody is tracking it? --> + + <title>Deviation from Hive behavior: Out-of-range float/double values are returned as the maximum allowed value of the type (Hive returns NULL)</title> + + <conbody> + + <p> + Impala behavior differs from Hive with respect to out-of-range float/double values. Out-of-range values are returned as the maximum + allowed value for the type, while Hive returns NULL. + </p> + + <p audience="Cloudera"> + <b>Cloudera Bug:</b> <xref href="https://jira.cloudera.com/browse/IMP-175" scope="external" format="html">IMP-175</xref>; KI + added 0.1 <i>Cloudera internal only</i> + </p> + + <p> + <b>Workaround:</b> None + </p> + + </conbody> + + </concept> + + <concept id="CDH-13199"> + +<!-- Not part of Alex's spreadsheet. The CDH- prefix makes it an oddball. --> + + <title>Configuration needed for Flume to be compatible with Impala</title> + + <conbody> + + <p> + For compatibility with Impala, the value for the Flume HDFS Sink <codeph>hdfs.writeFormat</codeph> must be set to + <codeph>Text</codeph>, rather than its default value of <codeph>Writable</codeph>. The <codeph>hdfs.writeFormat</codeph> setting + must be changed to <codeph>Text</codeph> before creating data files with Flume; otherwise, those files cannot be read by either + Impala or Hive. 
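+ </p>
+
+ <p>
+ For example, in a Flume agent configuration file, the sink property might be set as
+ follows (the agent and sink names are placeholders):
+ </p>
+
+<codeblock>agent1.sinks.hdfs_sink1.hdfs.writeFormat = Text</codeblock>
+
+ <p>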
+ </p> + + <p> + <b>Resolution:</b> A request has been filed to add this information to the upstream Flume documentation. + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-635" rev="IMPALA-635"> + +<!-- Not part of Alex's spreadsheet --> + + <title>Avro Scanner fails to parse some schemas</title> + + <conbody> + + <p> + Querying certain Avro tables could cause a crash or return no rows, even though Impala could <codeph>DESCRIBE</codeph> the table. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-635" scope="external" format="html">IMPALA-635</xref> + </p> + + <p> + <b>Workaround:</b> Swap the order of the fields in the schema specification. For example, <codeph>["null", "string"]</codeph> + instead of <codeph>["string", "null"]</codeph>. + </p> + + <p> + <b>Resolution:</b> Not allowing this syntax agrees with the Avro specification, so it may still cause an error even when the + crashing issue is resolved. + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-1024" rev="IMPALA-1024"> + +<!-- Not part of Alex's spreadsheet --> + + <title>Impala BE cannot parse Avro schema that contains a trailing semi-colon</title> + + <conbody> + + <p> + If an Avro table has a schema definition with a trailing semicolon, Impala encounters an error when the table is queried. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1024" scope="external" format="html">IMPALA-1024</xref> + </p> + + <p> + <b>Workaround:</b> Remove the trailing semicolon from the Avro schema. + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-2154" rev="IMPALA-2154"> + +<!-- Not part of Alex's spreadsheet --> + + <title>Fix decompressor to allow parsing gzips with multiple streams</title> + + <conbody> + + <p> + Currently, Impala can only read gzipped files containing a single stream. If a gzipped file contains multiple concatenated + streams, the Impala query only processes the data from the first stream. 
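+ </p>
+
+ <p>
+ For example, concatenating separately compressed files produces a multi-stream gzip file,
+ which can be recompressed into a single stream (file names are illustrative):
+ </p>
+
+<codeblock># Appending a second gzip stream produces a multi-stream file;
+# Impala would read only the data from the first stream.
+gzip -c part1.txt > multi.gz
+gzip -c part2.txt >> multi.gz
+# Decompressing and recompressing yields a single stream that Impala reads fully.
+zcat multi.gz | gzip -c > single.gz</codeblock>
+
+ <p>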
+ </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2154" scope="external" format="html">IMPALA-2154</xref> + </p> + + <p> + <b>Workaround:</b> Use a different gzip tool to compress the file into a single-stream file. + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.7.0 / Impala 2.5.0.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-1578" rev="IMPALA-1578"> + +<!-- Not part of Alex's spreadsheet --> + + <title>Impala incorrectly handles text data when the newline sequence \n\r is split between different HDFS blocks</title> + + <conbody> + + <p> + If a carriage return / newline pair of characters in a text table is split between HDFS data blocks, Impala incorrectly processes + the row following the <codeph>\n\r</codeph> pair twice. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1578" scope="external" format="html">IMPALA-1578</xref> + </p> + + <p> + <b>Workaround:</b> Use the Parquet format for large volumes of data where practical. + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.8.0 / Impala 2.6.0.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-1862" rev="IMPALA-1862"> + +<!-- Not part of Alex's spreadsheet --> + + <title>Invalid bool value not reported as a scanner error</title> + + <conbody> + + <p> + In some cases, an invalid <codeph>BOOLEAN</codeph> value read from a table does not produce a warning message about the bad value. + The result is still <codeph>NULL</codeph> as expected. Therefore, this is not a query correctness issue, but it could lead to + overlooking the presence of invalid data. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1862" scope="external" format="html">IMPALA-1862</xref> + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-1652" rev="IMPALA-1652"> + +<!-- To do: Isn't this more a correctness issue? 
--> + + <title>Incorrect results with basic predicate on CHAR typed column.</title> + + <conbody> + + <p> + When comparing a <codeph>CHAR</codeph> column value to a string literal, the literal value is not blank-padded and so the + comparison might fail when it should match. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1652" scope="external" format="html">IMPALA-1652</xref> + </p> + + <p> + <b>Workaround:</b> Use the <codeph>RPAD()</codeph> function to blank-pad literals compared with <codeph>CHAR</codeph> columns to + the expected length. + </p> + + </conbody> + + </concept> + + </concept> + + <concept id="known_issues_limitations"> + + <title>Impala Known Issues: Limitations</title> + + <conbody> + + <p> + These issues are current limitations of Impala that require evaluation as you plan how to integrate Impala into your data management + workflow. + </p> + + </conbody> + + <concept id="IMPALA-77" rev="IMPALA-77"> + +<!-- Not part of Alex's spreadsheet. Perhaps it really is a permanent limitation and nobody is tracking it? --> + + <title>Impala does not support running on clusters with federated namespaces</title> + + <conbody> + + <p> + Impala does not support running on clusters with federated namespaces. The <codeph>impalad</codeph> process will not start on a + node running such a filesystem based on the <codeph>org.apache.hadoop.fs.viewfs.ViewFs</codeph> class. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-77" scope="external" format="html">IMPALA-77</xref> + </p> + + <p> + <b>Anticipated Resolution:</b> Limitation + </p> + + <p> + <b>Workaround:</b> Use standard HDFS on all Impala nodes. + </p> + + </conbody> + + </concept> + + </concept> + + <concept id="known_issues_misc"> + + <title>Impala Known Issues: Miscellaneous / Older Issues</title> + + <conbody> + + <p> + These issues do not fall into one of the above categories or have not been categorized yet. 
+ </p> + + </conbody> + + <concept id="IMPALA-2005" rev="IMPALA-2005"> + +<!-- Not part of Alex's spreadsheet --> + + <title>A failed CTAS does not drop the table if the insert fails.</title> + + <conbody> + + <p> + If a <codeph>CREATE TABLE AS SELECT</codeph> operation successfully creates the target table but an error occurs while querying + the source table or copying the data, the new table is left behind rather than being dropped. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-2005" scope="external" format="html">IMPALA-2005</xref> + </p> + + <p> + <b>Workaround:</b> Drop the new table manually after a failed <codeph>CREATE TABLE AS SELECT</codeph>. + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-1821" rev="IMPALA-1821"> + +<!-- Not part of Alex's spreadsheet --> + + <title>Casting scenarios with invalid/inconsistent results</title> + + <conbody> + + <p> + Using a <codeph>CAST()</codeph> function to convert large literal values to smaller types, or to convert special values such as + <codeph>NaN</codeph> or <codeph>Inf</codeph>, produces values not consistent with other database systems. This could lead to + unexpected results from queries. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1821" scope="external" format="html">IMPALA-1821</xref> + </p> + +<!-- <p><b>Workaround:</b> Doublecheck that <codeph>CAST()</codeph> operations work as expect. The issue applies to expressions involving literals, not values read from table columns.</p> --> + + </conbody> + + </concept> + + <concept id="IMPALA-1619" rev="IMPALA-1619"> + +<!-- Not part of Alex's spreadsheet --> + + <title>Support individual memory allocations larger than 1 GB</title> + + <conbody> + + <p> + The largest single block of memory that Impala can allocate during a query is 1 GiB. 
Therefore, a query could fail or Impala could + crash if a compressed text file resulted in more than 1 GiB of data in uncompressed form, or if a string function such as + <codeph>group_concat()</codeph> returned a value greater than 1 GiB. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-1619" scope="external" format="html">IMPALA-1619</xref> + </p> + + <p><b>Resolution:</b> Fixed in CDH 5.9.0 / Impala 2.7.0 and CDH 5.8.3 / Impala 2.6.3.</p> + + </conbody> + + </concept> + + <concept id="IMPALA-941" rev="IMPALA-941"> + +<!-- Not part of Alex's spreadsheet. Maybe this is interop? --> + + <title>Impala Parser issue when using fully qualified table names that start with a number.</title> + + <conbody> + + <p> + A fully qualified table name starting with a number could cause a parsing error. In a name such as <codeph>db.571_market</codeph>, + the decimal point followed by digits is interpreted as a floating-point number. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-941" scope="external" format="html">IMPALA-941</xref> + </p> + + <p> + <b>Workaround:</b> Surround each part of the fully qualified name with backticks (<codeph>``</codeph>). + </p> + + </conbody> + + </concept> + + <concept id="IMPALA-532" rev="IMPALA-532"> + +<!-- Not part of Alex's spreadsheet. Perhaps it really is a permanent limitation and nobody is tracking it? --> + + <title>Impala should tolerate bad locale settings</title> + + <conbody> + + <p> + If the <codeph>LC_*</codeph> environment variables specify an unsupported locale, Impala does not start. + </p> + + <p> + <b>Bug:</b> <xref href="https://issues.cloudera.org/browse/IMPALA-532" scope="external" format="html">IMPALA-532</xref> + </p> + + <p> + <b>Workaround:</b> Add <codeph>LC_ALL="C"</codeph> to the environment settings for both the Impala daemon and the Statestore + daemon. 
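+ </p>
+
+ <p>
+ For example, the startup environment for those daemons might include a setting such as:
+ </p>
+
+<codeblock>export LC_ALL="C"</codeblock>
+
+ <p>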
See <xref href="impala_config_options.xml#config_options"/> for details about modifying these environment settings. + </p> + + <p> + <b>Resolution:</b> Fixing this issue would require an upgrade to Boost 1.47 in the Impala distribution. + </p> + + </conbody> + + </concept> + + <concept id="IMP-1203"> + +<!-- Not part of Alex's spreadsheet. Perhaps it really is a permanent limitation and nobody is tracking it? --> + + <title>Log Level 3 Not Recommended for Impala</title> + + <conbody> + + <p> + The extensive logging produced by log level 3 can cause serious performance overhead and capacity issues. + </p> + + <p> + <b>Workaround:</b> Reduce the log level to its default value of 1, that is, <codeph>GLOG_v=1</codeph>. See + <xref href="impala_logging.xml#log_levels"/> for details about the effects of setting different logging levels. + </p> + + </conbody> + + </concept> + + </concept> + +</concept>
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_max_block_mgr_memory.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_max_block_mgr_memory.xml b/docs/topics/impala_max_block_mgr_memory.xml new file mode 100644 index 0000000..3bf8ac8 --- /dev/null +++ b/docs/topics/impala_max_block_mgr_memory.xml @@ -0,0 +1,30 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.1.0" id="max_block_mgr_memory"> + + <title>MAX_BLOCK_MGR_MEMORY</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Memory"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.1.0"> + <indexterm audience="Cloudera">MAX_BLOCK_MGR_MEMORY query option</indexterm> + </p> + + <p></p> + + <p> + <b>Default:</b> + </p> + + <p conref="../shared/impala_common.xml#common/added_in_20"/> + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_max_num_runtime_filters.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_max_num_runtime_filters.xml b/docs/topics/impala_max_num_runtime_filters.xml new file mode 100644 index 0000000..90e91dc --- /dev/null +++ b/docs/topics/impala_max_num_runtime_filters.xml @@ -0,0 +1,61 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="max_num_runtime_filters" rev="2.5.0"> + + <title>MAX_NUM_RUNTIME_FILTERS Query Option (CDH 5.7 or higher only)</title> + <titlealts audience="PDF"><navtitle>MAX_NUM_RUNTIME_FILTERS</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" 
value="Impala Query Options"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.5.0"> + <indexterm audience="Cloudera">MAX_NUM_RUNTIME_FILTERS query option</indexterm> + The <codeph>MAX_NUM_RUNTIME_FILTERS</codeph> query option + sets an upper limit on the number of runtime filters that can be produced for each query. + </p> + + <p conref="../shared/impala_common.xml#common/type_integer"/> + + <p> + <b>Default:</b> 10 + </p> + + <p conref="../shared/impala_common.xml#common/added_in_250"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + Each runtime filter imposes some memory overhead on the query. + Depending on the setting of the <codeph>RUNTIME_BLOOM_FILTER_SIZE</codeph> + query option, each filter might consume between 1 and 16 megabytes + per plan fragment. There are typically 5 or fewer filters per plan fragment. + </p> + + <p> + Impala evaluates the effectiveness of each filter, and keeps the + ones that eliminate the largest number of partitions or rows. + Therefore, this setting can protect against + potential problems due to excessive memory overhead for filter production, + while still allowing a high level of optimization for suitable queries. 
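+ </p>
+
+ <p>
+ For example, to lower the limit in a session where filter memory overhead is a concern:
+ </p>
+
+<codeblock>set MAX_NUM_RUNTIME_FILTERS=5;</codeblock>
+
+ <p>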
+ </p> + + <p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/> + + <p conref="../shared/impala_common.xml#common/related_info"/> + <p> + <xref href="impala_runtime_filtering.xml"/>, + <!-- <xref href="impala_partitioning.xml#dynamic_partition_pruning"/>, --> + <xref href="impala_runtime_bloom_filter_size.xml#runtime_bloom_filter_size"/>, + <xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/> + </p> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_optimize_partition_key_scans.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_optimize_partition_key_scans.xml b/docs/topics/impala_optimize_partition_key_scans.xml new file mode 100644 index 0000000..60635ff --- /dev/null +++ b/docs/topics/impala_optimize_partition_key_scans.xml @@ -0,0 +1,180 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept rev="2.5.0 IMPALA-2499" id="optimize_partition_key_scans"> + + <title>OPTIMIZE_PARTITION_KEY_SCANS Query Option (CDH 5.7 or higher only)</title> + <titlealts audience="PDF"><navtitle>OPTIMIZE_PARTITION_KEY_SCANS</navtitle></titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Querying"/> + <data name="Category" value="Performance"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.5.0 IMPALA-2499"> + <indexterm audience="Cloudera">OPTIMIZE_PARTITION_KEY_SCANS query option</indexterm> + Enables a fast code path for queries that apply simple aggregate functions to partition key + columns: <codeph>MIN(<varname>key_column</varname>)</codeph>, <codeph>MAX(<varname>key_column</varname>)</codeph>, + or <codeph>COUNT(DISTINCT 
<varname>key_column</varname>)</codeph>. + </p> + + <p conref="../shared/impala_common.xml#common/type_boolean"/> + <p conref="../shared/impala_common.xml#common/default_false_0"/> + + <note conref="../shared/impala_common.xml#common/one_but_not_true"/> + + <p conref="../shared/impala_common.xml#common/added_in_250"/> + + <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> + + <p> + This optimization speeds up common <q>introspection</q> operations when using queries + to calculate the cardinality and range for partition key columns. + </p> + + <p> + This optimization does not apply if the queries contain any <codeph>WHERE</codeph>, + <codeph>GROUP BY</codeph>, or <codeph>HAVING</codeph> clause. The relevant queries + should only compute the minimum, maximum, or number of distinct values for the + partition key columns across the whole table. + </p> + + <p> + This optimization is enabled by a query option because it skips some consistency checks + and therefore can return slightly different partition values if partitions are in the + process of being added, dropped, or loaded outside of Impala. Queries might exhibit different + behavior depending on the setting of this option in the following cases: + </p> + + <ul> + <li> + <p> + If files are removed from a partition using HDFS or other non-Impala operations, + there is a period until the next <codeph>REFRESH</codeph> of the table where regular + queries fail at run time because they detect the missing files. With this optimization + enabled, queries that evaluate only the partition key column values (not the contents of + the partition itself) succeed, and treat the partition as if it still exists. + </p> + </li> + <li> + <p> + If a partition contains any data files, but the data files do not contain any rows, + a regular query considers that the partition does not exist. With this optimization + enabled, the partition is treated as if it exists. 
+ </p> + <p> + If the partition includes no files at all, this optimization does not change the query + behavior: the partition is considered to not exist whether or not this optimization is enabled. + </p> + </li> + </ul> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + <p> + The following example shows initial schema setup and the default behavior of queries that + return just the partition key column for a table: + </p> + +<codeblock> +-- Make a partitioned table with 3 partitions. +create table t1 (s string) partitioned by (year int); +insert into t1 partition (year=2015) values ('last year'); +insert into t1 partition (year=2016) values ('this year'); +insert into t1 partition (year=2017) values ('next year'); + +-- Regardless of the option setting, this query must read the +-- data files to know how many rows to return for each year value. +explain select year from t1; ++-----------------------------------------------------+ +| Explain String | ++-----------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=0B VCores=0 | +| | +| F00:PLAN FRAGMENT [UNPARTITIONED] | +| 00:SCAN HDFS [key_cols.t1] | +| partitions=3/3 files=4 size=40B | +| table stats: 3 rows total | +| column stats: all | +| hosts=3 per-host-mem=unavailable | +| tuple-ids=0 row-size=4B cardinality=3 | ++-----------------------------------------------------+ + +-- The aggregation operation means the query does not need to read +-- the data within each partition: the result set contains exactly 1 row +-- per partition, derived from the partition key column value. +-- By default, Impala still includes a 'scan' operation in the query. 
+explain select distinct year from t1; ++------------------------------------------------------------------------------------+ +| Explain String | ++------------------------------------------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=0B VCores=0 | +| | +| 01:AGGREGATE [FINALIZE] | +| | group by: year | +| | | +| 00:SCAN HDFS [key_cols.t1] | +| partitions=0/0 files=0 size=0B | ++------------------------------------------------------------------------------------+ +</codeblock> + + <p> + The following examples show how the plan is made more efficient when the + <codeph>OPTIMIZE_PARTITION_KEY_SCANS</codeph> option is enabled: + </p> + +<codeblock> +set optimize_partition_key_scans=1; +OPTIMIZE_PARTITION_KEY_SCANS set to 1 + +-- The aggregation operation is turned into a UNION internally, +-- with constant values known in advance based on the metadata +-- for the partitioned table. +explain select distinct year from t1; ++-----------------------------------------------------+ +| Explain String | ++-----------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=0B VCores=0 | +| | +| F00:PLAN FRAGMENT [UNPARTITIONED] | +| 01:AGGREGATE [FINALIZE] | +| | group by: year | +| | hosts=1 per-host-mem=unavailable | +| | tuple-ids=1 row-size=4B cardinality=3 | +| | | +| 00:UNION | +| constant-operands=3 | +| hosts=1 per-host-mem=unavailable | +| tuple-ids=0 row-size=4B cardinality=3 | ++-----------------------------------------------------+ + +-- The same optimization applies to other aggregation queries +-- that only return values based on partition key columns: +-- MIN, MAX, COUNT(DISTINCT), and so on. 
+explain select min(year) from t1; ++-----------------------------------------------------+ +| Explain String | ++-----------------------------------------------------+ +| Estimated Per-Host Requirements: Memory=0B VCores=0 | +| | +| F00:PLAN FRAGMENT [UNPARTITIONED] | +| 01:AGGREGATE [FINALIZE] | +| | output: min(year) | +| | hosts=1 per-host-mem=unavailable | +| | tuple-ids=1 row-size=4B cardinality=1 | +| | | +| 00:UNION | +| constant-operands=3 | +| hosts=1 per-host-mem=unavailable | +| tuple-ids=0 row-size=4B cardinality=3 | ++-----------------------------------------------------+ +</codeblock> + + </conbody> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_parquet_annotate_strings_utf8.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_parquet_annotate_strings_utf8.xml b/docs/topics/impala_parquet_annotate_strings_utf8.xml new file mode 100644 index 0000000..cd5b578 --- /dev/null +++ b/docs/topics/impala_parquet_annotate_strings_utf8.xml @@ -0,0 +1,50 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="parquet_annotate_strings_utf8" rev="2.6.0 IMPALA-2069"> + + <title>PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (CDH 5.8 or higher only)</title> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Parquet"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="2.6.0 IMPALA-2069"> + <indexterm audience="Cloudera">PARQUET_ANNOTATE_STRINGS_UTF8 query option</indexterm> + Causes Impala <codeph>INSERT</codeph> and <codeph>CREATE TABLE AS SELECT</codeph> statements + to write Parquet files that use the UTF-8 annotation for <codeph>STRING</codeph> columns. 
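+ </p>
+
+ <p>
+ For example, a session might enable the option before writing a Parquet table
+ (table names are illustrative):
+ </p>
+
+<codeblock>set PARQUET_ANNOTATE_STRINGS_UTF8=1;
+create table annotated_parquet stored as parquet as select * from text_table;</codeblock>
+
+ <p>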
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+    <p>
+      By default, Impala represents a <codeph>STRING</codeph> column in Parquet as an unannotated binary field.
+    </p>
+    <p>
+      Impala always uses the UTF-8 annotation when writing <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph>
+      columns to Parquet files. An alternative to using the query option is to cast <codeph>STRING</codeph>
+      values to <codeph>VARCHAR</codeph>.
+    </p>
+    <p>
+      This option helps make data written by Impala more interoperable with other data processing engines.
+      Impala itself currently does not support all operations on UTF-8 data.
+      Although data processed by Impala is typically represented in ASCII, it is valid to designate the
+      data as UTF-8 when storing it on disk, because ASCII is a subset of UTF-8.
+    </p>
+    <p conref="../shared/impala_common.xml#common/type_boolean"/>
+    <p conref="../shared/impala_common.xml#common/default_false_0"/>
+
+    <p conref="../shared/impala_common.xml#common/added_in_260"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+    <p>
+      <xref href="impala_parquet.xml#parquet"/>
+    </p>
+
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_parquet_fallback_schema_resolution.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_parquet_fallback_schema_resolution.xml b/docs/topics/impala_parquet_fallback_schema_resolution.xml
new file mode 100644
index 0000000..06b1a28
--- /dev/null
+++ b/docs/topics/impala_parquet_fallback_schema_resolution.xml
@@ -0,0 +1,49 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="parquet_fallback_schema_resolution" rev="2.6.0 IMPALA-2835 CDH-33330">
+
+  <title>PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (CDH 5.8 or higher only)</title>
+  <titlealts
audience="PDF"><navtitle>PARQUET_FALLBACK_SCHEMA_RESOLUTION</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Query Options"/>
+      <data name="Category" value="Parquet"/>
+      <data name="Category" value="Schemas"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p rev="2.6.0 IMPALA-2835 CDH-33330">
+      <indexterm audience="Cloudera">PARQUET_FALLBACK_SCHEMA_RESOLUTION query option</indexterm>
+      Allows Impala to look up columns within Parquet files by column name, rather than column order,
+      when necessary.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+    <p>
+      By default, Impala looks up columns within a Parquet file based on
+      the order of columns in the table.
+      The <codeph>name</codeph> setting for this option enables behavior for Impala queries
+      similar to the Hive setting <codeph>parquet.column.index.access=false</codeph>.
+      It also allows Impala to query Parquet files created by Hive with the
+      <codeph>parquet.column.index.access=false</codeph> setting in effect.
+    </p>
+
+    <p>
+      <b>Type:</b> integer or string.
+      Allowed values are 0 or <codeph>position</codeph> (default), 1 or <codeph>name</codeph>.
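+      For example, either the numeric or the string form switches a session to
+      name-based column resolution (a minimal sketch of an
+      <cmdname>impala-shell</cmdname> session):
+<codeblock>
+set parquet_fallback_schema_resolution=name;
+-- Subsequent queries match Parquet columns by name, not position.
+</codeblock>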
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/added_in_260"/>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+    <p>
+      <xref href="impala_parquet.xml#parquet_schema_evolution"/>
+    </p>
+
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_perf_ddl.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_perf_ddl.xml b/docs/topics/impala_perf_ddl.xml
new file mode 100644
index 0000000..d075cd2
--- /dev/null
+++ b/docs/topics/impala_perf_ddl.xml
@@ -0,0 +1,42 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="perf_ddl">
+
+  <title>Performance Considerations for DDL Statements</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Performance"/>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="DDL"/>
+      <data name="Category" value="SQL"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      These tips and guidelines apply to the Impala DDL statements, which are listed in
+      <xref href="impala_ddl.xml#ddl"/>.
+    </p>
+
+    <p>
+      Because Impala DDL statements operate on the metastore database, the performance considerations for those
+      statements are completely different from those for distributed queries that operate on HDFS
+      <ph rev="2.2.0">or S3</ph> data files, or on HBase tables.
+    </p>
+
+    <p>
+      Each DDL statement makes a relatively small update to the metastore database. The overhead for each statement
+      is proportional to the overall number of Impala and Hive tables, and (for a partitioned table) to the overall
+      number of partitions in that table. Issuing large numbers of DDL statements (such as one for each table or
+      one for each partition) also has the potential to encounter a bottleneck with access to the metastore
+      database.
Therefore, for efficient DDL, try to design your application logic and ETL pipeline to avoid a huge
+      number of tables and a huge number of partitions within each table. In this context, <q>huge</q> is in the
+      range of tens of thousands or hundreds of thousands.
+    </p>
+
+    <note conref="../shared/impala_common.xml#common/add_partition_set_location"/>
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_prefetch_mode.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_prefetch_mode.xml b/docs/topics/impala_prefetch_mode.xml
new file mode 100644
index 0000000..30dd116
--- /dev/null
+++ b/docs/topics/impala_prefetch_mode.xml
@@ -0,0 +1,49 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="prefetch_mode" rev="2.6.0 IMPALA-3286">
+
+  <title>PREFETCH_MODE Query Option (CDH 5.8 or higher only)</title>
+  <titlealts audience="PDF"><navtitle>PREFETCH_MODE</navtitle></titlealts>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Query Options"/>
+      <data name="Category" value="Performance"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p rev="2.6.0 IMPALA-3286">
+      <indexterm audience="Cloudera">PREFETCH_MODE query option</indexterm>
+      Determines whether the prefetching optimization is applied during
+      join query processing.
+    </p>
+
+    <p>
+      <b>Type:</b> numeric (0, 1)
+      or corresponding mnemonic strings (<codeph>NONE</codeph>, <codeph>HT_BUCKET</codeph>).
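+      For example, either form could turn off the prefetching optimization for a
+      session, to compare join performance with and without it (a sketch):
+<codeblock>
+set prefetch_mode=NONE;
+-- or, equivalently, using the numeric form:
+set prefetch_mode=0;
+</codeblock>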
+    </p>
+
+    <p>
+      <b>Default:</b> 1 (equivalent to <codeph>HT_BUCKET</codeph>)
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/added_in_260"/>
+
+    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
+    <p>
+      The default mode is 1, which means that hash table buckets are
+      prefetched during join query processing.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/related_info"/>
+    <p>
+      <xref href="impala_joins.xml#joins"/>,
+      <xref href="impala_perf_joins.xml#perf_joins"/>.
+    </p>
+
+  </conbody>
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_query_lifetime.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_query_lifetime.xml b/docs/topics/impala_query_lifetime.xml
new file mode 100644
index 0000000..2f46d21
--- /dev/null
+++ b/docs/topics/impala_query_lifetime.xml
@@ -0,0 +1,31 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="query_lifetime">
+
+  <title>Impala Query Lifetime</title>
+
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Concepts"/>
+      <data name="Category" value="Querying"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      Impala queries progress through a series of stages from the time they are initiated to the time
+      they are completed. A query can also be cancelled before it is entirely finished, either
+      because of an explicit cancellation, or because of a timeout, an out-of-memory condition, or
+      some other error. Understanding the query lifecycle can help you manage the throughput and
+      resource usage of Impala queries, especially in a high-concurrency or multi-workload environment.
+    </p>
+
+    <p outputclass="toc"/>
+  </conbody>
+
+
+</concept>

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/bb88fdc0/docs/topics/impala_relnotes.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_relnotes.xml b/docs/topics/impala_relnotes.xml
new file mode 100644
index 0000000..5c53a21
--- /dev/null
+++ b/docs/topics/impala_relnotes.xml
@@ -0,0 +1,34 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="relnotes" audience="standalone">
+
+  <title>Impala Release Notes</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Release Notes"/>
+      <data name="Category" value="Administrators"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody id="relnotes_intro">
+
+    <p>
+      These release notes provide information on the <xref href="impala_new_features.xml#new_features">new
+      features</xref> and <xref href="impala_known_issues.xml#known_issues">known issues and limitations</xref> for
+      Impala versions up to <ph conref="../shared/ImpalaVariables.xml#impala_vars/ReleaseVersion"/>. For users
+      upgrading from earlier Impala releases, or using Impala in combination with specific versions of other
+      Cloudera software, <xref href="impala_incompatible_changes.xml#incompatible_changes"/> lists any changes to
+      file formats, SQL syntax, or software dependencies to take into account.
+    </p>
+
+    <p>
+      After you review these release notes, see
+      <xref audience="integrated" href="impala.xml"/><xref audience="standalone" href="http://www.cloudera.com/documentation/enterprise/latest/topics/impala.html" scope="external" format="html"/>
+      for more information about using Impala.
+    </p>
+
+    <p outputclass="toc"/>
+  </conbody>
+</concept>
