This is an automated email from the ASF dual-hosted git repository. asherman pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/impala.git
commit c353d69cbdf3141509382fb74dd141f8a936fba0 Author: Shajini Thayasingh <[email protected]> AuthorDate: Fri Mar 24 09:40:10 2023 -0700 IMPALA-11985: [DOCS] Support for Kudu's multi-rows transaction Fixed some typos and made final changes. Clarified some questions that were raised as comments. Incorporated some minor comments. Documented the support for Kudu's multi-rows transaction. Change-Id: Ic226679d83d7221f843994ead11cb2bc9e971882 Reviewed-on: http://gerrit.cloudera.org:8080/19651 Tested-by: Impala Public Jenkins <[email protected]> Reviewed-by: Alexey Serbin <[email protected]> Reviewed-by: Wenzhe Zhou <[email protected]> --- docs/topics/impala_kudu.xml | 133 ++++++++++++++++++++++++++++++-------------- 1 file changed, 90 insertions(+), 43 deletions(-) diff --git a/docs/topics/impala_kudu.xml b/docs/topics/impala_kudu.xml index 8f9fbf194..3c2ceee96 100644 --- a/docs/topics/impala_kudu.xml +++ b/docs/topics/impala_kudu.xml @@ -1388,6 +1388,65 @@ kudu.table_name | impala::some_database.table_name_demo </conbody> </concept> + <concept id="multi_rows_transaction"> + <title>Multi-row Transactions for Kudu Tables</title> + <conbody> + <p>When you use Impala to query Kudu tables, you can insert multiple rows into a Kudu table + in a single transaction. This broader transactional support between Kudu and Impala is + available at both the query level and the session level.</p></conbody> + </concept> + <concept id="using_multi_row_transaction"> + <title>Using Multi-row Transaction Capability</title> + <conbody> + <p>You can control the multi-row transaction feature with the following query option, which + you can set at the query level or at the session level. 
When the option is enabled for a + session, Impala opens one Kudu transaction for each INSERT or CTAS statement.</p> + <codeblock>set ENABLE_KUDU_TRANSACTION=true;</codeblock> + <p>The following example shows how to insert three rows into a table in a single + transaction.</p> + <p><b>Example:</b></p> + <p><ol> + <li>Create the table kudu_test_tbl_1. + <codeblock>create table kudu_test_tbl_1 (a int primary key, b string) partition by hash(a) partitions 8 stored as kudu;</codeblock></li> + <li>Enable the multi-row transaction feature at the query + level.<codeblock>set ENABLE_KUDU_TRANSACTION=true;</codeblock></li> + <li>Insert three rows into the newly created table in a single transaction. + <codeblock>insert into kudu_test_tbl_1 values (0, 'a'), (1, 'b'), (2, 'c');</codeblock></li> + <li>Verify the number of rows in this table. + <codeblock>select count(*) from kudu_test_tbl_1;</codeblock></li> + </ol></p> + <p><b>Note:</b></p> + <p>If you insert multiple rows with duplicate keys into a table, the transaction is aborted. + To ignore duplicate-key conflicts during the transaction, start Impala daemons with the + flag <codeph>--kudu_ignore_conflicts_in_transaction=true</codeph>. This flag is false by + default, and it takes effect only if the flag <codeph>--kudu_ignore_conflicts</codeph> is + set to true, which it is by default.</p> + <p>When you enable the option <codeph>ENABLE_KUDU_TRANSACTION</codeph>, each Impala statement + is executed in a newly opened transaction. If the statement executes successfully, + the Impala Coordinator commits the transaction. 
If Kudu returns an error, + Impala aborts the transaction.</p> + <p>This applies to the following statements:</p> + <p><ul> + <li>INSERT</li> + <li>CREATE TABLE AS SELECT</li> + </ul></p> + </conbody> + </concept> + <concept id="advantages"> + <title>Advantages of Using This Capability</title> + <conbody> + <p>You can more easily build and manage Kudu applications, especially when Impala is used to + interact with the data in a Kudu table. With multi-row transactions, you can atomically + ingest a large number of rows into a Kudu table with an INSERT ... SELECT or CTAS + statement.</p></conbody> + </concept> + <concept id="limitation"> + <title>Limitation</title> + <conbody> + <p>INSERT and CTAS statements are supported for Kudu tables in the context of a multi-row + transaction, but UPDATE, UPSERT, and DELETE statements are not currently supported in + multi-row transactions.</p></conbody> + </concept> <concept id="kudu_consistency"> @@ -1395,49 +1454,37 @@ kudu.table_name | impala::some_database.table_name_demo <conbody> - <p> - Kudu tables have consistency characteristics such as uniqueness, controlled by the - primary key columns, and non-nullable columns. The emphasis for consistency is on - preventing duplicate or incomplete data from being stored in a table. - </p> - - <p> - Currently, Kudu does not enforce strong consistency for order of operations, total - success or total failure of a multi-row statement, or data that is read while a write - operation is in progress. Changes are applied atomically to each row, but not applied - as a single unit to all rows affected by a multi-row DML statement. That is, Kudu does - not currently have atomic multi-row statements or isolation between statements. - </p> - - <p> - If some rows are rejected during a DML operation because of a mismatch with duplicate - primary key values, <codeph>NOT NULL</codeph> constraints, and so on, the statement - succeeds with a warning. 
Impala still inserts, deletes, or updates the other rows that - are not affected by the constraint violation. - </p> - - <p> - Consequently, the number of rows affected by a DML operation on a Kudu table might be - different than you expect. - </p> - - <p> - Because there is no strong consistency guarantee for information being inserted into, - deleted from, or updated across multiple tables simultaneously, consider denormalizing - the data where practical. That is, if you run separate <codeph>INSERT</codeph> - statements to insert related rows into two different tables, one <codeph>INSERT</codeph> - might fail while the other succeeds, leaving the data in an inconsistent state. Even if - both inserts succeed, a join query might happen during the interval between the - completion of the first and second statements, and the query would encounter incomplete - inconsistent data. Denormalizing the data into a single wide table can reduce the - possibility of inconsistency due to multi-table operations. - </p> - - <p> - Information about the number of rows affected by a DML operation is reported in - <cmdname>impala-shell</cmdname> output, and in the <codeph>PROFILE</codeph> output, but - is not currently reported to HiveServer2 clients such as JDBC or ODBC applications. - </p> + <p>Kudu tables have consistency characteristics such as uniqueness, controlled by the primary + key columns, and non-nullable columns. The emphasis for consistency is on preventing + duplicate or incomplete data from being stored in a table.</p> + + <p>Currently, Kudu does not enforce strong consistency for order of operations or for data + that is read while a write operation is in progress. If multi-row transactions are enabled, + inserting multiple rows in a single INSERT statement is atomic, that is, total success or + total failure. 
But if multi-row transactions are not enabled, changes are applied atomically + to each row, not as a single unit to all rows affected by a multi-row DML statement.</p> + + <p>When multi-row transactions are not enabled, if some rows are rejected during a DML + operation because of duplicate primary key values, <codeph>NOT NULL</codeph> constraint + violations, and so on, the statement succeeds with a warning. Impala still inserts, + deletes, or updates the other rows that are not affected by the constraint violation.</p> + + <p>Consequently, the number of rows affected by a DML operation on a Kudu table might be + different from what you expect.</p> + + <p>Because there is no strong consistency guarantee for information inserted with separate + INSERT statements, deleted, or updated across multiple tables simultaneously, consider + denormalizing the data where practical. That is, if you run separate + <codeph>INSERT</codeph> statements to insert related rows into two different tables, one + <codeph>INSERT</codeph> might fail while the other succeeds, leaving the data in an + inconsistent state. Even if both inserts succeed, a join query might run in the interval + between the completion of the first and second statements, and that query would encounter + incomplete, inconsistent data. Denormalizing the data into a single wide table can reduce + the possibility of inconsistency due to multi-table operations.</p> + + <p>Information about the number of rows affected by a DML operation is reported in + <cmdname>impala-shell</cmdname> output, and in the <codeph>PROFILE</codeph> output, but is + not currently reported to HiveServer2 clients such as JDBC or ODBC applications.</p> </conbody>
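The atomic (all-or-nothing) versus row-by-row semantics that the documentation above distinguishes can be illustrated outside of Impala. The sketch below uses Python's built-in sqlite3 module purely as an analogy; it is not Impala or Kudu client code, and SQLite's `INSERT OR IGNORE` merely stands in for the conflict-skipping behavior the docs describe when the transaction option is off.

```python
import sqlite3

# Analogy only: SQLite standing in for the semantics described above;
# this is not Impala or Kudu client code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER PRIMARY KEY, b TEXT)")
conn.execute("INSERT INTO t VALUES (0, 'a')")
conn.commit()

# Transactional (multi-row) behavior: one duplicate key aborts the whole batch.
try:
    with conn:  # opens a transaction; rolls back on exception
        conn.executemany("INSERT INTO t VALUES (?, ?)",
                         [(1, 'b'), (2, 'c'), (0, 'dup')])
except sqlite3.IntegrityError:
    pass  # the duplicate key (0, 'dup') aborted the batch
txn_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(txn_count)  # 1: the whole batch was rolled back as a unit

# Row-by-row behavior (analogous to the option being off with conflicts
# ignored): the conflicting row is skipped, the rest are applied.
for row in [(1, 'b'), (2, 'c'), (0, 'dup')]:
    conn.execute("INSERT OR IGNORE INTO t VALUES (?, ?)", row)
conn.commit()
final_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(final_count)  # 3: rows 1 and 2 were applied, the duplicate skipped
```

The point of the contrast: with the transaction enabled, a conflict leaves the table untouched; without it, the statement "succeeds with a warning" and the row count can differ from what you expect.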
