[ 
https://issues.apache.org/jira/browse/IMPALA-12588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931391#comment-17931391
 ] 

ASF subversion and git services commented on IMPALA-12588:
----------------------------------------------------------

Commit 79f51b018f374fb08c5c8e2a52251b6cbe14fd18 in impala's branch 
refs/heads/master from Noemi Pap-Takacs
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=79f51b018 ]

IMPALA-12588: Don't UPDATE rows that already have the desired value

When UPDATEing an Iceberg or Kudu table, we should change as few rows
as possible. In case of Iceberg tables it means writing as few new
data records and delete records as possible.
Therefore, if rows already have the new values we should just ignore
them. One way to achieve this is to add extra predicates, e.g.:

  UPDATE tbl SET k = 3 WHERE i > 4;
    ==>
  UPDATE tbl SET k = 3 WHERE i > 4 AND k IS DISTINCT FROM 3;

So we won't write new data/delete records for the rows that already have
the desired value.

Explanation on how to create extra predicates to filter out these rows:

If there are multiple assignments in the SET list, we can only skip
updating a row if all the mentioned values are already equal.
If either of the values needs to be updated, the entire row does.
Therefore we can think of the SET list as predicates connected with AND
and all of them need to be taken into consideration.
To negate this SET list, we have to negate the individual SET
assignments and connect them with OR.
Then add this new compound predicate to the original where predicates
with an AND (if there were none, just create a WHERE predicate from it).

                AND
              /     \
      original        OR
 WHERE predicate    /    \
                  !a       OR
                         /    \
                       !b     !c

This simple graph illustrates how the where predicate is rewritten.
(Considering an UPDATE statement that sets 3 columns.)
'!a', '!b' and '!c' are the negations of the individual assignments in
the SET list. So the extended WHERE predicate is:
(original WHERE predicate) AND (!a OR !b OR !c)
To handle NULL values correctly, we use IS DISTINCT FROM instead of
simply negating the assignment with operator '!='.

If the assignments contain UDFs, the result might be inconsistent
because of possible non-deterministic values or state in the UDFs,
therefore we should not rewrite the WHERE predicate at all.

Evaluating expressions can be expensive, therefore this optimization
can be limited or switched off entirely using the Query Option
SKIP_UNNEEDED_UPDATES_COL_LIMIT. By default, there is no filtering
if more than 10 assignments are in the SET list.

-------------------------------------------------------------------
Some performance measurements on a tpch lineitem table:

- predicates in HASH join, all updates can be skipped
Q1/[Q2] (Similar, but Q2 adds extra 4 items to the SET list):
update t set t.l_suppkey = s.l_suppkey,
[ t.l_partkey=s.l_partkey,
  t.l_quantity=s.l_quantity,
  t.l_returnflag=s.l_returnflag,
  t.l_shipmode=s.l_shipmode ]
from ice_lineitem t join ice_lineitem s
on t.l_orderkey=s.l_orderkey and t.l_linenumber=s.l_linenumber;

- predicates in HASH join, all rows need to be updated
Q3: update t set
 t.l_suppkey = s.l_suppkey,
 t.l_partkey=s.l_partkey,
 t.l_quantity=s.l_quantity,
 t.l_returnflag=s.l_returnflag,
 t.l_shipmode=concat(s.l_shipmode,' ')
 from ice_lineitem t join ice_lineitem s
 on t.l_orderkey=s.l_orderkey and t.l_linenumber=s.l_linenumber;

- predicates pushed down to the scanner, all rows updated
Q4/[Q5] (Similar, but Q5 adds extra 8 items to the SET ist):
update ice_lineitem set
[ l_suppkey = l_suppkey + 0,
  l_partkey=l_partkey + 0,
  l_quantity=l_quantity,
  l_returnflag=l_returnflag,
  l_tax = l_tax,
  l_discount= l_discount,
  l_comment = l_comment,
  l_receiptdate = l_receiptdate, ]
l_shipmode=concat(l_shipmode,' ');

+=======+============+==========+======+
| Query | unfiltered | filtered | diff |
+=======+============+==========+======+
| Q1    |       4.1s |     1.9s | -54% |
+-------+------------+----------+------+
| Q2    |       4.2s |     2.1s | -50% |
+-------+------------+----------+------+
| Q3    |       4.3s |     4.7s | +10% |
+-------+------------+----------+------+
| Q4    |       3.0s |     3.0s | +0%  |
+-------+------------+----------+------+
| Q5    |       3.1s |     3.1s | +0%  |
+-------+------------+----------+------+

The results show that in the best case (we can skip all rows)
this change can cause significant perf improvement ~50%, since
0 rows were written. See Q1 and Q2.
If the predicates are evaluated in the join operator, but there were
no matches (worst case scenario) we can lose about 10%. (Q3)
If all the predicates can be pushed down to the scanners, the change
does not seem to cause significant difference (~0% in Q4 and Q5)
even if all rows have to be updated.

Testing:
 - Analysis
 - Planner
 - E2E
 - Kudu
 - Iceberg
 - testing the new query option: SKIP_UNNEEDED_UPDATES_COL_LIMIT
Change-Id: I926c80e8110de5a4615a3624a81a330f54317c8b
Reviewed-on: http://gerrit.cloudera.org:8080/22407
Reviewed-by: Zoltan Borok-Nagy <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Don't update rows that already have the desired value
> -----------------------------------------------------
>
>                 Key: IMPALA-12588
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12588
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Noemi Pap-Takacs
>            Priority: Major
>              Labels: impala-iceberg
>
> When UPDATEing an Iceberg table, we should write as few new data records and 
> delete records as possible.
> Therefore, if rows already have the new values we should just ignore them.
> One way to achieve this is to add extra predicates, e.g.:
> {noformat}
>   UPDATE tbl SET k = 3 WHERE i > 4;
>     ==>
>   UPDATE tbl SET k = 3 WHERE i > 4 AND k != 3;
> {noformat}
> So we won't write new data/delete records for the rows that already have the 
> desired value.
> For some cases in can be trickier (e.g. UPDATE FROM), those cases could be 
> handled more easily by the new MERGE statement when we have it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to