John Sherman created HIVE-26472:
-----------------------------------
Summary: Concurrent UPDATEs can cause duplicate rows
Key: HIVE-26472
URL: https://issues.apache.org/jira/browse/HIVE-26472
Project: Hive
Issue Type: Bug
Components: HiveServer2
Affects Versions: 4.0.0-alpha-1
Reporter: John Sherman
Attachments: debug.diff
Concurrent UPDATEs to the same table can cause duplicate rows when the
following occurs:
Two UPDATEs get assigned txnIds and writeIds like this:
UPDATE #1 = txnId: 100 writeId: 50 <--- commits first
UPDATE #2 = txnId: 101 writeId: 49
To replicate the issue:
I applied the attached debug.diff patch, which adds hive.lock.sleep.writeid
(how long to sleep before acquiring a writeId) and
hive.lock.sleep.post.writeid (how long to sleep after acquiring a writeId).
{code:sql}
CREATE TABLE test_update(i int) STORED AS ORC
TBLPROPERTIES('transactional'='true');
INSERT INTO test_update VALUES (1);

-- Start two beeline connections.

-- In connection #1, run:
set hive.driver.parallel.compilation = true;
set hive.lock.sleep.writeid=5s;
update test_update set i = 1 where i = 1;

-- Wait one second, then in connection #2, run:
set hive.driver.parallel.compilation = true;
set hive.lock.sleep.post.writeid=10s;
update test_update set i = 1 where i = 1;

-- After both updates complete, test_update is likely to contain two rows.
{code}
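Once both updates have finished, a check along these lines should show whether the duplicate appeared. This is a sketch and not part of the original reproduction; it assumes the table still holds only the single inserted value and relies on ROW__ID, Hive's virtual column for transactional tables, to show which writeId produced each row.
{code:sql}
-- Any value of i returned here exists more than once.
SELECT i, count(*) AS copies
FROM test_update
GROUP BY i
HAVING count(*) > 1;

-- Show the ACID row identifiers (writeid, bucketid, rowid) of the surviving rows.
SELECT ROW__ID, i FROM test_update;
{code}
If the race occurred, the first query should return one row, and the second should list two rows with different writeIds.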
HIVE-24211 seems to address the case when:
UPDATE #1 = txnId: 100 writeId: 50
UPDATE #2 = txnId: 101 writeId: 49 <--- commits first (I think this causes
UPDATE #1 to detect that its snapshot is out of date, because commitedTxn >
UPDATE #1's txnId)
A possible workaround is to set hive.driver.parallel.compilation = false, but
this only helps when there is a single HS2 instance.