[ https://issues.apache.org/jira/browse/HIVE-11317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658546#comment-14658546 ]
Alan Gates commented on HIVE-11317:
-----------------------------------
Why did you decide to go with a separate thread rather than integrating this
with the initiator or the cleaner? The functionality here is pretty simple and
it seems like it would be easy to integrate with either of those.
TxnHandler line 1730 (in heartbeatTxn): you added code to check whether the
heartbeat failed because the txn was already committed. A comment making clear
what you're checking for here would be helpful.
TxnHandler, new method performTimeouts: you run a query with a hard-coded
limit (of 2500) and then have a do{}while loop that adds those values to the
list to be deleted until you've reached your batch size. Once you reach the
batch size you call abortTxns, and then go rerun the query. So why have both
the limit clause and the do/while loop? Why not just ask up front for the
number of entries in a batch with the limit clause?
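To illustrate the simpler shape being asked about, here is a minimal sketch, not the code in the patch. It assumes the query can take the batch size directly as its limit and that aborting a batch removes those txns from the next query's result. TimeoutBatcher, abortInBatches, fetchTimedOut and the abortTxns callback are hypothetical names used only for illustration; the real TxnHandler signatures may differ.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.IntFunction;

final class TimeoutBatcher {
    /**
     * Fetches at most batchSize timed-out txn ids per round and aborts them,
     * stopping when a round comes back short or empty.
     *
     * fetchTimedOut stands in for a "SELECT ... LIMIT batchSize" query
     * (hypothetical); abortTxns stands in for the abort call and is assumed
     * to change the txns' state so they are not returned again.
     *
     * @return the number of batches that were aborted
     */
    static int abortInBatches(IntFunction<List<Long>> fetchTimedOut,
                              Consumer<List<Long>> abortTxns,
                              int batchSize) {
        int batches = 0;
        while (true) {
            // The limit is the batch size itself, so no inner loop is needed.
            List<Long> ids = fetchTimedOut.apply(batchSize);
            if (ids.isEmpty()) {
                break;
            }
            abortTxns.accept(ids);
            batches++;
            if (ids.size() < batchSize) {
                // A short batch means nothing is left to time out.
                break;
            }
        }
        return batches;
    }
}
```

Each round is one query plus one abort call, which also keeps every metastore DB transaction bounded by the batch size.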
Tests in general: I have found tests that rely on sleeps to be flaky. They
will usually work locally, but placed on an EC2 box as part of the auto-patch
testing they fail because the box is so busy the timeouts are no longer large
enough. In the other compactor threads I've put in flags to make sure the
thread ran once rather than relying on timeouts. This has produced much more
reliable results.
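A minimal sketch of the "ran once" flag idea, under the assumption that the background thread can expose a latch that the test waits on. RunOnceWorker and awaitFirstPass are hypothetical names, not the API of the existing compactor threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class RunOnceWorker implements Runnable {
    private final Runnable work;
    // Released after the first complete pass of the loop.
    private final CountDownLatch ranOnce = new CountDownLatch(1);
    private volatile boolean stop;

    RunOnceWorker(Runnable work) {
        this.work = work;
    }

    @Override
    public void run() {
        while (!stop) {
            work.run();
            ranOnce.countDown(); // no-op after the first pass
            try {
                Thread.sleep(10); // pause between passes
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    /** Blocks until the first pass has finished; the timeout is only a safety net. */
    boolean awaitFirstPass(long timeout, TimeUnit unit) throws InterruptedException {
        return ranOnce.await(timeout, unit);
    }

    void stop() {
        stop = true;
    }
}
```

A test then starts the thread, calls awaitFirstPass with a generous upper bound, and asserts on the result. On a busy box the wait simply takes longer instead of failing, because the test proceeds as soon as the pass completes rather than after a fixed sleep.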
> ACID: Improve transaction Abort logic due to timeout
> ----------------------------------------------------
>
> Key: HIVE-11317
> URL: https://issues.apache.org/jira/browse/HIVE-11317
> Project: Hive
> Issue Type: Bug
> Components: Metastore, Transactions
> Affects Versions: 1.0.0
> Reporter: Eugene Koifman
> Assignee: Eugene Koifman
> Labels: triage
> Attachments: HIVE-11317.2.patch, HIVE-11317.patch
>
>
> The logic to abort transactions that have stopped heartbeating is in
> TxnHandler.timeOutTxns().
> This is only called when DbTxnManager.getValidTxns() is called.
> So if there are a lot of txns that need to be timed out and there are no
> SQL clients talking to the system, there is nothing to abort dead
> transactions, and thus compaction can't clean them up, so garbage accumulates
> in the system.
> Also, the streaming API doesn't call DbTxnManager at all.
> Need to move this logic into Initiator (or some other metastore-side thread).
> Also, make sure it is broken up into multiple small(er) transactions against
> the metastore DB.
> Also move timeOutLocks() there as well.
> See about adding a TXNS.COMMENT field, which can be used for "Auto aborted due
> to timeout", for example.