[GitHub] [iceberg] pPanda-beta opened a new issue #2301: Lock remains forever if HiveTableOperations gets killed (direct process shutdown - no signals) after lock is acquired

GitBox Sat, 06 Mar 2021 03:00:38 -0800


pPanda-beta opened a new issue #2301:
URL: https://github.com/apache/iceberg/issues/2301

Although the unlock call is kept inside a `finally` block,

https://github.com/apache/iceberg/blob/a7901992c252bb28d10686ef24c1de788c90d663/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L220

### YET INFINITE LOCK MAY HAPPEN

Steps to reproduce :

1. Start a new commit.
2. Let Iceberg acquire the lock.
3. Kill the job (signal 9 / 19) or disconnect the network. (See below for the
actual scenarios where this happened.)
4. Restart the job, which starts a fresh commit.
5. That's it: the job will never be able to acquire a new lock on the same
resource again.

### Let's make it more spicy

Consider the table create operation 🥰, i.e. the actual "table" does
not exist yet, and the job is killed before the table is created.

Steps:

1. Start creating the table.
2. Let it acquire the lock on the non-existent table resource (HMS allows
that).
3. Kill the job before the table is even created.
4. Go cry, because you cannot remove this lock manually using
beeline / hive-cli since the table doesn’t exist.

### What is this "Killed" thingy?

Well, here at Rapido we use GCP preemptible VMs for low cost. On this
kind of infra, a VM may be preempted at any point in time with a very short
notice period (15 sec).
**Consider this as a network cable unplug**. We never get a chance to handle an
InterruptedException and run cleanup operations.

### Why has Hive never faced this issue?
Well, Hive expires its locks, which works as a TTL, so the system recovers
eventually.

### Disaster recovery steps that we have tried so far:
1. Ensure no jobs are running for that table (which may not have been
created yet).
2. Use the Java API to connect to the Hive metastore.
3. Delete all locks for that table.
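
The recovery steps above can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: `LockClient`, `LockEntry` and `InMemoryLockClient` are hypothetical stand-ins, loosely modeled on the "show locks" and "unlock by id" operations a metastore client offers, so that the sketch runs without an HMS.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the manual recovery: list all locks, unlock those on one table. */
final class TableLockCleanup {

  /** Hypothetical view of one lock entry as reported by the metastore. */
  static final class LockEntry {
    final long lockId;
    final String db;
    final String table;

    LockEntry(long lockId, String db, String table) {
      this.lockId = lockId;
      this.db = db;
      this.table = table;
    }
  }

  /** Hypothetical stand-in for the two metastore calls the cleanup needs. */
  interface LockClient {
    List<LockEntry> showLocks();

    void unlock(long lockId);
  }

  /** Releases every lock on db.table and returns the released lock ids. */
  static List<Long> releaseAll(LockClient client, String db, String table) {
    List<Long> released = new ArrayList<>();
    for (LockEntry lock : client.showLocks()) {
      // Hive lower-cases identifiers, so compare case-insensitively.
      if (db.equalsIgnoreCase(lock.db) && table.equalsIgnoreCase(lock.table)) {
        client.unlock(lock.lockId);
        released.add(lock.lockId);
      }
    }
    return released;
  }

  /** In-memory fake so the sketch can run without a metastore. */
  static final class InMemoryLockClient implements LockClient {
    private final List<LockEntry> locks = new ArrayList<>();

    void add(LockEntry entry) {
      locks.add(entry);
    }

    @Override
    public List<LockEntry> showLocks() {
      // Return a copy so callers can unlock while iterating.
      return new ArrayList<>(locks);
    }

    @Override
    public void unlock(long lockId) {
      locks.removeIf(l -> l.lockId == lockId);
    }
  }
}
```

Note that this only works safely after step 1: if a job is still running, its lock gets deleted too.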

## Suggestions

1. Before acquiring locks, delete all locks that have not received any
heartbeats in the last 'x' minutes (configurable).
2. After acquiring the lock, keep sending heartbeats to HMS from a separate
daemon thread. This ensures concurrent writes stay protected.
3. After the operation finishes (success or failure), unlock and cancel the
heartbeat thread.
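
To make the three suggestions concrete, here is a rough sketch of the flow. It is not Iceberg or HMS code: `LockTable` is a hypothetical in-memory stand-in for the metastore lock table, and `commitWithLock`, `timeoutMs` and `heartbeatMs` are illustrative names only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class HeartbeatLockSketch {

  /** Hypothetical in-memory lock table: resource -> lock id and last heartbeat. */
  static final class LockTable {
    private static final class Held {
      final long id;
      long lastHeartbeatMs;

      Held(long id, long lastHeartbeatMs) {
        this.id = id;
        this.lastHeartbeatMs = lastHeartbeatMs;
      }
    }

    private final Map<String, Held> held = new HashMap<>();
    private long nextId = 1;

    /** Suggestion 1: drop the lock if it had no heartbeat for more than timeoutMs. */
    synchronized void expireStale(String resource, long nowMs, long timeoutMs) {
      Held h = held.get(resource);
      if (h != null && nowMs - h.lastHeartbeatMs > timeoutMs) {
        held.remove(resource);
      }
    }

    /** Returns a new lock id, or -1 if the resource is already locked. */
    synchronized long tryLock(String resource, long nowMs) {
      if (held.containsKey(resource)) {
        return -1;
      }
      long id = nextId++;
      held.put(resource, new Held(id, nowMs));
      return id;
    }

    synchronized void heartbeat(String resource, long lockId, long nowMs) {
      Held h = held.get(resource);
      if (h != null && h.id == lockId) {
        h.lastHeartbeatMs = nowMs;
      }
    }

    synchronized void unlock(String resource, long lockId) {
      Held h = held.get(resource);
      if (h != null && h.id == lockId) {
        held.remove(resource);
      }
    }
  }

  /** Runs commit under the lock; returns false if a live holder exists. */
  static boolean commitWithLock(
      LockTable locks, String resource, long timeoutMs, long heartbeatMs, Runnable commit) {
    locks.expireStale(resource, System.currentTimeMillis(), timeoutMs);
    long lockId = locks.tryLock(resource, System.currentTimeMillis());
    if (lockId < 0) {
      return false;
    }
    // Suggestion 2: heartbeat from a daemon thread while the commit runs.
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "lock-heartbeat");
      t.setDaemon(true);
      return t;
    });
    pool.scheduleAtFixedRate(
        () -> locks.heartbeat(resource, lockId, System.currentTimeMillis()),
        heartbeatMs, heartbeatMs, TimeUnit.MILLISECONDS);
    try {
      commit.run();
      return true;
    } finally {
      // Suggestion 3: cancel the heartbeat and unlock on success or failure.
      pool.shutdownNow();
      locks.unlock(resource, lockId);
    }
  }
}
```

If the process is killed mid-commit, the `finally` block never runs, but the heartbeat stops too, so the next writer's `expireStale` call clears the orphaned lock once `timeoutMs` has passed. The timeout should be comfortably larger than the heartbeat interval so that a slow but live writer is not evicted.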
