pPanda-beta opened a new issue #2301:
URL: https://github.com/apache/iceberg/issues/2301


   Although unlock is kept inside the finally block,
   
   
https://github.com/apache/iceberg/blob/a7901992c252bb28d10686ef24c1de788c90d663/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L220
   
   ### YET INFINITE LOCK MAY HAPPEN
   
   Steps to reproduce : 
   
   1. Start a new commit
   2. Let Iceberg acquire the lock
   3. Kill the job (signal 9 / 19) or disconnect the network. (See below for 
more about the actual scenarios that happened)
   4. Restart the job, which starts a fresh commit
   5. That's it: the new commit will never be able to acquire a lock on the 
same resource again
   
   ### Let's make it more spicy
   
   Consider the table create operation 🥰, i.e. the actual "table" does 
not exist yet and the job is killed before it is created.
   
   Steps: 
   
   1. Start creating the table
   2. Let it acquire the lock on the non-existent table resource (HMS allows 
that)
   3. Kill the job before the table is even created
   4. Go cry, because you cannot remove this lock manually using 
beeline / hive-cli since the table doesn't exist.
   
   ### What is this "Killed" thingy?
   
   Well, here at Rapido we use GCP preemptible VMs for low cost. On this 
kind of infra, a VM may be preempted at any point in time with a very short 
notice period (15 sec). 
   **Consider this as a network cable unplug**. We will never get a chance to 
catch an InterruptedException and run clean-up operations. 
   
   
   ### Why did Hive never face this issue?
   Well, Hive expires its locks, which works as a TTL. So the system recovers 
eventually.
   
   ### Disaster Recoveries that we tried so far:
   1. Ensure no jobs are running for that table (which may not have been 
created yet)
   2. Use the Java API to connect to the Hive metastore
   3. Delete all locks for that table 
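
The recovery steps above can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: `LockInfo`, `LockClient`, and `TableLockCleaner` are hypothetical names introduced here, and `LockClient` only stands in for the two operations needed from the metastore (list locks, release a lock by id). Against a real HMS, the metastore client's `showLocks` and `unlock` calls would presumably play that role.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal view of one metastore lock; not a Hive class.
final class LockInfo {
  final long lockId;
  final String database;
  final String table;

  LockInfo(long lockId, String database, String table) {
    this.lockId = lockId;
    this.database = database;
    this.table = table;
  }
}

// Hypothetical stand-in for the part of the metastore client used here.
interface LockClient {
  // Must return a snapshot, so that unlocking while iterating is safe.
  List<LockInfo> showLocks();

  void unlock(long lockId);
}

final class TableLockCleaner {
  private TableLockCleaner() {}

  // Releases every lock held on database.table and returns the released ids.
  // Only safe when no job is running against that table (step 1 above).
  static List<Long> releaseAllLocks(LockClient client, String database, String table) {
    List<Long> released = new ArrayList<>();
    for (LockInfo lock : client.showLocks()) {
      boolean sameTable = database.equalsIgnoreCase(lock.database)
          && table.equalsIgnoreCase(lock.table);
      if (sameTable) {
        client.unlock(lock.lockId);
        released.add(lock.lockId);
      }
    }
    return released;
  }
}
```

Note that the match is on database and table name only, so it also works when the table itself was never created.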
   
   
   ## Suggestions
   
   1. Before acquiring locks, delete all locks that have not received any 
heartbeats in the last 'x' minutes (configurable).
   2. After acquiring the lock, keep sending heartbeats to HMS from a separate 
daemon thread. This ensures concurrent writes stay protected.
   3. After the operation finishes (success or failure), unlock and cancel the 
heartbeat thread.