[ 
https://issues.apache.org/jira/browse/HIVE-21469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16799772#comment-16799772
 ] 

Hive QA commented on HIVE-21469:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12963513/HIVE-21469.7.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 15806 tests 
executed
*Failed tests:*
{noformat}
TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file (likely timed 
out) (batchId=167)
        
[mapjoin_emit_interval.q,vector_auto_smb_mapjoin_14.q,deleteAnalyze.q,database.q,vector_bround.q,smb_mapjoin_6.q,vector_reduce_groupby_decimal.q,vectorized_dynamic_partition_pruning.q,cbo_views.q,vectorization_part.q,schema_evol_text_vecrow_part_all_primitive_llap_io.q,dynamic_partition_pruning.q,cte_mat_1.q,cluster.q,vector_char_mapjoin1.q,schema_evol_text_nonvec_part_llap_io.q,cte_5.q,subquery_shared_alias.q,vector_decimal_2.q,vector_outer_reference_windowed.q,bucketmapjoin7.q,vector_string_concat.q,subquery_nested_subquery.q,limit_pushdown3.q,vector_outer_join2.q,vectorized_context.q,metadata_only_queries.q,union6.q,cbo_subq_in.q,vectorized_timestamp_funcs.q]
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/16655/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16655/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16655/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12963513 - PreCommit-HIVE-Build

> Review of ZooKeeperHiveLockManager
> ----------------------------------
>
>                 Key: HIVE-21469
>                 URL: https://issues.apache.org/jira/browse/HIVE-21469
>             Project: Hive
>          Issue Type: Improvement
>          Components: Locking
>    Affects Versions: 4.0.0, 3.2.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>         Attachments: HIVE-21469.1.patch, HIVE-21469.2.patch, 
> HIVE-21469.3.patch, HIVE-21469.4.patch, HIVE-21469.5.patch, 
> HIVE-21469.6.patch, HIVE-21469.7.patch
>
>
> A lot of sins in this class to resolve:
> {code:java}
>   @Override
>   public void setContext(HiveLockManagerCtx ctx) throws LockException {
>  try {
>       curatorFramework = CuratorFrameworkSingleton.getInstance(conf);
>       parent = conf.getVar(HiveConf.ConfVars.HIVE_ZOOKEEPER_NAMESPACE);
>       try{
>         curatorFramework.create().withMode(CreateMode.PERSISTENT).forPath("/" 
> +  parent, new byte[0]);
>       } catch (Exception e) {
>         // ignore if the parent already exists
>         if (!(e instanceof KeeperException) || ((KeeperException)e).code() != 
> KeeperException.Code.NODEEXISTS) {
>           LOG.warn("Unexpected ZK exception when creating parent node /" + 
> parent, e);
>         }
>       }
> {code}
> Every time a new session is created and this {{setContext}} method is called, 
> it attempts to create the root node.  I have seen that, even though the root 
> node exists, an create node action is written to the ZK logs.  Check first if 
> the node exists before trying to create it.
> {code:java}
>       try {
>         curatorFramework.delete().forPath(zLock.getPath());
>       } catch (InterruptedException ie) {
>         curatorFramework.delete().forPath(zLock.getPath());
>       }
> {code}
> There has historically been a quite a few bugs regarding leaked locks.  The 
> Driver will signal the session {{Thread}} by performing an interrupt.  That 
> interrupt can happen any time and it can kill a create/delete action within 
> the ZK framework.  We can see one example of workaround for this.  If the ZK 
> action is interrupted, simply do it again.  Well, what if it's interrupted 
> yet again?  The lock will be leaked.  Also, when the {{InterruptedException}} 
> is caught in the try block, the thread's interrupted flag is cleared.  The 
> flag is not reset in this code and therefore we lose the fact that this 
> thread has been interrupted.  This flag should be preserved so that other 
> code paths will know that it's time to exit.
> {code:java}
>         if (tryNum > 1) {
>           Thread.sleep(sleepTime);
>         }
>         unlockPrimitive(hiveLock, parent, curatorFramework);
>         break;
>       } catch (Exception e) {
>         if (tryNum >= numRetriesForUnLock) {
>           String name = ((ZooKeeperHiveLock)hiveLock).getPath();
>           throw new LockException("Node " + name + " can not be deleted after 
> " + numRetriesForUnLock + " attempts.",
>               e);
>         }
>       }
> {code}
> ... related... the sleep here may be interrupted, but we still need to delete 
> the lock (again, for fear of leaking it).  This sleep should be 
> uninterruptible.  If we need to get the lock deleted, and there's a problem, 
> interrupting the sleep will cause the code to eventually exit and locks will 
> be leaked.
> It also requires a bunch more TLC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to