[ 
https://issues.apache.org/jira/browse/HADOOP-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603851#action_12603851
 ] 

Karam Singh commented on HADOOP-3523:
-------------------------------------

To check the issue, I did the following:
1. 
   a. Allocated a hod cluster with --ringmaster.idleness-limit=240.
   b. Waited for 4 mins.
   c. Verified that the cluster was dead using hod list and qstat.
   d. Restarted torque. Ran qstat to verify that it does not return anything.
   e. Ran hod allocate using hod without the patch on the same cluster dir; hod
throws an error.
   f. Ran hod allocate again using patched hod. Allocation was successful.

2. 
   a. Allocated a hod cluster with --ringmaster.idleness-limit=240.
   b. Waited for 4 mins.
   c. Verified that the cluster was dead using hod list and qstat.
   d. Stopped torque.
   e. Ran hod allocate using hod without the patch on the same cluster dir; hod
throws an error.
   f. Ran hod allocate again using patched hod. The allocation fails with the
following error:
    [
        WARNING/30 torque:96 - qstat error: exit code: 255 | signal: False | core False.
        CRITICAL/50 hod:310 - Found a previously allocated cluster at cluster directory '~/c_dirn'. Deallocate the cluster first.
    ]
3. Also checked hod behavior when hod list shows the cluster as dead/mapred dead/hdfs dead
but the cluster is actually alive (the related torque job status is R).
4. Normal reallocation of a dead cluster.
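
For reference, the outcomes observed in the cases above can be summarized as a small decision sketch. This is only an illustration and not the code in 3523.patch: the function name `can_reallocate`, the constant name, and the use of Torque job state "C" for a completed job are assumptions. The exit code 153 is the one reported in the issue description for a job that is missing from Torque's list, and 255 is the code seen in case 2 with torque stopped.

```python
# Hypothetical sketch of the reallocation decision exercised by the tests above.
# None of these names come from the HOD source; they are illustrative only.

# Exit code reported in the issue description when the job is not in Torque's list.
QSTAT_UNKNOWN_JOB = 153


def can_reallocate(qstat_exit_code, job_state=None):
    """Return True if a previously allocated cluster directory may be reused."""
    if qstat_exit_code == 0:
        # Torque still knows the job. Assumption: only a completed job ("C")
        # is safe to replace; a running job ("R") means the cluster is alive.
        return job_state == "C"
    if qstat_exit_code == QSTAT_UNKNOWN_JOB:
        # Case 1: the job no longer exists in Torque, so the cluster is gone.
        return True
    # Case 2: any other failure (for example 255 with torque stopped) leaves
    # the job status unknown, so the user is asked to deallocate first.
    return False
```

With this sketch, case 1 (exit code 153) allows reallocation, case 2 (exit code 255) refuses it, case 3 (job state R) refuses it, and case 4 (completed job) allows it.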

> [HOD] If a job does not exist in Torque's list of jobs, HOD allocate on 
> previously allocated directory fails.
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3523
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3523
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>    Affects Versions: 0.18.0
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>            Priority: Blocker
>             Fix For: 0.18.0
>
>         Attachments: 3523.patch
>
>
> HADOOP-3483 addressed the issue where a dead cluster could be reallocated 
> without having to issue warnings to users to clean up the directory 
> themselves, provided the job is completed. It missed one case, where the job 
> no longer exists in the Torque queue. When tried in that case, HOD fails with 
> a bad error message:
> ERROR - qstat error: exit code: 153 | signal: False | core False
> CRITICAL - op: allocate hod-clusters/test 3 failed: <type 'exceptions.TypeError'> 'NoneType' object is unsubscriptable
> This should be addressed to avoid user concerns.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
