[
https://issues.apache.org/jira/browse/HADOOP-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hemanth Yamijala updated HADOOP-3523:
-------------------------------------
Attachment: 3523.patch
The attached patch fixes the issue described above. We now check whether the
exit code from qstat indicates that the job id is invalid (exit code 153) and
treat that case as equivalent to a completed job. As a result, a previously
allocated cluster whose cluster id is no longer present in Torque will
continue to be auto-deallocated and allocated again.
However, if any other Torque error occurs, we treat it as an unknown case and
leave the deallocation to the user.
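
The decision described above can be sketched roughly as follows. This is only
an illustration and not the code in 3523.patch: the function names, the
status strings, and the use of Torque's job state 'C' for a completed job are
assumptions made for the example. Only the exit code 153 comes from this
issue.

```python
# Hypothetical sketch of the decision logic; names are illustrative and are
# not taken from HOD or from 3523.patch.

# Exit code that qstat returned for an invalid (unknown) job id in this issue.
QSTAT_UNKNOWN_JOB_ID = 153


def classify_qstat_result(exit_code, job_state=None):
    """Map a qstat exit code (and job state, if any) to a coarse status.

    Returns one of 'completed', 'active', or 'unknown'.
    """
    if exit_code == QSTAT_UNKNOWN_JOB_ID:
        # Torque no longer knows the job: treat it like a completed job.
        return "completed"
    if exit_code != 0:
        # Any other Torque error: do not guess.
        return "unknown"
    # Assumption: 'C' is the state Torque reports for a completed job.
    return "completed" if job_state == "C" else "active"


def can_auto_deallocate(exit_code, job_state=None):
    """Only a job considered completed may be cleaned up automatically."""
    return classify_qstat_result(exit_code, job_state) == "completed"
```

With this shape, can_auto_deallocate(153) is true, while any other non-zero
exit code leaves the cleanup to the user.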
> [HOD] If a job does not exist in Torque's list of jobs, HOD allocate on
> previously allocated directory fails.
> -------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-3523
> URL: https://issues.apache.org/jira/browse/HADOOP-3523
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/hod
> Affects Versions: 0.18.0
> Reporter: Hemanth Yamijala
> Assignee: Hemanth Yamijala
> Priority: Blocker
> Fix For: 0.18.0
>
> Attachments: 3523.patch
>
>
> HADOOP-3483 addressed the issue where a dead cluster could be reallocated
> without having to issue warnings to users to clean up the directory
> themselves, provided the job is completed. It missed one case, where the job
> no longer exists in the Torque queue. When allocation is attempted in that
> case, HOD fails with an unhelpful error message:
> ERROR - qstat error: exit code: 153 | signal: False | core False
> CRITICAL - op: allocate hod-clusters/test 3 failed: <type
> 'exceptions.TypeError'> 'NoneType' object is unsubscriptable
> This should be addressed to avoid user concerns.