[ https://issues.apache.org/jira/browse/HIVE-20506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607602#comment-16607602 ]

Brock Noland commented on HIVE-20506:
-------------------------------------

bq. Is that correct?
Correct!
bq. Rather than extending the timeout, why not just create two separate ones? 
.. We probably don't want to change the meaning of the current timeout for 
backwards compatibility, so maybe we could deprecate the existing one and 
replace it with two new ones?
Hmm, this seems complex. 100% of my HOS customers are complaining about this 
behavior, so I don't see it as a backwards-incompatible change but as an 
improvement on the existing timeout.
bq. Is there any way to avoid creating a YarnClient? I guess this is mitigated 
slightly by the fact that you only create the client if the timeout is triggered
I don't think creating a YarnClient once a minute is a huge deal. We could, of 
course, share a single YarnClient. Perhaps that could be a follow-on patch?
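For illustration, here is a minimal sketch of what that once-a-minute probe 
could look like against the stock YARN client API (the {{isStillQueued}} helper, 
the app-name matching, and the treatment of states are assumptions for this 
sketch, not the patch itself):

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class YarnStateProbe {
  // Hypothetical helper: ask YARN whether the app is still waiting for
  // resources. Creating the client per call is the simple approach; sharing a
  // single client would be the follow-on optimization discussed above.
  public static boolean isStillQueued(Configuration conf, String appName)
      throws IOException, YarnException {
    YarnClient client = YarnClient.createYarnClient();
    client.init(conf);
    client.start();
    try {
      for (ApplicationReport report : client.getApplications()) {
        if (appName.equals(report.getName())) {
          YarnApplicationState state = report.getYarnApplicationState();
          // NEW/SUBMITTED/ACCEPTED mean no AM container yet, i.e. the
          // cluster is full and the RPC connect timeout should not fire.
          return state == YarnApplicationState.NEW
              || state == YarnApplicationState.SUBMITTED
              || state == YarnApplicationState.ACCEPTED;
        }
      }
      return false;
    } finally {
      client.stop();
    }
  }
}
{code}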
bq. would this work on a secure cluster
My test above is on a kerberos + TLS cluster. Worked great.
bq. perhaps we can parse the state from the logs
In the patch we only parse the output after spark-submit has exited so I don't 
think spark-submit sticks around forever like it used to.
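Once spark-submit has exited, pulling the YARN application id out of the 
captured output is a one-regex job; a sketch (the class and method names are 
hypothetical, and the exact log wording is an assumption about the YARN 
client's output):

{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SparkSubmitOutputParser {
  // The YARN client typically logs lines such as
  //   "Application report for application_1536331234567_0042 (state: ACCEPTED)"
  // so matching the application id pattern is enough to find the app.
  private static final Pattern APP_ID = Pattern.compile("(application_\\d+_\\d+)");

  // Returns the first YARN application id in the captured output, or null.
  public static String parseApplicationId(String sparkSubmitOutput) {
    Matcher m = APP_ID.matcher(sparkSubmitOutput);
    return m.find() ? m.group(1) : null;
  }
}
{code}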
bq. Can we move all the changes in RpcServer to a separate class? That class is 
really meant to act as a generic RPC framework that is relatively independent 
of the HoS logic
We could move the {{YarnApplicationStateFinder}} out of there, but otherwise I 
don't see how we can implement this without some change to {{RpcServer}}. Maybe 
you could take that on as a follow-on change?
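To make the separation concrete, {{YarnApplicationStateFinder}} could be 
reduced to a small interface that {{RpcServer}} merely calls into, with the 
YARN-backed implementation living on the HoS side; a rough sketch of that 
shape (everything beyond the class name is hypothetical):

{code:java}
import org.apache.hadoop.yarn.api.records.YarnApplicationState;

// The RPC framework would depend only on this interface; the HoS layer
// supplies the YARN-backed implementation, keeping RpcServer generic.
public interface YarnApplicationStateFinder {
  // Current YARN state of the application backing this session,
  // or null if the application cannot be found.
  YarnApplicationState findState(String appName);
}
{code}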



> HOS times out when cluster is full while Hive-on-MR waits
> ---------------------------------------------------------
>
>                 Key: HIVE-20506
>                 URL: https://issues.apache.org/jira/browse/HIVE-20506
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Brock Noland
>            Assignee: Brock Noland
>            Priority: Major
>         Attachments: HIVE-20506-CDH5.14.2.patch, HIVE-20506.1.patch, Screen 
> Shot 2018-09-07 at 8.10.37 AM.png
>
>
> My understanding is as follows:
> When the cluster is full, Hive-on-MR will wait for resources to become 
> available before submitting a job. This is because the {{hadoop jar}} command 
> is the primary mechanism Hive uses to know whether a job has completed or failed.
>  
> Hive-on-Spark will time out after {{SPARK_RPC_CLIENT_CONNECT_TIMEOUT}} because 
> the RPC client in the AppMaster doesn't connect back to the RPC server in 
> HS2. 
> This is a behavior difference it would be great to close.


