[
https://issues.apache.org/jira/browse/HIVE-20506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607431#comment-16607431
]
Sahil Takiar commented on HIVE-20506:
-------------------------------------
The general idea makes sense to me. To confirm my understanding, this change
will essentially do the following:
* Parse the {{spark-submit}} logs and look for the YARN application id
* Create a {{YarnClient}} and check the state of the YARN app
* Check whether the app is in the {{ACCEPTED}} state (meaning it has been
acknowledged by YARN, but hasn't actually started running yet)
* As long as the app remains in the {{ACCEPTED}} state, extend the timeout
until it transitions out of that state
Is that correct?
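For reference, the first step above amounts to scanning the {{spark-submit}} output for a YARN application id. A minimal sketch of what that parsing could look like (the class name, helper, and sample log line are illustrative, not taken from the patch):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppIdParser {
  // YARN application ids look like application_<clusterTimestamp>_<sequence>
  private static final Pattern APP_ID =
      Pattern.compile("(application_\\d+_\\d+)");

  // Returns the first YARN application id found in a log line, if any.
  public static Optional<String> extractAppId(String logLine) {
    Matcher m = APP_ID.matcher(logLine);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  public static void main(String[] args) {
    String line = "INFO yarn.Client: Application report for "
        + "application_1536300000000_0042 (state: ACCEPTED)";
    System.out.println(extractAppId(line).orElse("none"));
  }
}
```

The extracted id could then be fed to a {{YarnClient}} (or matched against later log lines) to track the app's state.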
If that's the case, then I just have a few comments:
* Rather than extending the timeout, why not just create two separate ones? One
timeout for launching {{bin/spark-submit}} --> app = ACCEPTED, and another for
app = RUNNING --> connection established.
** We probably don't want to change the meaning of the current timeout for
backwards compatibility, so maybe we could deprecate the existing one and
replace it with two new ones?
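To make the two-timeout idea concrete, here is a rough sketch of the phase tracking it implies (the class name, timeout values, and injectable clock are hypothetical, not proposed config names):

```java
import java.util.function.LongSupplier;

// Tracks two independent deadlines: one for the submit phase
// (bin/spark-submit --> app = ACCEPTED) and one for the connect phase
// (app = RUNNING --> RPC connection established).
public class SubmitTimeouts {
  private final long submitTimeoutMs;
  private final long connectTimeoutMs;
  private final LongSupplier clock;  // injectable for testing
  private long phaseStart;

  public SubmitTimeouts(long submitTimeoutMs, long connectTimeoutMs,
      LongSupplier clock) {
    this.submitTimeoutMs = submitTimeoutMs;
    this.connectTimeoutMs = connectTimeoutMs;
    this.clock = clock;
    this.phaseStart = clock.getAsLong();
  }

  // Call once the app leaves ACCEPTED; the second timeout starts fresh,
  // rather than inheriting whatever time the first phase consumed.
  public void appStartedRunning() {
    phaseStart = clock.getAsLong();
  }

  public boolean submitPhaseExpired() {
    return clock.getAsLong() - phaseStart > submitTimeoutMs;
  }

  public boolean connectPhaseExpired() {
    return clock.getAsLong() - phaseStart > connectTimeoutMs;
  }
}
```

The key property is that a long wait in {{ACCEPTED}} (busy cluster) never eats into the budget for establishing the RPC connection once the app is actually running.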
* Is there any way to avoid creating a {{YarnClient}}? I guess this is
mitigated slightly by the fact that you only create the client if the timeout
is triggered.
** I'm mainly concerned about the overhead of creating a {{YarnClient}}; also,
would this work on a secure cluster?
** {{bin/spark-submit}} should print out something like {{Application report
for ... (state: ACCEPTED)}}; perhaps we could parse the state from the logs
instead?
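If the state were parsed from the logs instead of fetched via a {{YarnClient}}, the same log scan could pull it out of the report line. A rough sketch (the exact log format may vary across Spark versions, so this pattern is an assumption):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppStateParser {
  // Matches lines like:
  //   "Application report for application_... (state: ACCEPTED)"
  private static final Pattern STATE =
      Pattern.compile("Application report for \\S+ \\(state: (\\w+)\\)");

  // Returns the reported YARN app state from a log line, if present.
  public static Optional<String> extractState(String logLine) {
    Matcher m = STATE.matcher(logLine);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }
}
```

This would avoid both the {{YarnClient}} construction cost and the secure-cluster question, at the price of depending on {{spark-submit}}'s log format.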
* Can we move all the changes in {{RpcServer}} to a separate class? That class
is really meant to act as a generic RPC framework that is relatively
independent of the HoS logic.
> HOS times out when cluster is full while Hive-on-MR waits
> ---------------------------------------------------------
>
> Key: HIVE-20506
> URL: https://issues.apache.org/jira/browse/HIVE-20506
> Project: Hive
> Issue Type: Improvement
> Reporter: Brock Noland
> Assignee: Brock Noland
> Priority: Major
> Attachments: HIVE-20506-CDH5.14.2.patch, HIVE-20506.1.patch, Screen
> Shot 2018-09-07 at 8.10.37 AM.png
>
>
> My understanding is as follows:
> Hive-on-MR when the cluster is full will wait for resources to be available
> before submitting a job. This is because the hadoop jar command is the
> primary mechanism Hive uses to know if a job is complete or failed.
>
> Hive-on-Spark will timeout after {{SPARK_RPC_CLIENT_CONNECT_TIMEOUT}} because
> the RPC client in the AppMaster doesn't connect back to the RPC Server in
> HS2.
> This is a behavior difference it'd be great to close.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)