[
https://issues.apache.org/jira/browse/HIVE-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630682#comment-14630682
]
Chengxiang Li commented on HIVE-11276:
--------------------------------------
[~xuefuz], I reviewed the code in RemoteHiveSparkClient. The reason it needs
to invoke refreshLocalResources() for every job submission is that Hive users
may use the "ADD \[FILE|JAR|ARCHIVE\] <value>" command to add resources at
runtime, so the Spark client needs to upload these resources to the Spark
cluster before job execution. RemoteHiveSparkClient keeps a list of all the
resources it has already uploaded to the Spark cluster and uses it to filter
out already-uploaded jars during refreshLocalResources(), so only newly added
jars are uploaded. Since that list should stay quite small most of the time, I
don't think there is a performance issue here.
> Optimization around job submission and adding jars [Spark Branch]
> -----------------------------------------------------------------
>
> Key: HIVE-11276
> URL: https://issues.apache.org/jira/browse/HIVE-11276
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Affects Versions: 1.1.0
> Reporter: Xuefu Zhang
> Assignee: Chengxiang Li
>
> It seems that Hive on Spark has some room for performance improvement on job
> submission. Specifically, we are calling refreshLocalResources() for every
> job submission even when there are no changes in the jar list. Since Hive on
> Spark reuses the containers for the whole user session, we might be able
> to optimize that.
> We do need to take into consideration the case of dynamic allocation, in
> which new executors might be added.
> This task is some R&D in this area.