[ 
https://issues.apache.org/jira/browse/HIVE-17291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125439#comment-16125439
 ] 

Peter Vary commented on HIVE-17291:
-----------------------------------

[~lirui]: You might be using black magic, or something similar :)
In the middle of the night I woke up with the feeling that something was wrong 
with the patch - some of the query outputs should have changed as a result of 
the change - so I decided to check a few things, and I saw your comment, which 
immediately answered my question. What I am not sure of is whether you were 
using black magic to wake me up because my patch was not good, or white magic 
to answer my unasked question :) :) :)

I found this problem with the help of the qtest outputs, but I think it signals 
a more important usage problem. You probably have more experience with how 
users use HoS and how they tune their queries, but when my wife optimizes an 
Oracle query she often iterates through explain / change hint / explain loops. 
I imagine that with Hive it looks like this: explain / change config / explain. 
When we change the Spark configuration the RpcServer is killed and a new one is 
started (I might be wrong here - please tell me if it does not work like this). 
The result is that the first explain differs from the next one, which defeats 
the whole purpose of the optimization.

What happens when a query uses the wrong number of reducers? Will the query 
still run, only with slower execution? Or will it run out of memory?

Also, the test results of HIVE-17292 showed me that with multiple executors 
configured, the output of {{hiveSparkClient.getExecutorCount()}} is even less 
reliable. The possible outcomes are:
- *0* - if only the Application Master is started
- *1* - if the AM and 1 executor are started
- *2* - if the AM and both executors are started
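
Just to make the next point concrete: if we wanted to trust this value at all, 
we would first need some kind of guard, roughly like the sketch below 
(illustrative only - the helper and its parameters are mine, not code from the 
patch):

{code:java}
// Sketch only: wait until the expected number of executors have registered
// before trusting getExecutorCount() for any sizing decision.
// 'expected' would come from spark.executor.instances; the names are illustrative.
private int waitForExecutors(HiveSparkClient hiveSparkClient, int expected,
    long timeoutMs) throws Exception {
  long deadline = System.currentTimeMillis() + timeoutMs;
  int count = hiveSparkClient.getExecutorCount();
  while (count < expected && System.currentTimeMillis() < deadline) {
    Thread.sleep(1000);
    count = hiveSparkClient.getExecutorCount();
  }
  // Can still be 0, 1 or 2 here if the executors are slow to come up.
  return count;
}
{code}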

So we should not base any decision on it, neither in QTestUtil nor in 
{{sparkSession.getMemoryAndCores()}}, unless we have made sure that all of the 
executors which will be running at the time of the query are already started. 
I am starting to understand the wisdom of [~xuefuz]'s statement:

??I also had some doubts on the original idea which was to automatically and 
dynamically deciding the parallelism based on available memory/cores. Maybe we 
should back to the basis, where the number of reducers is solely determined 
(statically) by the total shuffled data size (stats) divided by the 
configuration "bytes per reducer". I'm open to all proposals, including doing 
this for dynamic allocation and using spark.executor.instances for static 
allocation.??

In the light of these findings I think we should calculate the parallelism with 
the algorithm proposed by [~xuefuz] (a rough sketch follows the list):
- for dynamic allocation - the number of reducers is determined solely 
(statically) by the total shuffled data size (stats) divided by the configured 
"bytes per reducer"
- for static allocation - by using spark.executor.instances
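
To make sure we are talking about the same thing, here is a rough sketch of the 
calculation as I imagine it (illustrative only - the method and parameter names 
are mine, and capping by hive.exec.reducers.max etc. is left out):

{code:java}
// Sketch of the proposed rule - not actual patch code.
// totalShuffledBytes comes from the statistics, bytesPerReducer from
// hive.exec.reducers.bytes.per.reducer, executorInstances from
// spark.executor.instances.
static int estimateReducers(boolean dynamicAllocation, long totalShuffledBytes,
    long bytesPerReducer, int executorInstances) {
  if (dynamicAllocation) {
    // Dynamic allocation: derive the parallelism from the data size alone.
    return (int) Math.max(1L, (totalShuffledBytes + bytesPerReducer - 1) / bytesPerReducer);
  }
  // Static allocation: derive the parallelism from spark.executor.instances.
  return Math.max(1, executorInstances);
}
{code}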

What do you think?

Sorry for dragging you through the whole thinking process :(

Thanks,
Peter

> Set the number of executors based on config if client does not provide 
> information
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-17291
>                 URL: https://issues.apache.org/jira/browse/HIVE-17291
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>    Affects Versions: 3.0.0
>            Reporter: Peter Vary
>            Assignee: Peter Vary
>         Attachments: HIVE-17291.1.patch
>
>
> When calculating the memory and cores, if the client does not provide the 
> information, we should try to use the defaults. This can happen on startup, 
> when {{spark.dynamicAllocation.enabled}} is not enabled



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
