[ 
https://issues.apache.org/jira/browse/SPARK-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-2773:
----------------------------

    Description: Right now, Spark uses the total "shuffle" memory usage of 
each thread to decide whether to spill. I do not think this is very 
reasonable. For example, suppose two threads are pulling "shuffle" data and 
the total memory available for buffering is 21G. Spilling is first triggered 
when one thread has used 7G to buffer "shuffle" data; here I assume the other 
thread has used the same amount. Unfortunately, 7G of the budget still 
remains unused. So I think the current prediction mode is too conservative 
and cannot maximize the usage of "shuffle" memory. In my solution, I use the 
growth rate of "shuffle" memory instead. The growth per step is bounded, say 
10K * 1024 (my assumption), so spilling is first triggered when the remaining 
"shuffle" memory is less than threads * growth * 2, i.e. 2 * 10M * 2. I think 
this maximizes the usage of "shuffle" memory. Any suggestions?  (was: Right 
now, Spark uses the total usage of "shuffle" memory of each thread to predict 
if need to spill. I think it is not very reasonable. For example, there are 
two threads pulling "shuffle" data. The total memory used to buffer data is 
21G. The first time to trigger spilling it when one thread has used 7G memory 
to buffer "shuffle" data, here I assume another one has used the same size. 
Unfortunately, I still have remaining 7G to use. So, I think current 
prediction mode is too conservation, and can not maximize the usage of 
"shuffle" memory. In my solution, I use the growth rate of "shuffle" memory. 
Again, the growth of each time is limited, maybe 10K * 1024(my assumption), 
then the first time to trigger spilling is when the remaining "shuffle" 
memory is less than threads * growth * 2, i.e. 2 * 2 * 10M. I think it can 
maximize the usage of "shuffle" memory. Any suggestion?)
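The two policies described above can be sketched as follows. This is a 
minimal illustration built only from the numbers in the example; the 
constants, function names, and the per-thread threshold formula are my 
assumptions for the sketch, not Spark's actual implementation:

```python
# Illustrative sketch of the two spill policies from the description.
# All constants come from the example (21G budget, 2 threads, 7G trigger,
# 10M growth per step); none of this is Spark's real configuration.

TOTAL_SHUFFLE_MEMORY = 21 * 1024  # 21G, expressed in MB
NUM_THREADS = 2
GROWTH_PER_STEP = 10              # 10K * 1024 = 10M per step, in MB

def should_spill_current(per_thread_used_mb):
    """Current policy (as described): a thread spills once its own usage
    reaches a fixed share of the total shuffle memory."""
    # In the example, spilling fires at 7G per thread (an assumed share
    # of the 21G budget, chosen here to reproduce the example).
    threshold = TOTAL_SHUFFLE_MEMORY / 3
    return per_thread_used_mb >= threshold

def should_spill_proposed(total_used_mb):
    """Proposed policy: spill only when the remaining shuffle memory could
    be exhausted by all threads within two growth steps."""
    remaining = TOTAL_SHUFFLE_MEMORY - total_used_mb
    return remaining < NUM_THREADS * GROWTH_PER_STEP * 2  # 2 * 10M * 2 = 40M

# With both threads at 7G each (14G used in total), the current policy
# already spills, even though 7G of the 21G budget is still unused:
print(should_spill_current(7 * 1024))                    # True
# The proposed policy keeps buffering until only ~40M remains:
print(should_spill_proposed(14 * 1024))                  # False
print(should_spill_proposed(TOTAL_SHUFFLE_MEMORY - 30))  # True
```

Under this sketch the growth-based check defers spilling until nearly the 
whole budget is in use, which is the gain the proposal is after; the open 
question is whether a fixed per-step growth bound holds in practice.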

> Shuffle: use growth rate to predict whether to spill
> ----------------------------------------------------
>
>                 Key: SPARK-2773
>                 URL: https://issues.apache.org/jira/browse/SPARK-2773
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 0.9.0, 1.0.0
>            Reporter: uncleGen
>            Priority: Minor
>
> Right now, Spark uses the total "shuffle" memory usage of each thread to 
> decide whether to spill. I do not think this is very reasonable. For 
> example, suppose two threads are pulling "shuffle" data and the total 
> memory available for buffering is 21G. Spilling is first triggered when one 
> thread has used 7G to buffer "shuffle" data; here I assume the other thread 
> has used the same amount. Unfortunately, 7G of the budget still remains 
> unused. So I think the current prediction mode is too conservative and 
> cannot maximize the usage of "shuffle" memory. In my solution, I use the 
> growth rate of "shuffle" memory instead. The growth per step is bounded, 
> say 10K * 1024 (my assumption), so spilling is first triggered when the 
> remaining "shuffle" memory is less than threads * growth * 2, i.e. 
> 2 * 10M * 2. I think this maximizes the usage of "shuffle" memory. Any 
> suggestions?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
