[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401300#comment-16401300
 ] 

Jo Desmet commented on SPARK-8008:
----------------------------------

Too bad that this issue is not considered high priority. Too often I run into 
the problem that I need to process billions of records. The only way to handle 
this is to create a huge number of partitions, and then throttle using 
spark.executor.cores. However, this setting effectively throttles my entire 
RDD, not just the portion that loads from the database. It would be hugely 
beneficial if I could restrict not only the number of partitions at any time, 
but also the task concurrency at any point in my RDD.
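
To put rough numbers on the workaround above, here is a minimal sketch of the 
arithmetic. It assumes a fixed number of executors (no dynamic allocation) and 
that every running JDBC read task holds one database connection. The function 
names and the cluster sizes are hypothetical and chosen only for illustration; 
spark.executor.cores and spark.task.cpus are the real settings being modelled.

```python
def max_concurrent_tasks(num_executors: int, executor_cores: int,
                         task_cpus: int = 1) -> int:
    """Task slots across the cluster.

    Models spark.executor.cores // spark.task.cpus slots per executor,
    multiplied by a fixed executor count (an assumption of this sketch).
    """
    return num_executors * (executor_cores // task_cpus)


def concurrent_jdbc_reads(num_partitions: int, num_executors: int,
                          executor_cores: int, task_cpus: int = 1) -> int:
    """Upper bound on JDBC partitions being read at the same moment.

    Assumes one database connection per running read task.
    """
    slots = max_concurrent_tasks(num_executors, executor_cores, task_cpus)
    return min(num_partitions, slots)


# Hypothetical job: 10,000 JDBC partitions on 20 executors.
print(concurrent_jdbc_reads(10_000, 20, 8))  # 160 simultaneous reads
print(concurrent_jdbc_reads(10_000, 20, 1))  # 20 simultaneous reads
```

Dropping spark.executor.cores from 8 to 1 caps the load on the database at 20 
connections in this example, but the same 20 slots then also bound every later 
stage of the job, which is the drawback described above.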

> JDBC data source can overload the external database system due to high 
> concurrency
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-8008
>                 URL: https://issues.apache.org/jira/browse/SPARK-8008
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Rene Treffer
>            Priority: Major
>
> Spark tries to load as many partitions as possible in parallel, which can in 
> turn overload the database although it would be possible to load all 
> partitions given a lower concurrency.
> It would be nice to either limit the maximum concurrency or to at least warn 
> about this behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

