[ 
https://issues.apache.org/jira/browse/LIVY-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zef Wolffs updated LIVY-767:
----------------------------
    Summary: Spark-Cassandra-Connector's joinWithCassandraTable makes spark use 
less efficient DAG  (was: Spark-Cassandra-Connector's joinWithCassandraTable 
makes spark use only two executors)

> Spark-Cassandra-Connector's joinWithCassandraTable makes spark use less 
> efficient DAG
> -------------------------------------------------------------------------------------
>
>                 Key: LIVY-767
>                 URL: https://issues.apache.org/jira/browse/LIVY-767
>             Project: Livy
>          Issue Type: Bug
>          Components: Interpreter
>    Affects Versions: 0.7.0
>         Environment: Spark and Livy running in containers on Kubernetes. 
> Deployed using Microsoft's Helm chart: 
> https://hub.helm.sh/charts/microsoft/spark.
>            Reporter: Zef Wolffs
>            Priority: Major
>
> I have a script that uses joinWithCassandraTable, a method provided by the 
> spark-cassandra-connector. When submitted directly through spark-submit, the 
> script usually claims all available executors (about 25 in my setup) for this 
> method. However, when I run it through Livy, Spark claims only two executors, 
> which makes the job much slower. 
>  
> The script is the following: 
>  
> {noformat}
> // Defines the partition key
> case class TimeSeriesContainerID(timeseriescontainerid: Int)
>
> val timeSeriesContainerIDofInterest =
>   sc.parallelize(1 to 20).map(TimeSeriesContainerID(_))
>
> val rdd = timeSeriesContainerIDofInterest
>   .joinWithCassandraTable[(Timestamp, Int)]("test_ts_partitions", "ts_0")
>
> val aggregatedRDD = rdd
>   .select("timestamp", "tsvalue")
>   .map(s => s._2)
>   .reduceByKey((a, b) => a + b)
>   .collect
>   .head
> {noformat}
>  
> I am happy to provide more information as requested.
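>  
> As a possible angle for reproducing or working around this (untested on my 
> side), the executor count can be requested explicitly when the Livy session 
> is created; the numExecutors, executorCores, and conf fields below are taken 
> from Livy's REST API for POST /sessions, and the exact values here are only 
> illustrative: 
>  
> {noformat}
> POST /sessions
> {
>   "kind": "spark",
>   "numExecutors": 25,
>   "executorCores": 2,
>   "conf": { "spark.dynamicAllocation.enabled": "false" }
> }
> {noformat}
>  
> If the less efficient DAG persists even with executors pinned this way, the 
> problem is presumably not in resource allocation itself. 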



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
