[ https://issues.apache.org/jira/browse/LIVY-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gyorgy Gal updated LIVY-767:
----------------------------
Fix Version/s: 0.10.0
(was: 0.9.0)
This issue has been moved to the 0.10.0 release as part of a bulk update. If
you feel it was moved inappropriately, feel free to provide justification and
reset the Fix Version to 0.9.0.
> Spark-Cassandra-Connector's joinWithCassandraTable makes Spark use a less
> efficient DAG
> -------------------------------------------------------------------------------------
>
> Key: LIVY-767
> URL: https://issues.apache.org/jira/browse/LIVY-767
> Project: Livy
> Issue Type: Bug
> Components: Interpreter
> Affects Versions: 0.7.0
> Environment: Spark and Livy running in containers on Kubernetes.
> Deployed using Microsoft's Helm chart:
> https://hub.helm.sh/charts/microsoft/spark.
> Reporter: Zef Wolffs
> Priority: Major
> Fix For: 0.10.0
>
>
> I have a script that uses joinWithCassandraTable, a method provided by the
> spark-cassandra-connector. When run directly through spark-submit, the script
> usually claims all available executors (about 25 in my setup) for this method
> and creates many tasks. However, when I run it through Livy, Spark claims only
> two executors and creates only two tasks, which makes the job much slower.
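>
> Before even looking at the script, the difference shows up in the session's
> parallelism settings. A minimal diagnostic sketch (assuming the usual sc
> handle is available in both the spark-submit shell and the Livy session):
>
> {noformat}
> // Print the settings that drive executor count and task parallelism;
> // "<unset>" means the conf key was never set in that environment.
> println(s"default parallelism: ${sc.defaultParallelism}")
> println(s"executor instances:  ${sc.getConf.get("spark.executor.instances", "<unset>")}")
> println(s"dynamic allocation:  ${sc.getConf.get("spark.dynamicAllocation.enabled", "<unset>")}")
> {noformat}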
>
> The script is the following:
>
> {noformat}
> import java.sql.Timestamp             // assuming java.sql.Timestamp for the timestamp column
> import com.datastax.spark.connector._ // brings joinWithCassandraTable into scope
>
> case class TimeSeriesContainerID(timeseriescontainerid: Int) // defines the partition key
>
> // Build the keys of interest, join them against the Cassandra table, then aggregate.
> val timeSeriesContainerIDofInterest = sc.parallelize(1 to 20).map(TimeSeriesContainerID(_))
> val rdd = timeSeriesContainerIDofInterest.joinWithCassandraTable[(Timestamp, Int)]("test_ts_partitions", "ts_0")
> val aggregatedRDD = rdd.select("timestamp", "tsvalue").map(s => s._2).reduceByKey((a, b) => a + b).collect.head
> {noformat}
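>
> Since sc.parallelize without an explicit partition count falls back to
> spark.default.parallelism, a session that defaults to two would explain the
> two tasks. A quick check on the RDDs above (again just a sketch) would be:
>
> {noformat}
> // Each stage's task count follows its RDD's partition count, so a low
> // number here means few tasks no matter how many executors are available.
> println(s"input partitions: ${timeSeriesContainerIDofInterest.partitions.length}")
> println(s"join partitions:  ${rdd.partitions.length}")
> {noformat}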
>
>
> I am happy to provide more information on request.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)