[ https://issues.apache.org/jira/browse/LIVY-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gyorgy Gal updated LIVY-767:
----------------------------
Fix Version/s: 0.10.0
(was: 0.9.0)
This issue has been moved to the 0.10.0 release as part of a bulk update. If
you feel it was moved inappropriately, feel free to provide justification and
reset the Fix Version to 0.9.0.
> Spark-Cassandra-Connector's joinWithCassandraTable makes Spark use a less
> efficient DAG
> -------------------------------------------------------------------------------------
>
> Key: LIVY-767
> URL: https://issues.apache.org/jira/browse/LIVY-767
> Project: Livy
> Issue Type: Bug
> Components: Interpreter
> Affects Versions: 0.7.0
> Environment: Spark and Livy running in containers on Kubernetes.
> Deployed using Microsoft's Helm chart:
> https://hub.helm.sh/charts/microsoft/spark.
> Reporter: Zef Wolffs
> Priority: Major
> Fix For: 0.10.0
>
>
> I have a script that uses joinWithCassandraTable, a method provided by the
> spark-cassandra-connector. When run directly through spark-submit, the script
> usually claims all available executors (about 25 in my setup) for this method
> and creates many tasks. However, when I run it through Livy, Spark claims only
> two executors and creates only two tasks, which makes the job much slower.
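>
> Before even looking at the script, the difference shows up in the session's
> parallelism settings. A minimal diagnostic sketch (assuming the usual sc
> handle is available in both the spark-submit shell and the Livy session):
>
> {noformat}
> // Print the settings that drive executor count and task parallelism;
> // "<unset>" means the conf key was never set in that environment.
> println(s"default parallelism: ${sc.defaultParallelism}")
> println(s"executor instances:  ${sc.getConf.get("spark.executor.instances", "<unset>")}")
> println(s"dynamic allocation:  ${sc.getConf.get("spark.dynamicAllocation.enabled", "<unset>")}")
> {noformat}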
>
> The script is the following:
>
> {noformat}
> import java.sql.Timestamp             // assuming java.sql.Timestamp for the timestamp column
> import com.datastax.spark.connector._ // brings joinWithCassandraTable into scope
>
> case class TimeSeriesContainerID(timeseriescontainerid: Int) // defines the partition key
>
> // Build the keys of interest, join them against the Cassandra table, then aggregate.
> val timeSeriesContainerIDofInterest = sc.parallelize(1 to 20).map(TimeSeriesContainerID(_))
> val rdd = timeSeriesContainerIDofInterest.joinWithCassandraTable[(Timestamp, Int)]("test_ts_partitions", "ts_0")
> val aggregatedRDD = rdd.select("timestamp", "tsvalue").map(s => s._2).reduceByKey((a, b) => a + b).collect.head
> {noformat}
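>
> Since sc.parallelize without an explicit partition count falls back to
> spark.default.parallelism, a session that defaults to two would explain the
> two tasks. A quick check on the RDDs above (again just a sketch) would be:
>
> {noformat}
> // Each stage's task count follows its RDD's partition count, so a low
> // number here means few tasks no matter how many executors are available.
> println(s"input partitions: ${timeSeriesContainerIDofInterest.partitions.length}")
> println(s"join partitions:  ${rdd.partitions.length}")
> {noformat}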
>
>
> I am happy to provide more information on request.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)