Update - the answer was spark.cassandra.input.split.sizeInMB. The
default value is 512 MB. Setting this to 50 resulted in many more
splits, and the job ran in under 11 minutes with no timeout errors. In
this case the job was a simple count: 10 minutes 48 seconds for over
8.2 billion rows.
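For anyone tuning this later: the split size directly drives how many Spark partitions the scan gets, roughly ceil(estimated table size / split size). A small sketch of that arithmetic - the 8 TB table size below is a made-up illustration, not a measured value from this cluster:

```java
// Rough estimate of Spark input partitions for a full Cassandra table scan:
// splits ~= ceil(tableSizeMB / spark.cassandra.input.split.sizeInMB).
public class SplitEstimate {
    static long estimateSplits(long tableSizeMB, long splitSizeMB) {
        return (tableSizeMB + splitSizeMB - 1) / splitSizeMB; // ceiling division
    }

    public static void main(String[] args) {
        long tableSizeMB = 8L * 1024 * 1024; // hypothetical 8 TB table
        System.out.println(estimateSplits(tableSizeMB, 512)); // default split size
        System.out.println(estimateSplits(tableSizeMB, 50));  // tuned split size
    }
}
```

With the default 512 MB splits you get thousands of tasks; dropping to 50 MB multiplies the task count by ~10, so each task reads far less per request and is much less likely to hit the read timeout.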
Update - I believe that for large tables,
spark.cassandra.read.timeoutMS needs to be very long - 4 hours or
longer. The job now runs much longer, but still doesn't complete. I'm
now facing this all-too-familiar error:
com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException:
Some more info. It only happens on large tables (more than 1 billion
rows); it works fine on a 300-million-row table. There is very high CPU
usage during the run. I've tried setting
spark.dse.continuousPagingEnabled to false, and I've tried several
different GC strategies - but I'm still getting timeouts.
Using openJDK 11 with:
-XX:+UseG1GC
-XX:+ParallelRefProcEnabled
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70
-XX:ParallelGCThreads=24
-XX:ConcGCThreads=24
Machine has 40
Still no go. Oddly, I can do a count with Trino OK, but with Spark I
get the timeouts. I don't believe tombstones are an issue:
nodetool cfstats doc.doc
Total number of tables: 82
Keyspace : doc
Read Count: 1514288521
Read Latency: 0.5080819034089475 ms
It may be the case that you have lots of tombstones in this table,
which is making reads slow and causing timeouts during bulk reads.
On Fri, Feb 4, 2022, 03:23 Joe Obernberger wrote:
So it turns out that the number after PT is in increments of 10
seconds. I changed the timeout to 96, and now I get PT16M (96/6 = 16
minutes).
Since I'm still getting timeouts, something else must be wrong.
Exception in thread "main" org.apache.spark.SparkException: Job aborted
due to stage failure
I did find this:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
And "spark.cassandra.read.timeoutMS" is set to 12.
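That milliseconds setting also explains the "PT2M" string from earlier: a millisecond timeout renders as an ISO-8601 duration. A minimal java.time sketch - the 120000 ms figure is what I recall the connector reference listing as the default, so treat it as an assumption:

```java
import java.time.Duration;

public class TimeoutToPT {
    public static void main(String[] args) {
        // 120000 ms is assumed here to be the documented default for
        // spark.cassandra.read.timeoutMS; this only illustrates how a
        // millisecond timeout maps to the "PT..." strings in the errors.
        Duration d = Duration.ofMillis(120000);
        System.out.println(d); // ISO-8601 form, e.g. PT2M for 2 minutes
    }
}
```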
Running a test now, and I think that is it. Thank you Scott.
-Joe
On 2/3/2022 3:19 PM, Joe Obernberger wrote:
Thank you Scott!
I am using the spark cassandra connector. Code:

SparkSession spark = SparkSession
        .builder()
        .appName("SparkCassandraApp")
        .config("spark.cassandra.connection.host", "chaos")
        .getOrCreate();
Hi Joe, it looks like "PT2M" may refer to a timeout value that could be
set by your Spark job's initialization of the client. I don't see a
string matching this in the Cassandra codebase itself, but I do see
that this is parseable as a Duration.
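You can check that directly with java.time (the same thing a quick jshell session would show):

```java
import java.time.Duration;

public class ParsePT {
    public static void main(String[] args) {
        // "PT2M" is an ISO-8601 duration: 2 minutes.
        Duration d = Duration.parse("PT2M");
        System.out.println(d.getSeconds()); // total seconds in the duration
    }
}
```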