[
https://issues.apache.org/jira/browse/SPARK-46702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-46702:
----------------------------------
Priority: Critical (was: Blocker)
> Spark Cluster Crashing
> ----------------------
>
> Key: SPARK-46702
> URL: https://issues.apache.org/jira/browse/SPARK-46702
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Spark Docker
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Mohamad Haidar
> Priority: Critical
> Labels: databricks
> Attachments: CV62A4~1.LOG, cveshv-events-streaming-TRACE (2).zip,
> cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log,
> image-2024-01-12-10-44-45-717.png, image-2024-01-12-10-45-18-905.png,
> image-2024-01-12-10-45-30-398.png, image-2024-01-12-10-45-40-397.png,
> image-2024-01-12-10-45-50-427.png,
> logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log
>
>
> h3. Description:
> * We have a Spark cluster running on a k8s cluster with one driver and
> 120 executors.
> * We configure our batch duration to 30 seconds.
> * The Spark cluster reads from a 120-partition Kafka topic and
> writes to an hourly index in Elasticsearch.
> * ES has 30 DataNodes, 1 shard per DataNode for each index.
> * Configuration of Driver STS is in Appendix.
> * The driver is observed restarting periodically. Restarts do not always
> occur, but when they do, they recur every 10 minutes.
> * The restart frequency increases as throughput increases.
> * When the restarts occur, we see OptionalDataException. The attached log
> “logs_cveshv-events-streaming-core-cp-type2-filter-driver-0_051023_1500_prev.log”
> shows a run that ended in a restart of the driver.
> h3. Analysis:
> # We ran a test at 250 K records/second, and processing was good, taking
> between 15 and 20 seconds.
> # We were able to avoid all the restarts by simply disabling the liveness checks.
> # This resulted in no restarts of Streaming Core. We tried the above in
> two scenarios:
> * Speculation disabled --> After 10 to 20 minutes the batch duration
> increased to minutes and processing eventually became very slow. During this
> time, the main error logs observed were about {*}The executor with id 7 exited with exit
> code 50(Uncaught exception).{*} Logs at WARN level and TRACE level were
> collected:
> * {*}WARN{*}: Logs attached
> “cveshv-events-streaming-core-cp-type2-filter-driver-0_liveness_300000_failed_120124_0336_2.log”
> * {*}TRACE{*}: Logs attached “cveshv-events-streaming-TRACE (2).zip”
> * Speculation enabled --> The batch duration increased to minutes (big lag)
> only after around 2 hours. The related log is
> “cveshv-events-streaming-core-cp-type2-filter-driver-0_zrdm71bnrt201_1201240818.log”.
> h3. Conclusion:
> * The liveness check is failing, which causes the restarts.
> * The logs indicate that there are some unhandled exceptions on the executors.
> * The issue may lie somewhere else as well. Below is the liveness check that was
> disabled; it was initially causing the restarts every 10 minutes, after 3
> occurrences.
>
> h3. !image-2024-01-12-10-44-45-717.png!
> h3. Next Action:
> * Please help us identify the root cause (RC) of the issue. We have tried many
> configurations and 2 different Spark versions (3.4 and 3.5), and we are not
> able to avoid the issue.
>
> Appendix:
>
> !image-2024-01-12-10-45-18-905.png!
> !image-2024-01-12-10-45-30-398.png!
> !image-2024-01-12-10-45-40-397.png!
> !image-2024-01-12-10-45-50-427.png!
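
As a rough sanity check on the load figures quoted in the Description and Analysis above (250 K records/second, 30-second batches, a 120-partition topic), the per-batch volume can be worked out as below. This is only back-of-the-envelope arithmetic: it assumes records are spread evenly across partitions, which the report does not state, and the function names and the 120-second "slow" processing time are illustrative, not taken from the attached logs.

```python
def batch_load(records_per_sec, batch_seconds, partitions):
    """Return (records per batch, records per partition per batch)."""
    per_batch = records_per_sec * batch_seconds
    # Assumption: records are evenly distributed across partitions.
    per_partition = per_batch / partitions
    return per_batch, per_partition


def is_keeping_up(processing_seconds, batch_seconds):
    """A batch must finish within the batch interval, or batches queue up."""
    return processing_seconds < batch_seconds


per_batch, per_partition = batch_load(250_000, 30, 120)
print(per_batch)       # 7500000 records in each 30-second batch
print(per_partition)   # 62500.0 records per partition per batch

# Reported healthy case: 15 to 20 seconds of processing per 30-second batch.
print(is_keeping_up(20, 30))    # True

# Hypothetical degraded case: processing takes 2 minutes (illustrative value).
print(is_keeping_up(120, 30))   # False
```

Once processing time exceeds the 30-second interval, each new batch waits behind the previous one, which would be consistent with the growing lag described in both speculation scenarios.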