[ https://issues.apache.org/jira/browse/SPARK-30462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved SPARK-30462.
-----------------------------------
Fix Version/s: 3.1.0
Resolution: Fixed
Issue resolved by pull request 28904
[https://github.com/apache/spark/pull/28904]
> Structured Streaming _spark_metadata fills up Spark Driver memory when having
> lots of objects
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-30462
> URL: https://issues.apache.org/jira/browse/SPARK-30462
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.4.3, 2.4.4, 3.0.0
> Reporter: Vladimir Yankov
> Assignee: Jungtaek Lim
> Priority: Major
> Fix For: 3.1.0
>
>
> Hi,
> With the current implementation of Spark Structured Streaming, it does not
> seem to be possible to keep a stream running constantly, writing millions of
> files, without increasing the Spark driver's memory to dozens of GBs.
> In our scenario we are using Spark Structured Streaming to consume messages
> from a Kafka cluster, transform them, and write them as compressed Parquet
> files to an S3 object store service (a sketch of this pipeline follows at the
> end of this description).
> Every 30 seconds a new micro-batch of the streaming query writes hundreds of
> objects, which over time results in millions of objects in S3.
> As every written object is recorded in _spark_metadata, the compact files
> there grow to GBs in size, eventually filling up the Spark driver's memory
> and leading to OOM errors.
> We need a way to configure Structured Streaming to run without loading all
> of the historically accumulated metadata into memory.
> Regularly resetting the _spark_metadata and checkpoint folders is not an
> option in our use case, as we use the information in _spark_metadata as a
> register of the written objects for faster querying and search.