[
https://issues.apache.org/jira/browse/SPARK-17513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15498425#comment-15498425
]
Apache Spark commented on SPARK-17513:
--------------------------------------
User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/15126
> StreamExecution should discard unneeded metadata
> ------------------------------------------------
>
> Key: SPARK-17513
> URL: https://issues.apache.org/jira/browse/SPARK-17513
> Project: Spark
> Issue Type: Sub-task
> Components: Streaming
> Reporter: Frederick Reiss
>
> StreamExecution maintains a write-ahead log of batch metadata in order to
> allow repeating previously in-flight batches if the driver is restarted.
> StreamExecution does not garbage-collect or compact this log in any way.
> Since the log is implemented with HDFSMetadataLog, these files will consume
> memory on the HDFS NameNode. Specifically, each log file will consume about
> 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the
> block of file contents; see
> [https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]).
> An application with a 100 msec batch interval writes about 864,000 log files
> per day, which will increase the NameNode's heap usage by about 250MB per day.
> There is also the matter of recovery. StreamExecution reads its entire log
> when restarting. This read operation will be very expensive if the log
> contains millions of entries spread over millions of files.
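
The heap-usage figure quoted above can be checked with a short back-of-envelope sketch. It assumes one single-block log file per batch and uses the 150-byte inode and block figures from the issue text; the function and constant names are illustrative only and are not part of Spark or HDFS.

```python
# Rough estimate of NameNode heap growth from log files that are never purged.
# Assumption: every batch writes exactly one log file that fits in one block.
# The 150-byte figures are taken from the issue description; all names here
# are hypothetical.

BYTES_PER_INODE = 150
BYTES_PER_BLOCK = 150
SECONDS_PER_DAY = 24 * 60 * 60


def namenode_bytes_per_day(batch_interval_ms: int) -> int:
    """Return the NameNode heap bytes added per day at the given batch interval."""
    batches_per_day = SECONDS_PER_DAY * 1000 // batch_interval_ms
    return batches_per_day * (BYTES_PER_INODE + BYTES_PER_BLOCK)


if __name__ == "__main__":
    per_day = namenode_bytes_per_day(100)
    print(f"{per_day / 1e6:.1f} MB per day")
```

With a 100 msec interval this gives 864,000 files and 259.2 MB per day, in line with the "about 250MB" estimate in the description.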