Frederick Reiss created SPARK-17513:
---------------------------------------
Summary: StreamExecution should discard unneeded metadata
Key: SPARK-17513
URL: https://issues.apache.org/jira/browse/SPARK-17513
Project: Spark
Issue Type: Sub-task
Components: Streaming
Reporter: Frederick Reiss
StreamExecution maintains a write-ahead log of batch metadata so that batches
that were in flight can be repeated if the driver is restarted.
StreamExecution does not garbage-collect or compact this log in any way.
Since the log is implemented with HDFSMetadataLog, its files consume memory
on the HDFS NameNode. Specifically, each log file consumes about 300 bytes of
NameNode memory (150 bytes for the inode and 150 bytes for the block of file
contents; see
[https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]).
At one log file per batch, an application with a 100 msec batch interval
writes 864,000 log files per day, which increases the NameNode's heap usage
by about 250MB per day.
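The estimate can be checked with a few lines of arithmetic. This is only a
back-of-the-envelope sketch: it assumes one log file per batch and one block
per file, and the function and constant names are made up for illustration.

```python
# Rough estimate of NameNode heap growth caused by the metadata log.
# Assumptions: one log file per batch, one block per file, and about
# 150 bytes of NameNode heap each for the inode and the block.

BYTES_PER_INODE = 150
BYTES_PER_BLOCK = 150
SECONDS_PER_DAY = 24 * 60 * 60


def namenode_bytes_per_day(batch_interval_ms: int) -> int:
    """Heap bytes added per day when every batch writes one small file."""
    batches_per_day = SECONDS_PER_DAY * 1000 // batch_interval_ms
    return batches_per_day * (BYTES_PER_INODE + BYTES_PER_BLOCK)


if __name__ == "__main__":
    per_day = namenode_bytes_per_day(100)
    # 864,000 files/day * 300 bytes = 259,200,000 bytes
    print(f"{per_day / 1e6:.1f} MB/day")  # prints "259.2 MB/day"
```

The result is 259.2 MB in decimal units, or roughly 247 MiB, which matches
the "about 250MB" figure.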
There is also the matter of recovery. StreamExecution reads its entire log when
restarting. This read operation will be very expensive if the log contains
millions of entries spread over millions of files.
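One possible shape for the cleanup is a purge step that deletes log files
below some batch id. The sketch below is hypothetical and uses a local
directory as a stand-in for HDFS: purge() and demo() are illustrative names,
not existing HDFSMetadataLog methods, and deciding which batches are safe to
discard is left to the caller.

```python
import os
import tempfile


def purge(log_dir: str, threshold_batch_id: int) -> int:
    """Delete log files whose batch id is below the threshold.

    Assumes each log file is named after its batch id (an assumption for
    this sketch). Returns the number of files removed.
    """
    removed = 0
    for name in os.listdir(log_dir):
        # Ignore anything that is not a plain batch-id file name.
        if name.isdigit() and int(name) < threshold_batch_id:
            os.remove(os.path.join(log_dir, name))
            removed += 1
    return removed


def demo() -> list:
    """Write ten fake batch files, purge ids below 8, list what is left."""
    with tempfile.TemporaryDirectory() as log_dir:
        for batch_id in range(10):
            with open(os.path.join(log_dir, str(batch_id)), "w") as f:
                f.write("{}")
        purge(log_dir, threshold_batch_id=8)
        return sorted(int(n) for n in os.listdir(log_dir))


if __name__ == "__main__":
    print(demo())  # prints "[8, 9]"
```

With a step like this run periodically, both the NameNode footprint and the
number of files read at restart stay bounded by the retention window rather
than growing with the age of the application.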