Oops, I linked the wrong JIRA ticket (the one in the message below is related). The correct one is:
https://issues.apache.org/jira/browse/SPARK-28025

On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas <gerard.m...@gmail.com> wrote:

> Hi!
>
> I would like to socialize this issue we are currently facing:
> The Structured Streaming default CheckpointFileManager leaks .crc files by
> leaving them behind after users of this class (like
> HDFSBackedStateStoreProvider) apply their cleanup methods.
>
> This results in an unbounded creation of tiny files that eat away storage
> by the block and, in our case, deteriorates the file system performance.
>
> We correlated the processedRowsPerSecond reported by the
> StreamingQueryProgress against a count of the .crc files in the storage
> directory (checkpoint + state store). The performance impact we observe is
> dramatic.
>
> We are running on Kubernetes, using GlusterFS as the shared storage
> provider.
>
> [image: out processedRowsPerSecond vs. files in storage_process.png]
>
> I have created a JIRA ticket with additional detail:
>
> https://issues.apache.org/jira/browse/SPARK-17475
>
> This is also related to an earlier discussion about the state store
> unbounded disk-size growth, which was left unresolved back then:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-State-Store-storage-behavior-for-the-Stream-Deduplication-function-td34883.html
>
> If there's any additional detail I should add/research, please let me know.
>
> kind regards, Gerard.
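
For anyone who wants to check their own deployment, the .crc file count mentioned in the quoted message can be approximated with a small script. This is only a minimal sketch and not necessarily the tooling behind the plot: it assumes the checkpoint and state store directories are reachable through a local mount (for example a mounted GlusterFS volume), and the function name count_crc_files and the default path "." are hypothetical.

```python
import os
import sys
from typing import Tuple


def count_crc_files(root: str) -> Tuple[int, int]:
    """Return (number of .crc files, their total size in bytes) under root.

    Hypothetical helper: walks a locally mounted checkpoint directory
    and counts files whose name ends in ".crc".
    """
    count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".crc"):
                continue
            count += 1
            try:
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # The file vanished between listing and stat; it is still
                # counted, but contributes no bytes to the total.
                pass
    return count, total_bytes


if __name__ == "__main__":
    # Assumed usage: pass the mounted checkpoint location as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    n, size = count_crc_files(root)
    print(f"{n} .crc files, {size} bytes under {root}")
```

Running it periodically and logging the count next to the processedRowsPerSecond value from the query progress would give the two series to compare. Note that the byte total understates the real cost on disk, since each tiny file still occupies at least one file system block.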