Gerard Maas created SPARK-28025:
-----------------------------------
Summary: HDFSBackedStateStoreProvider leaks .crc files
Key: SPARK-28025
URL: https://issues.apache.org/jira/browse/SPARK-28025
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 2.4.3
Environment: Spark 2.4.3
Kubernetes 1.11(?) (OpenShift)
StateStore storage on a mounted PVC, which the
`FileContextBasedCheckpointFileManager` sees as a local filesystem:
{noformat}
scala> glusterfm.isLocal
res17: Boolean = true{noformat}
Reporter: Gerard Maas
When using the default CheckpointFileManager, the HDFSBackedStateStoreProvider
leaves '.crc' files behind. A .crc file is created for each `atomicFile`
operation of the CheckpointFileManager.
Over time, the number of files becomes very large. The state store file system
grows constantly and, in our case, file system performance degrades.
Here's a sample from one of our Spark storage volumes after 2 days of execution
(4 stateful streaming jobs, each in a different sub-dir):
{noformat}
# Total files in PVC (used for checkpoints and state store)
$ find . | wc -l
431796
# .crc files
$ find . -name "*.crc" | wc -l
418053{noformat}
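To check how many of those .crc files are really left over, a rough sketch like the following could count checksum files that no longer have a data file next to them. It assumes Hadoop's usual `.<name>.crc` naming for checksum files, and it assumes the leaked files are checksums whose data file was renamed or deleted (the report does not state this). The function name `count_orphan_crc` is made up for this example and is not part of Spark or Hadoop.

```shell
#!/bin/sh
# count_orphan_crc DIR
# Prints the number of checksum files under DIR whose data file is missing.
# ASSUMPTION: checksum files are named ".<name>.crc" and live in the same
# directory as the data file "<name>".
count_orphan_crc() {
  dir="$1"
  find "$dir" -type f -name '.*.crc' | while IFS= read -r crc; do
    base=$(basename "$crc")
    data=${base#.}       # drop the leading dot
    data=${data%.crc}    # drop the trailing .crc
    # Emit one line per checksum file that has no matching data file.
    if [ ! -e "$(dirname "$crc")/$data" ]; then
      printf '%s\n' "$crc"
    fi
  done | wc -l | tr -d ' '
}
```

Running `count_orphan_crc .` from the root of the volume and comparing the result with the plain `*.crc` count would show whether most checksum files are orphans.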
With each .crc file taking one storage block, the used storage runs into the
GBs of data.
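As a back-of-the-envelope check of that estimate, the sketch below uses the file counts from the listing above. The block size of 4096 bytes is an assumption for illustration; the actual block size of the volume is not given in this report.

```shell
#!/bin/sh
crc_files=418053     # .crc files counted on the volume (from the report)
total_files=431796   # all entries counted on the volume (from the report)
block_bytes=4096     # ASSUMPTION: one 4 KiB block per .crc file

# Space taken by .crc files, in MiB (integer division).
crc_mib=$(( crc_files * block_bytes / 1024 / 1024 ))
# Share of .crc files among all entries, in whole percent.
crc_pct=$(( crc_files * 100 / total_files ))

echo "crc files use about ${crc_mib} MiB"
echo "crc files are about ${crc_pct}% of all entries"
```

Under that assumption the .crc files alone account for roughly 1.6 GiB, and they make up the vast majority of entries on the volume.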
These jobs are running on Kubernetes. Our shared storage provider, GlusterFS,
shows serious performance deterioration with this large number of files:
{noformat}
DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}