yaooqinn commented on a change in pull request #815:
URL: https://github.com/apache/incubator-kyuubi/pull/815#discussion_r671046917
##########
File path: docs/tools/spark_block_cleaner.md
##########
@@ -4,90 +4,105 @@
 </div>

-# Kubernetes tools spark-block-cleaner
+# Kubernetes Tools Spark Block Cleaner

 ## Requirements

 You'd better have cognition upon the following things when you want to use spark-block-cleaner.

 * Read this article
 * An active Kubernetes cluster
 * [Kubectl](https://kubernetes.io/docs/reference/kubectl/overview/)
+* [Docker](https://www.docker.com/)

 ## Purpose

+When using Spark on Kubernetes with client deploy-mode, we encountered a scenario in which the disk responsible for storing shuffle data accumulates so many files that it overflows.
-When running Spark On Kubernetes, we encountered such a situation that after using hostPath volume local-dir, the usage rate of the directory storing shuffle files remained high.
+Therefore, we chose to use Spark Block Cleaner to clear the block files accumulated by Spark.
-So an additional tool is needed to clean up the accumulated block files.
+
+## Scenes
+When you are using Spark on Kubernetes in client mode and do not use `emptyDir` as the Spark `local-dir` type, you may face the same scenario: executor pods are deleted without cleaning up all of their block files.
+
+## Principle
+When deploying Spark Block Cleaner, we configure volumes for the destination folders. Spark Block Cleaner locates these folders through the parameter `CACHE_DIRS`.
+
+Spark Block Cleaner clears these folders in a fixed loop (whose interval can be configured by `SCHEDULE_INTERVAL`). It selects folders whose names start with `blockmgr` or `spark` for deletion, matching the naming logic Spark uses when it creates them.
+
+Before deleting a file, Spark Block Cleaner determines whether the file was recently modified, that is, whether it has been untouched for longer than the time configured by `FILE_EXPIRED_TIME`. Only files older than that interval are deleted.
+
+After cleaning, Spark Block Cleaner checks the disk utilization; if the remaining free space is less than the specified value (controlled by `FREE_SPACE_THRESHOLD`), it triggers a deep clean whose file expiration time is controlled by `DEEP_CLEAN_FILE_EXPIRED_TIME`.

 ## Usage
+Before you start using Spark Block Cleaner, you should build its docker images or using official images(TODO).

##########
Review comment:
   ~~or using official images(TODO)~~
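
For readers following the discussion, the cleaning loop described in the Principle section above can be sketched in a few lines of Scala. This is a minimal sketch under assumptions: the environment-variable names (`CACHE_DIRS`, `SCHEDULE_INTERVAL`, `FILE_EXPIRED_TIME`, `FREE_SPACE_THRESHOLD`, `DEEP_CLEAN_FILE_EXPIRED_TIME`) come from the doc being reviewed, while the object name, default values, unit handling, and helper functions are invented for illustration and are not Kyuubi's actual implementation.

```scala
import java.io.File
import java.util.concurrent.TimeUnit

// Hypothetical sketch: parameter names mirror the doc; defaults and helpers are illustrative.
object SparkBlockCleanerSketch {

  private val cacheDirs: Seq[File] =
    sys.env.getOrElse("CACHE_DIRS", "/data/spark-local").split(",").toSeq.map(new File(_))
  private val scheduleIntervalSec: Long = sys.env.getOrElse("SCHEDULE_INTERVAL", "3600").toLong
  private val fileExpiredMs: Long = sys.env.getOrElse("FILE_EXPIRED_TIME", "604800").toLong * 1000L
  private val deepCleanExpiredMs: Long =
    sys.env.getOrElse("DEEP_CLEAN_FILE_EXPIRED_TIME", "432000").toLong * 1000L
  private val freeSpaceThresholdPct: Long = sys.env.getOrElse("FREE_SPACE_THRESHOLD", "60").toLong

  def main(args: Array[String]): Unit = {
    // Fixed loop: clean, then re-check disk usage and deep-clean if space is still tight.
    while (true) {
      cacheDirs.filter(_.isDirectory).foreach { dir =>
        clean(dir, fileExpiredMs)
        if (freeSpacePct(dir) < freeSpaceThresholdPct) clean(dir, deepCleanExpiredMs)
      }
      TimeUnit.SECONDS.sleep(scheduleIntervalSec)
    }
  }

  // Only folders Spark itself creates are candidates for deletion.
  private def isSparkFolder(f: File): Boolean =
    f.isDirectory && (f.getName.startsWith("blockmgr") || f.getName.startsWith("spark"))

  private def clean(root: File, expiredMs: Long): Unit = {
    val deadline = System.currentTimeMillis() - expiredMs
    Option(root.listFiles()).getOrElse(Array.empty)
      .filter(isSparkFolder)
      .foreach(deleteExpired(_, deadline))
  }

  // Delete files untouched since `deadline`; recently modified files survive.
  private def deleteExpired(f: File, deadline: Long): Unit = {
    if (f.isDirectory) {
      Option(f.listFiles()).getOrElse(Array.empty).foreach(deleteExpired(_, deadline))
      if (Option(f.listFiles()).forall(_.isEmpty)) f.delete() // drop now-empty dirs
    } else if (f.lastModified() < deadline) {
      f.delete()
    }
  }

  private def freeSpacePct(dir: File): Long =
    dir.getUsableSpace * 100L / math.max(dir.getTotalSpace, 1L)
}
```

In a Kubernetes deployment this would run in a pod with the host's local directories mounted as volumes and the parameters passed as environment variables, e.g. `CACHE_DIRS=/data/dir1,/data/dir2`; see the doc under review for the actual configuration surface.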
