Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12987 )
Change subject: IMPALA-8341: Data cache for remote reads ...................................................................... IMPALA-8341: Data cache for remote reads This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/ This change implements an IO data cache which is backed by local storage. It implicitly relies on the OS page cache management to shuffle data between memory and the storage device. This is useful for caching data read from remote filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS). A data cache is divided into one or more partitions based on the configuration string which is a list of directories, separated by comma, followed by the storage capacity per directory. An example configuration string is like the following: --data_cache_config=/data/0,/data/1:150GB In the configuration above, the cache may use up to 300GB of storage space, with 150GB max for /data/0 and /data/1 respectively. Each partition has a meta-data cache which tracks the mappings of cache keys to the locations of the cached data. A cache key is a tuple of (file's name, file's modification time, file offset) and a cache entry is a tuple of (backing file, offset in the backing file, length of the cached data, optional checksum). Note that the cache currently doesn't support overlapping ranges. In other words, if the cache contains an entry of a file for range [m, m+4MB), a lookup for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen this as a problem but this may require further evaluation in the future. Each partition stores its set of cached data in backing files created on local storage. When inserting new data into the cache, the data is appended to the current backing file in use. The storage consumption of each cache entry counts towards the quota of that partition. When a partition reaches its capacity, the least recently used (LRU) data in that partition is evicted. Evicted data is removed from the underlying storage by punching holes in the backing file it's stored in. As a backing file reaches a certain size (by default 4TB), new data will stop being appended to it and a new file will be created instead. Note that due to hole punching, the backing file is actually sparse. When the number of backing files per partition exceeds, --data_cache_max_files_per_partition, files are deleted in the order in which they are created. Stale cache entries referencing deleted files are erased lazily or evicted due to inactivity. Optionally, checksumming can be enabled to verify read from the cache is consistent with what was inserted and to verify that multiple attempted insertions with the same cache key have the same cache content. Checksumming is enabled by default for debug builds. To probe for cached data in the cache, the interface Lookup() is used; To insert data into the cache, the interface Store() is used. Please note that eviction happens inline currently during Store(). This patch also added two startup flags for start-impala-cluster.py: '--data_cache_dir' specifies the base directory in which each Impalad creates the caching directory '--data_cache_size' specifies the capacity string for each cache directory. Testing done: - added a new BE and EE test - exhaustive (debug, release) builds with cache enabled - core ASAN build with cache enabled Perf: - 16-streams TPCDS at 3TB in a 20 node S3 cluster shows about 30% improvement over runs without the cache. Each node has a cache size of 150GB per node. The performance is at parity with a configuration of a HDFS cluster using EBS as the storage. Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Reviewed-on: http://gerrit.cloudera.org:8080/12987 Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com> --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/io/CMakeLists.txt A be/src/runtime/io/data-cache-test.cc A be/src/runtime/io/data-cache.cc A be/src/runtime/io/data-cache.h M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/runtime/io/hdfs-file-reader.h M be/src/runtime/io/request-context.h M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/impalad-metrics.cc M be/src/util/impalad-metrics.h M bin/start-impala-cluster.py M common/thrift/metrics.json A testdata/workloads/functional-query/queries/QueryTest/data-cache.test M tests/common/custom_cluster_test_suite.py M tests/common/impala_test_suite.py A tests/custom_cluster/test_data_cache.py M tests/custom_cluster/test_krpc_metrics.py 23 files changed, 2,150 insertions(+), 55 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 13 Gerrit-Owner: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: David Rorke <dro...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Thomas Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org>