[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. IMPALA-8341: Data cache for remote reads This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/ This change implements an IO data cache backed by local storage. It relies implicitly on the OS page cache to shuffle data between memory and the storage device. This is useful for caching data read from remote filesystems (e.g. a remote HDFS data node, S3, ABFS, ADLS). A data cache is divided into one or more partitions based on a configuration string, which is a comma-separated list of directories followed by the storage capacity per directory. An example configuration string: --data_cache_config=/data/0,/data/1:150GB With the configuration above, the cache may use up to 300GB of storage space, with a 150GB maximum for each of /data/0 and /data/1. Each partition has a metadata cache which tracks the mappings of cache keys to the locations of the cached data. A cache key is a tuple of (file name, file modification time, file offset) and a cache entry is a tuple of (backing file, offset in the backing file, length of the cached data, optional checksum). Note that the cache currently doesn't support overlapping ranges. In other words, if the cache contains an entry for a file's range [m, m+4MB), a lookup for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen this be a problem, but it may require further evaluation in the future. Each partition stores its cached data in backing files created on local storage. When new data is inserted into the cache, it is appended to the backing file currently in use. The storage consumption of each cache entry counts towards the quota of its partition. When a partition reaches its capacity, the least recently used (LRU) data in that partition is evicted. 
Evicted data is removed from the underlying storage by punching holes in the backing file it's stored in. Once a backing file reaches a certain size (4TB by default), new data stops being appended to it and a new file is created instead. Note that due to hole punching, the backing files are actually sparse. When the number of backing files per partition exceeds --data_cache_max_files_per_partition, files are deleted in the order in which they were created. Stale cache entries referencing deleted files are erased lazily or evicted due to inactivity. Optionally, checksumming can be enabled to verify that data read from the cache is consistent with what was inserted, and that multiple attempted insertions with the same cache key have the same content. Checksumming is enabled by default for debug builds. To probe for cached data, use the Lookup() interface; to insert data into the cache, use the Store() interface. Please note that eviction currently happens inline during Store(). This patch also adds two startup flags for start-impala-cluster.py: '--data_cache_dir' specifies the base directory in which each Impalad creates its caching directory; '--data_cache_size' specifies the capacity string for each cache directory. Testing done: - added a new BE test and a new EE test - exhaustive (debug, release) builds with the cache enabled - core ASAN build with the cache enabled Perf: - A 16-stream TPC-DS run at 3TB on a 20-node S3 cluster shows about a 30% improvement over runs without the cache, with a 150GB cache per node. Performance is at parity with an HDFS cluster using EBS as the storage. 
Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Reviewed-on: http://gerrit.cloudera.org:8080/12987 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/io/CMakeLists.txt A be/src/runtime/io/data-cache-test.cc A be/src/runtime/io/data-cache.cc A be/src/runtime/io/data-cache.h M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/runtime/io/hdfs-file-reader.h M be/src/runtime/io/request-context.h M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/impalad-metrics.cc M be/src/util/impalad-metrics.h M bin/start-impala-cluster.py M common/thrift/metrics.json A testdata/workloads/functional-query/queries/QueryTest/data-cache.test M tests/common/custom_cluster_test_suite.py M tests/common/impala_test_suite.py A tests/custom_cluster/test_data_cache.py M tests/custom_cluster/test_krpc_metrics.py 23 files changed, 2,150 insertions(+), 55 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 12: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 12 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 03 May 2019 19:39:41 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 12: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4148/ DRY_RUN=false -- Gerrit-Comment-Date: Fri, 03 May 2019 14:10:25 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 12: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4147/ -- Gerrit-Comment-Date: Fri, 03 May 2019 10:32:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 12: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4147/ DRY_RUN=false -- Gerrit-Comment-Date: Fri, 03 May 2019 08:06:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 12: Code-Review+2 -- Gerrit-Comment-Date: Fri, 03 May 2019 08:06:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3051/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- Gerrit-Comment-Date: Fri, 03 May 2019 03:42:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 11: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4146/ DRY_RUN=false -- Gerrit-Comment-Date: Fri, 03 May 2019 02:49:46 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 10: Code-Review+2 (1 comment) Carry Todd's +2 http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc@260 PS9, Line 260: explicit CacheEntry(const Slice& value) { > explicit :) Done -- Gerrit-Comment-Date: Fri, 03 May 2019 02:46:01 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 11: Code-Review+2 -- Gerrit-Comment-Date: Fri, 03 May 2019 02:49:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has uploaded a new patch set (#10). ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. [Commit message and file list identical to the merged change above.] git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/10 -- Gerrit-MessageType: newpatchset Gerrit-PatchSet: 10
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 9: Code-Review+2 (1 comment) just simple nit, otherwise +2 http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc@260 PS9, Line 260: CacheEntry(const Slice& value) { explicit :) -- Gerrit-Comment-Date: Thu, 02 May 2019 17:58:29 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3031/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- Gerrit-Comment-Date: Thu, 02 May 2019 03:07:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 9: Code-Review+1 Carry Lars' +1 -- Gerrit-Comment-Date: Thu, 02 May 2019 02:24:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. [Commit message and file list identical to the merged change above.] git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/9 -- Gerrit-MessageType: newpatchset Gerrit-PatchSet: 9
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 8: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/3029/ : Initial code review checks failed. See linked job for details on the failure. -- Gerrit-Comment-Date: Thu, 02 May 2019 01:26:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 8: (5 comments) Accidentally pushed a draft. Will push PS9. http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc File be/src/runtime/io/data-cache-test.cc: http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc@274 PS7, Line 274: > I think gflags has a built in 'google::FlagsSaver', no? It handles restorin Nice. I don't think we have an equivalent of that in Impala although it's a nice improvement. Filed IMPALA-8480. http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@245 PS7, Line 245: explicit CacheFile(std::string path) : path_(move(path)) { } > nit: explicit Done http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@301 PS7, Line 301: > here I think you could just call lock_.DCheckLocked() right? Done http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@409 PS7, Line 409: uint8_t* buffer) { > why not have UnpackCacheEntry return the CacheEntry object? 
(the generated Done http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h File be/src/util/impalad-metrics.h: http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h@91 PS7, Line 91: not ins > In the rest of the code "skipped" often means "pruned" and is a good thing, Done -- Gerrit-Comment-Date: Thu, 02 May 2019 00:50:47 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. IMPALA-8341: Data cache for remote reads This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/ This change implements an IO data cache which is backed by local storage. It implicitly relies on the OS page cache management to shuffle data between memory and the storage device. This is useful for caching data read from remote filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS). A data cache is divided into one or more partitions based on the configuration string which is a list of directories, separated by comma, followed by the storage capacity per directory. An example configuration string is like the following: --data_cache_config=/data/0,/data/1:150GB In the configuration above, the cache may use up to 300GB of storage space, with 150GB max for /data/0 and /data/1 respectively. Each partition has a meta-data cache which tracks the mappings of cache keys to the locations of the cached data. A cache key is a tuple of (file's name, file's modification time, file offset) and a cache entry is a tuple of (backing file, offset in the backing file, length of the cached data, optional checksum). Note that the cache currently doesn't support overlapping ranges. In other words, if the cache contains an entry of a file for range [m, m+4MB), a lookup for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen this as a problem but this may require further evaluation in the future. Each partition stores its set of cached data in backing files created on local storage. When inserting new data into the cache, the data is appended to the current backing file in use. The storage consumption of each cache entry counts towards the quota of that partition. When a partition reaches its capacity, the least recently used (LRU) data in that partition is evicted. 
Evicted data is removed from the underlying storage by punching holes in the backing file it's stored in. Once a backing file reaches a certain size (by default 4TB), new data stops being appended to it and a new file is created instead. Note that due to hole punching, the backing file is actually sparse. When the number of backing files per partition exceeds --data_cache_max_files_per_partition, files are deleted in the order in which they were created. Stale cache entries referencing deleted files are erased lazily or evicted due to inactivity. Optionally, checksumming can be enabled to verify that data read from the cache is consistent with what was inserted and that multiple attempted insertions with the same cache key have the same content. Checksumming is enabled by default for debug builds. To probe for cached data, the interface Lookup() is used; to insert data into the cache, the interface Store() is used. Please note that eviction currently happens inline during Store(). This patch also adds two startup flags for start-impala-cluster.py: '--data_cache_dir' specifies the base directory in which each Impalad creates its caching directory; '--data_cache_size' specifies the capacity string for each cache directory. Testing done: - added new BE and EE tests - exhaustive (debug, release) builds with cache enabled - core ASAN build with cache enabled Perf: - 16-stream TPC-DS at 3TB on a 20-node S3 cluster shows about 30% improvement over runs without the cache, with a cache size of 150GB per node. The performance is at parity with an HDFS cluster configuration using EBS as the storage. 
Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/io/CMakeLists.txt A be/src/runtime/io/data-cache-test.cc A be/src/runtime/io/data-cache.cc A be/src/runtime/io/data-cache.h M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/runtime/io/hdfs-file-reader.h M be/src/runtime/io/request-context.h M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/impalad-metrics.cc M be/src/util/impalad-metrics.h M bin/start-impala-cluster.py M common/thrift/metrics.json A testdata/workloads/functional-query/queries/QueryTest/data-cache.test M tests/common/custom_cluster_test_suite.py M tests/common/impala_test_suite.py A tests/custom_cluster/test_data_cache.py M tests/custom_cluster/test_krpc_metrics.py 23 files changed, 2,150 insertions(+), 55 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/8 -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I734803c1c1787c858d
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 7: (4 comments) http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc File be/src/runtime/io/data-cache-test.cc: http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc@274 PS7, Line 274: ScopedFlagSetter::Make(&FLAGS_data_cache_file_max_size_bytes, 1024 * 1024); I think gflags has a built-in 'google::FlagSaver', no? It handles restoring all modified flags upon scope exit. So you could just do: google::FlagSaver s; FLAGS_data_cache_file_max_size_bytes = 1024 * 1024; (in Kudu we do the FlagSaver thing in our base test class; not sure if there is something equivalent in Impala) http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@245 PS7, Line 245: CacheFile(std::string path) : path_(move(path)) { } nit: explicit http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@301 PS7, Line 301: DCHECK(partition_lock.owns_lock()); here I think you could just call lock_.DCheckLocked(), right? http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@409 PS7, Line 409: UnpackCacheEntry(meta_cache_->Value(handle), &entry); why not have UnpackCacheEntry return the CacheEntry object? (the generated code would be equivalent due to return value optimization) Or make a constructor for CacheEntry which takes a Slice directly? 
-- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 7 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 01 May 2019 22:47:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 7: Code-Review+1 (3 comments) http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@337 PS5, Line 337: }; > The reason for using a pool is that we want to be able to queue the work fo That makes sense, thx. http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h File be/src/util/impalad-metrics.h: http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h@91 PS7, Line 91: skipped In the rest of the code "skipped" often means "pruned" and is a good thing, e.g. we figured out we didn't have to read something. Here its meaning is closer to "rejected", so would IO_MGR_REMOTE_DATA_CACHE_REJECTED_BYTES be a better choice? I don't feel strongly about it. http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py File tests/custom_cluster/test_data_cache.py: http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py@23 PS5, Line 23: : """ This test enables the data cache an > It checks both. It checks the metrics in the python code but also checks th Thx for clarifying. 
-- To view, visit http://gerrit.cloudera.org:8080/12987 Gerrit-MessageType: comment Gerrit-PatchSet: 7 Gerrit-Comment-Date: Wed, 01 May 2019 21:04:33 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/3021/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12987 Gerrit-MessageType: comment Gerrit-PatchSet: 7 Gerrit-Comment-Date: Wed, 01 May 2019 02:34:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 7: (41 comments) http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc File be/src/runtime/io/data-cache-test.cc: http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc@273 PS5, Line 273: auto s = > I just found out we have ScopedFlagSetter in scoped-flag-setter.h, I think Done http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@112 PS6, Line 112: : /// 'config' is the configurati > per the commit message, we've moved to a single quota rather than per-direc Yes, comments were stale. Fixed in new patch. http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@261 PS6, Line 261: > perhaps init to -1? Done. Switched to initializing it to 0 in CreateCacheFile() instead. http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@215 PS5, Line 215: > 'start_reclaim'? Done http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@337 PS5, Line 337: }; > Can you mention in the comment that the pool has only 1 thread and why you' The reason for using a pool is that we want to be able to queue the work for deferred processing by another thread, and ThreadPool seems to provide the right abstraction. Maybe there are other classes in utility/ which are also applicable? http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@341 PS5, Line 341: > Some functions around deleting files are called "Close...". 
We should point Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@306 PS4, Line 306: cache_files_.emplace_back(std::mo > This check is racy. More allocation could have happened since 'insertion_of Done http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@95 PS5, Line 95: ng files opened, ol > switch to single thread, or mention pool here Done http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@112 PS5, Line 112: DeleteFile > Can we call this DeleteFile? Otherwise there's a third thing to keep track Done http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@125 PS5, Line 125: d the lock in > It's not obvious to me why we only need a percpu_rwlock here. Can you add a Done. Please also see explanation in comments above. http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@208 PS5, Line 208: > nit: singular Done http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@335 PS5, Line 335: ing files left over fro > Similar to other comments, I'd call this "VerifySizeAndDeleteFiles", I thin Done http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@395 PS5, Line 395: } > Will this handle hole punching through the eviction logic? Yes. http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@436 PS5, Line 436: // Try verifying the checksum of the new buffer matches that of the existing entry. > nit: only append the "checksum $3" part if checksumming is enabled? I don't Keeping it for simplicity. http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@457 PS5, Line 457: == nullptr)) r > start_reclaim? 
Done http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@633 PS5, Line 633: URN_IF_ERROR(p > start_reclaim? Done http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@64 PS6, Line 64: DEFINE_int64(data_cache_file_max_size_bytes, 1L << 40 /* 1TB */, > - can you add a comment here like /* 4TB */? Done. Yes, according to https://access.redhat.com/solutions/1532, ext4 should support up to 16TB. Apparently, there is still a need to support ext3, which has a limit of 2TB. So, I will set it to 1TB to be safe. http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@68 PS6, Line 68: "(Advanced) The maximum number of allowed opened files. This must be at least the " > Setting this per-partition creates a dependency between this and the number Makes sense. Done. http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@70 PS6, Line 70: DEFINE_int32(data_cache_wri
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12987 to look at the new patch set (#7). Change subject: IMPALA-8341: Data cache for remote reads .. IMPALA-8341: Data cache for remote reads This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/ This change implements an IO data cache backed by local storage. It implicitly relies on the OS page cache to shuffle data between memory and the storage device. This is useful for caching data read from remote filesystems (e.g. remote HDFS data nodes, S3, ABFS, ADLS). A data cache is divided into one or more partitions based on the configuration string, which is a comma-separated list of directories followed by the storage capacity per directory. An example configuration string looks like the following: --data_cache_config=/data/0,/data/1:150GB In the configuration above, the cache may use up to 300GB of storage space, with at most 150GB each for /data/0 and /data/1. Each partition has a meta-data cache which tracks the mappings of cache keys to the locations of the cached data. A cache key is a tuple of (file name, file modification time, file offset) and a cache entry is a tuple of (backing file, offset in the backing file, length of the cached data, optional checksum). Note that the cache currently doesn't support overlapping ranges. In other words, if the cache contains an entry for a file range [m, m+4MB), a lookup for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen this be a problem, but it may require further evaluation in the future. Each partition stores its set of cached data in backing files created on local storage. When inserting new data into the cache, the data is appended to the current backing file in use. 
The storage consumption of each cache entry counts towards the quota of that partition. When a partition reaches its capacity, the least recently used (LRU) data in that partition is evicted. Evicted data is removed from the underlying storage by punching holes in the backing file it's stored in. Once a backing file reaches a certain size (by default 4TB), new data stops being appended to it and a new file is created instead. Note that due to hole punching, the backing file is actually sparse. When the number of backing files per partition exceeds --data_cache_max_files_per_partition, files are deleted in the order in which they were created. Stale cache entries referencing deleted files are erased lazily or evicted due to inactivity. Optionally, checksumming can be enabled to verify that data read from the cache is consistent with what was inserted and that multiple attempted insertions with the same cache key have the same content. Checksumming is enabled by default for debug builds. To probe for cached data, the interface Lookup() is used; to insert data into the cache, the interface Store() is used. Please note that eviction currently happens inline during Store(). This patch also adds two startup flags for start-impala-cluster.py: '--data_cache_dir' specifies the base directory in which each Impalad creates its caching directory; '--data_cache_size' specifies the capacity string for each cache directory. Testing done: - added new BE and EE tests - exhaustive (debug, release) builds with cache enabled - core ASAN build with cache enabled Perf: - 16-stream TPC-DS at 3TB on a 20-node S3 cluster shows about 30% improvement over runs without the cache, with a cache size of 150GB per node. The performance is at parity with an HDFS cluster configuration using EBS as the storage. 
Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/io/CMakeLists.txt A be/src/runtime/io/data-cache-test.cc A be/src/runtime/io/data-cache.cc A be/src/runtime/io/data-cache.h M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/runtime/io/hdfs-file-reader.h M be/src/runtime/io/request-context.h M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/impalad-metrics.cc M be/src/util/impalad-metrics.h M bin/start-impala-cluster.py M common/thrift/metrics.json A testdata/workloads/functional-query/queries/QueryTest/data-cache.test M tests/common/custom_cluster_test_suite.py M tests/common/impala_test_suite.py A tests/custom_cluster/test_data_cache.py M tests/custom_cluster/test_krpc_metrics.py 23 files changed, 2,157 insertions(+), 55 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/7 -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscri
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 6: (24 comments) http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@112 PS6, Line 112: /// 'config' is the configuration string which specifies a list of /// tuples, delimited by comma. per the commit message, we've moved to a single quota rather than per-directory quotas, right? or is that a typo in the commit message? http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@261 PS6, Line 261: int oldest_opened_file_ = 0; perhaps init to -1? http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@64 PS6, Line 64: DEFINE_int64(data_cache_file_max_size, 4L << 40, - can you add a comment here like /* 4TB */? - have we tested that 4TB actually works in a long-running cluster? Now that you have the deletion support in, maybe 1TB is a safer default if we're not sure about full FS support? - can you rename to _max_size_bytes? http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@68 PS6, Line 68: "(Advanced) The maximum number of allowed opened files per partition."); Setting this per-partition creates a dependency between this and the number of partitions. I think it would be better to have this be a total, and then auto-set the per-partition limit by dividing the capacity among the partitions. Otherwise it's likely people will have to set this to keep the fd limit in check, right? Or do we generally assume that ulimit -n is boosted super high for impala? 
http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@70 PS6, Line 70: "(Advanced) Number of concurrent threads allowed to insert into the cache"); is this per-partition? should be, right? http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@117 PS6, Line 117: KUDU_RETURN_IF_ERROR(kudu::Env::Default()->NewRWFile(path, &cache_file->file_), why not pass the RWFile into the CacheFile constructor vs creating an empty one and calling NewRWFile here? http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@141 PS6, Line 141: kudu::Status status = kudu::Env::Default()->DeleteFile(path_); WARN_NOT_OK could be used here (I think WARN is more appropriate than ERROR since no data is lost, etc) http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@153 PS6, Line 153: inline 'inline' here and elsewhere isn't necessary since you've defined them inline inside the class anyway http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@174 PS6, Line 174: if (UNLIKELY(!file_)) return false; worth a DCHECK that offset + bytes_to_read <= current_offset_ http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@192 PS6, Line 192: kudu::Status status = file_->Write(offset, Slice(buffer, buffer_len)); same DCHECK suggested above http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@205 PS6, Line 205: if (UNLIKELY(!file_)) return; same http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@228 PS6, Line 228: used for synchronization instead of just saying "used for synchronization", I think it's best to say "taken in write mode during deletion, and shared mode everywhere else" http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@281 PS6, Line 281: Status DataCache::Partition::CreateCacheFile() { can you DCHECK that lock_ is held by the current thread here? 
(same elsewhere in functions that require the lock to be held on entry) http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@347 PS6, Line 347: KUDU_RETURN_IF_ERROR(env->GetFileSizeOnDisk(file->path(), &logical_sz), : "CloseAndVerifyFileSizes()"); is this the right method call? seems the same as above http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@352 PS6, Line 352: resize(0); .clear() http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@360 PS6, Line 360: void DataCache::Partition::Close() { dcheck the lock is held? http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@380 PS6, Line 380: const CacheEntry* entry = reinterpret_cast<const CacheEntry*>(value_slice.data()); I think this pattern is used pretty widely in Impala but it's moderately sketchy, since it assumes that value_slice's allocation has the same alignment requirements as CacheEntry. That's likely but not really guaranteed by anything, and we've had crashes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 5: (17 comments) http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc File be/src/runtime/io/data-cache-test.cc: http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc@273 PS5, Line 273: FLAGS_data_cache_file_max_size = 1024 * 1024; I just found out we have ScopedFlagSetter in scoped-flag-setter.h, I think it fits here and in the other tests. http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@215 PS5, Line 215: too_many_files 'start_reclaim'? http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@337 PS5, Line 337: std::unique_ptr> file_deleter_pool_; Can you mention in the comment that the pool has only 1 thread and why you're using a pool? I think it's because the pool makes handling the thread's lifetime easier, but I'm not sure that's correct. http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@341 PS5, Line 341: void CloseOldFiles(uint32_t thread_id, int partition_idx); Some functions around deleting files are called "Close...". We should point out in the comments somewhere that closing now also deletes. We could also rename the thread pool to file_closing_pool or rename the methods to "DeleteOldFiles" for consistency. I think I prefer the latter, since deletion implies closing, but the contraposition is not obvious. http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@72 PS4, Line 72: "(Advanced) Enable checksumming for the cached buffer."); > This is actually a static class member of DataCache. Sry for missing that. 
http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@187 PS4, Line 187: inline > Not sure which one you are referring to ? Isn't it in #include "common/name Yeah, I think we commonly omit the explicit include for vector http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@95 PS5, Line 95: file deleter thread switch to single thread, or mention pool here http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@112 PS5, Line 112: RetireFile Can we call this DeleteFile? Otherwise there's a third thing to keep track of (Close, Delete, Retire) and the differences are subtle. I feel it's clear enough that DeleteFile would make sure it's closed. http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@125 PS5, Line 125: percpu_rwlock It's not obvious to me why we only need a percpu_rwlock here. Can you add a comment? http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@208 PS5, Line 208: holes nit: singular http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@335 PS5, Line 335: CloseAndVerifyFileSizes Similar to other comments, I'd call this "VerifySizeAndDeleteFiles", I think that captures well what's going on and the caller can expect the files to get closed. I don't feel strongly about that one though. http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@395 PS5, Line 395: meta_cache_->Erase(key); Will this handle hole punching through the eviction logic? http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@436 PS5, Line 436: VLOG(2) << Substitute("Storing file $0 offset $1 len $2 checksum $3 ", nit: only append the "checksum $3" part if checksumming is enabled? I don't feel strongly about it though. 
http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@457 PS5, Line 457: too_many_files start_reclaim? http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@633 PS5, Line 633: too_many_files start_reclaim? http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/hdfs-file-reader.cc File be/src/runtime/io/hdfs-file-reader.cc: http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/hdfs-file-reader.cc@37 PS5, Line 37: nit: trailing space http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py File tests/custom_cluster/test_data_cache.py: http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py@23 PS5, Line 23: cache hit and miss counts : in the runtime profile are as expected. It actually seems to check the metrics, not the profile counters. -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerr
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 6: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2975/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12987 Gerrit-MessageType: comment Gerrit-PatchSet: 6 Gerrit-Comment-Date: Mon, 29 Apr 2019 19:28:23 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12987 to look at the new patch set (#6). Change subject: IMPALA-8341: Data cache for remote reads .. IMPALA-8341: Data cache for remote reads This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/ This change implements an IO data cache backed by local storage. It implicitly relies on the OS page cache to shuffle data between memory and the storage device. This is useful for caching data read from remote filesystems (e.g. remote HDFS data nodes, S3, ABFS, ADLS). A data cache is divided into one or more partitions based on the configuration string, which is a comma-separated list of directories followed by the storage capacity per directory. An example configuration string looks like the following: --data_cache_config=/data/0,/data/1:150GB In the configuration above, the cache may use up to 300GB of storage space, with at most 150GB each for /data/0 and /data/1. Each partition has a meta-data cache which tracks the mappings of cache keys to the locations of the cached data. A cache key is a tuple of (file name, file modification time, file offset) and a cache entry is a tuple of (backing file, offset in the backing file, length of the cached data, optional checksum). Note that the cache currently doesn't support overlapping ranges. In other words, if the cache contains an entry for a file range [m, m+4MB), a lookup for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen this be a problem, but it may require further evaluation in the future. Each partition stores its set of cached data in backing files created on local storage. When inserting new data into the cache, the data is appended to the current backing file in use. 
The storage consumption of each cache entry counts towards the quota of that partition. When a partition reaches its capacity, the least recently used (LRU) data in that partition is evicted. Evicted data is removed from the underlying storage by punching holes in the backing file it's stored in. Once a backing file reaches a certain size (by default 4TB), new data stops being appended to it and a new file is created instead. Note that due to hole punching, the backing file is actually sparse. When the number of backing files per partition exceeds --data_cache_max_files_per_partition, files are deleted in the order in which they were created. Stale cache entries referencing deleted files are erased lazily or evicted due to inactivity. Optionally, checksumming can be enabled to verify that data read from the cache is consistent with what was inserted and that multiple attempted insertions with the same cache key have the same content. Checksumming is enabled by default for debug builds. To probe for cached data, the interface Lookup() is used; to insert data into the cache, the interface Store() is used. Please note that eviction currently happens inline during Store(). This patch also adds two startup flags for start-impala-cluster.py: '--data_cache_dir' specifies the base directory in which each Impalad creates its caching directory; '--data_cache_size' specifies the capacity string for each cache directory. Testing done: - added new BE and EE tests - exhaustive (debug, release) builds with cache enabled - core ASAN build with cache enabled Perf: - 16-stream TPC-DS at 3TB on a 20-node S3 cluster shows about 30% improvement over runs without the cache, with a cache size of 150GB per node. The performance is at parity with an HDFS cluster configuration using EBS as the storage. 
Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/io/CMakeLists.txt A be/src/runtime/io/data-cache-test.cc A be/src/runtime/io/data-cache.cc A be/src/runtime/io/data-cache.h M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/runtime/io/hdfs-file-reader.h M be/src/runtime/io/request-context.h M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/impalad-metrics.cc M be/src/util/impalad-metrics.h M bin/start-impala-cluster.py M common/thrift/metrics.json A testdata/workloads/functional-query/queries/QueryTest/data-cache.test M tests/common/custom_cluster_test_suite.py M tests/common/impala_test_suite.py A tests/custom_cluster/test_data_cache.py M tests/custom_cluster/test_krpc_metrics.py 23 files changed, 2,062 insertions(+), 55 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/6 -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 5: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2948/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 5 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sat, 27 Apr 2019 08:06:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@61 PS4, Line 61: Testing done: a new BE test was added; core test with cache enabled. > I have a bit of trouble in getting this to work in the mini-cluster. May be Ended up adding a startup flag to force use of the cache even for local reads. A new custom cluster test was also added as a sanity check. -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 4 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sat, 27 Apr 2019 07:27:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py File tests/custom_cluster/test_data_cache.py: http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py@22 PS5, Line 22: class TestDataCache(CustomClusterTestSuite): flake8: E302 expected 2 blank lines, found 1 -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 5 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sat, 27 Apr 2019 07:23:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache backed by local storage. It implicitly relies on the OS page cache to shuffle data between memory and the storage device. This is useful for caching data read from remote filesystems (e.g. a remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on a configuration string, which is a comma-separated list of directories followed by the storage capacity per directory. An example configuration string:

--data_cache_config=/data/0,/data/1:150GB

With the configuration above, the cache may use up to 300GB of storage in total, with a 150GB cap for each of /data/0 and /data/1.

Each partition has a metadata cache which tracks the mappings of cache keys to the locations of the cached data. A cache key is a tuple of (file name, file modification time, file offset) and a cache entry is a tuple of (backing file, offset in the backing file, length of the cached data, optional checksum). Note that the cache currently doesn't support overlapping ranges: if the cache contains an entry of a file for the range [m, m+4MB), a lookup for [m+4K, m+8K) will miss. In practice we haven't seen this as a problem, but it may require further evaluation in the future.

Each partition stores its cached data in backing files created on local storage. When new data is inserted into the cache, it is appended to the backing file currently in use.
Evicted data is removed from the underlying storage by punching holes in the backing file it's stored in. Once a backing file reaches a certain size (4TB by default), new data stops being appended to it and a new file is created instead. Note that due to hole punching, the backing files are actually sparse. When the number of backing files in a partition exceeds --data_cache_max_files_per_partition, files are deleted in the order in which they were created. Stale cache entries referencing deleted files are either erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify that data read from the cache is consistent with what was inserted, and that multiple insertion attempts with the same cache key have the same content. Checksumming is enabled by default in debug builds.

To probe for cached data, use the Lookup() interface; to insert data into the cache, use Store(). Please note that eviction currently happens inline during Store().

This patch also adds two startup flags for start-impala-cluster.py: '--data_cache_dir' specifies the base directory in which each Impalad creates the caching directory; '--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added a new BE test and an EE test
- exhaustive (debug, release) builds with the cache enabled
- core ASAN build with the cache enabled

Perf:
- 16-stream TPC-DS at 3TB on a 20-node S3 cluster shows about 30% improvement over runs without the cache, with a 150GB cache on each node. The performance is at parity with an HDFS cluster using EBS as the storage.
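The optional checksumming described above can be mimicked in a few lines. This is a hedged sketch using CRC32 rather than Impala's actual checksum code; `store`, `lookup`, and `stored` are illustrative names:

```python
import zlib

# Sketch of the optional checksum: compute a CRC at store time, verify
# it at lookup time to catch reads that are inconsistent with what was
# inserted, and reject re-insertions under the same key with different
# content.
stored = {}  # key -> (data, crc)

def store(key, data):
    crc = zlib.crc32(data)
    if key in stored:
        # A repeated insertion must carry identical content.
        assert stored[key][1] == crc, "same key, different content"
        return
    stored[key] = (data, crc)

def lookup(key):
    data, crc = stored[key]
    if zlib.crc32(data) != crc:
        raise IOError("cache read inconsistent with what was inserted")
    return data
```

In a debug build this kind of verification is always on, trading extra CPU per Lookup()/Store() for early detection of corruption.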
Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/io/CMakeLists.txt A be/src/runtime/io/data-cache-test.cc A be/src/runtime/io/data-cache.cc A be/src/runtime/io/data-cache.h M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/runtime/io/hdfs-file-reader.h M be/src/runtime/io/request-context.h M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/impalad-metrics.cc M be/src/util/impalad-metrics.h M bin/start-impala-cluster.py M common/thrift/metrics.json A testdata/workloads/functional-query/queries/QueryTest/data-cache.test M tests/common/custom_cluster_test_suite.py M tests/common/impala_test_suite.py A tests/custom_cluster/test_data_cache.py M tests/custom_cluster/test_krpc_metrics.py 23 files changed, 2,059 insertions(+), 55 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/5 -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 5: (32 comments) http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@61 PS4, Line 61: '--data_cache_dir' specifies the base directory in which each Impalad > Can we add a custom cluster test to sanity check that it works end to end. I have a bit of trouble in getting this to work in the mini-cluster. Maybe it's easier with the dockerised test. http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@62 PS4, Line 62: creates the caching directory > Do we want to consider enabling this by default for the dockerised tests as Definitely. Will also do so for S3 builds. http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc@365 PS4, Line 365: "DataCacheHitCount", TUnit::UNIT); > Do we have tests for these to make sure they show up in the profiles and wo Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc File be/src/runtime/io/data-cache-test.cc: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@138 PS4, Line 138: > Move to the other test constants (TEMP_BUFFER_SIZE etc)? Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@226 PS4, Line 226: > nit: This could now be 4 * FLAGS_data_cache_file_max_size ? Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@349 PS4, Line 349: num_entries = 0 > Can they be two separate tests?
Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@410 PS4, Line 410: mp_buf > nit: typo Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@83 PS4, Line 83: eviction from it happen s > That's controlled by --data_cache_write_concurrency, right? Mention here? Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@87 PS4, Line 87: /// of 4KB so any data inserted will be rounded up to the nearest multiple of 4KB. > Do we plan to look into picking partitions on faster disks with higher prob Ideally, we want to keep the hotter data in the faster media while keeping the lukewarm data in the slower media. Added a TODO. http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88 PS4, Line 88: /// > Yeah this scenario is a bit concerning for me still since it's conceivable I added a "simple" implementation with rw-lock and lazy cache entry eviction. If it's deemed too complicated, please let me know and I can undo it. Also added a test case for it. http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@145 PS4, Line 145: , > nit: formatting Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@197 PS4, Line 197: /// - removes any stale backing file in this partition > Should we also delete the files when we close them? There's a distinction i This was needed for data-cache-test.cc as we need to close the files before verifying their sizes. However, it seems that we can hide all those internal details in VerifyFileSizes(), which is renamed to CloseAndVerifyFileSizes(); http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@203 PS4, Line 203: > Should we pass const CacheKey& here and convert it in the implementation? 
Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@224 PS4, Line 224: void EvictedEnt > nit: VerifyFileSizes Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@72 PS4, Line 72: "(Advanced) Enable checksumming for the cached buffer."); > static const char*? This is actually a static class member of DataCache. http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@75 PS4, Line 75: namespace io { > Should this be a class, given it has a c'tor and d'tor? Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@173 PS4, Line 173: ock(lock_.get_lock()); : if (UNLIKELY(!file_)) return fals > I think you can merge these two lines, which also reduces the risk that som Done http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@187 PS4, Line 187: inline > nit: missing include, but we might generally omit this one. Not sure which one you are referring to ? Isn't it in #include "common/names.h" ? http://gerrit.cloudera.org:8080/#/c/1298
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88 PS4, Line 88: /// - bound number of backing files per partition by consolidating the content of very > I think with our current assumption of 4TB of logical space per file, if yo Oh I didn't realise the files were that large, ok, that makes sense then. -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 4 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 25 Apr 2019 18:10:57 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88 PS4, Line 88: /// - bound number of backing files per partition by consolidating the content of very > Yeah this scenario is a bit concerning for me still since it's conceivable I think with our current assumption of 4TB of logical space per file, if you assume you're writing a relatively aggressive estimate of 100MB/sec to the cache in a super heavy workload, then you'll only need to roll to a new file every 11 hours. So, 1000 file descriptors will last you over a year. Given that fd counts can be set into the 100k range without any real issues, I don't think this is going to be too problematic under these assumptions. If we find that we need to support a smaller "rolling" interval than 4TB for some reason, we'll definitely need to address it, but it seems like a lot of complexity to take for now. -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 4 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 25 Apr 2019 18:05:56 + Gerrit-HasComments: Yes
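Todd's back-of-envelope estimate checks out numerically; the arithmetic, spelled out:

```python
# With 4TB of logical space per backing file and a sustained write rate
# of 100 MB/s into the cache, compute how long one file lasts before
# rolling, and how long a budget of 1000 file descriptors would last.
TB = 1024 ** 4
MB = 1024 ** 2

seconds_per_file = 4 * TB / (100 * MB)            # ~41943 seconds to fill one file
hours_per_file = seconds_per_file / 3600          # roughly 11.7 hours, i.e. "every 11 hours"
days_for_1000_files = 1000 * hours_per_file / 24  # ~485 days, i.e. "over a year"
```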
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 4: (3 comments) I don't know that I'll have time to do a fully detailed review, and it looks like we may already have enough eyes on it. The design and APIs look great - it does seem like we want to get it checked in and exercised more soon. http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@61 PS4, Line 61: Testing done: a new BE test was added; core test with cache enabled. Can we add a custom cluster test to sanity check that it works end to end. Maybe this is where we should sanity check metrics too. http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@62 PS4, Line 62: Do we want to consider enabling this by default for the dockerised tests as a next step? http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88 PS4, Line 88: /// - bound number of backing files per partition by consolidating the content of very > Should we just delete old files if we have reached a configurable high numb Yeah this scenario is a bit concerning for me still since it's conceivable that the workloads we test with might have different properties to real ones. E.g. I can imagine a workload might end up with some set of files that are queried frequently enough to stay resident in the cache indefinitely and keep very old files alive. Not sure if Lars' idea is easier than compaction but it is appealing in some ways. 
-- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 4 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 23 Apr 2019 23:38:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 4: (31 comments) Mostly looks good to me. Please feel free to defer some of the comments to a subsequent change to make faster progress with this one. http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc@365 PS4, Line 365: "DataCacheHitCount", TUnit::UNIT); Do we have tests for these to make sure they show up in the profiles and work as expected? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc File be/src/runtime/io/data-cache-test.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@330 PS1, Line 330: for (int i = 0; i < NUM_THREADS; ++i) { > Prefer to keep this test as stand-alone to make development easier. wfm http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc File be/src/runtime/io/data-cache-test.cc: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@138 PS4, Line 138: const char* fname_ = "foobar"; Move to the other test constants (TEMP_BUFFER_SIZE etc)? http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@226 PS4, Line 226: const int64_t cache_size = 4 * 1024 * 1024; nit: This could now be 4 * FLAGS_data_cache_file_max_size ? http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@349 PS4, Line 349: This second part Can they be two separate tests? 
http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@410 PS4, Line 410: beyone nit: typo http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@83 PS4, Line 83: concurrency of one thread That's controlled by --data_cache_write_concurrency, right? Mention here? http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@87 PS4, Line 87: /// Future work: Do we plan to look into picking partitions on faster disks with higher probability? http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88 PS4, Line 88: /// - bound number of backing files per partition by consolidating the content of very Should we just delete old files if we have reached a configurable high number, 1000 or so, and take the resulting cache hits instead of compacting? http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@145 PS4, Line 145: nit: formatting http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@197 PS4, Line 197: /// to be called in a single threaded environment after all IO threads exited. Should we also delete the files when we close them? There's a distinction in the implementation and it's not clear to me why it's there. http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@203 PS4, Line 203: const kudu::Slice& key Should we pass const CacheKey& here and convert it in the implementation? http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@224 PS4, Line 224: VerifyFilesSizes nit: VerifyFileSizes http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@72 PS4, Line 72: const char* DataCache::Partition::CACHE_FILE_PREFIX = "impala-cache-file-"; static const char*? 
http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@75 PS4, Line 75: struct DataCache::CacheFile { Should this be a class, given it has a c'tor and d'tor? http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@173 PS4, Line 173: make_unique(path, fd); : cache_files_.emplace_back(std::move I think you can merge these two lines, which also reduces the risk that someone accidentally uses the now empty cache_file in a future change. http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@187 PS4, Line 187: vector nit: missing include, but we might generally omit this one. http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@186 PS4, Line 186: // Delete all existing backing files left over from previous runs. : vector entries; : RETURN_IF_ERROR(FileSystemUtil::Directory::GetEntryNames(path_, &entries, 0, : FileSystemUtil::Directory::EntryType::DIR_ENTRY_REG)); : for (const string& entry : entries) { : if (entry.find_first_of(CACHE_FILE_PREFIX) == 0) { : const string file_path = JoinPathSegments(path_
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2842/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 4 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 19 Apr 2019 06:10:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12987 to look at the new patch set (#4). Change subject: IMPALA-8341: Data cache for remote reads .. IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache backed by local storage. It implicitly relies on the OS page cache to shuffle data between memory and the storage device. This is useful for caching data read from remote filesystems (e.g. a remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on a configuration string, which is a comma-separated list of directories followed by the storage capacity per directory. An example configuration string:

--data_cache_config=/data/0,/data/1:150GB

With the configuration above, the cache may use up to 300GB of storage in total, with a 150GB cap for each of /data/0 and /data/1.

Each partition has a metadata cache which tracks the mappings of cache keys to the locations of the cached data. A cache key is a tuple of (file name, file modification time, file offset) and a cache entry is a tuple of (backing file, offset in the backing file, length of the cached data, optional checksum). Note that the cache currently doesn't support overlapping ranges: if the cache contains an entry of a file for the range [m, m+4MB), a lookup for [m+4K, m+8K) will miss. In practice we haven't seen this as a problem, but it may require further evaluation in the future.

Each partition stores its cached data in backing files created on local storage. When new data is inserted into the cache, it is appended to the backing file currently in use.
The storage consumption of each cache entry counts towards the quota of that partition. When a partition reaches its capacity, the least recently used (LRU) data in that partition is evicted. Evicted data is removed from the underlying storage by punching holes in the backing file it's stored in. Once a backing file reaches a certain size (4TB by default), new data stops being appended to it and a new file is created instead. Note that due to hole punching, the backing files are actually sparse.

Optionally, checksumming can be enabled to verify that data read from the cache is consistent with what was inserted, and that multiple insertion attempts with the same cache key have the same content. Checksumming is enabled by default in debug builds.

To probe for cached data, use the Lookup() interface; to insert data into the cache, use Store(). Please note that eviction currently happens inline during Store().

This patch also adds two startup flags for start-impala-cluster.py: '--data_cache_dir' specifies the base directory in which each Impalad creates the caching directory; '--data_cache_size' specifies the capacity string for each cache directory.

Testing done: a new BE test was added; core test with the cache enabled.

Perf:
- 16-stream TPC-DS at 3TB on a 20-node S3 cluster shows about 30% improvement over runs without the cache, with a 150GB cache on each node. The performance is at parity with an HDFS cluster using EBS as the storage.
Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/io/CMakeLists.txt A be/src/runtime/io/data-cache-test.cc A be/src/runtime/io/data-cache.cc A be/src/runtime/io/data-cache.h M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/runtime/io/hdfs-file-reader.h M be/src/runtime/io/request-context.h M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/impalad-metrics.cc M be/src/util/impalad-metrics.h M bin/start-impala-cluster.py M common/thrift/metrics.json 18 files changed, 1,706 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/4 -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 4 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2841/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 3 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 19 Apr 2019 03:04:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12987 to look at the new patch set (#3). Change subject: IMPALA-8341: Data cache for remote reads .. IMPALA-8341: Data cache for remote reads This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/ This change implements an IO data cache backed by local storage. It implicitly relies on the OS page cache to shuffle data between memory and the storage device. This is useful for caching data read from remote filesystems (e.g. remote HDFS data nodes, S3, ABFS, ADLS). A data cache is divided into one or more partitions based on the configuration string, which is a list of directories, separated by commas, followed by the storage capacity per directory. An example configuration string looks like the following: --data_cache_config=/data/0,/data/1:150GB In the configuration above, the cache may use up to 300GB of storage space, with a 150GB maximum for each of /data/0 and /data/1. Each partition has a metadata cache which tracks the mappings of cache keys to the locations of the cached data. A cache key is a tuple of (file name, file modification time, file offset) and a cache entry is a tuple of (backing file, offset in the backing file, length of the cached data, optional checksum). Note that the cache currently doesn't support overlapping ranges. In other words, if the cache contains an entry for a file's range [m, m+4MB), a lookup for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen this as a problem, but it may require further evaluation in the future. Each partition stores its cached data in backing files created on local storage. When inserting new data into the cache, the data is appended to the current backing file in use. 
The storage consumption of each cache entry counts towards the quota of that partition. When a partition reaches its capacity, the least recently used (LRU) data in that partition is evicted. Evicted data is removed from the underlying storage by punching holes in the backing file it's stored in. Once a backing file reaches a certain size (4TB by default), new data stops being appended to it and a new file is created instead. Note that due to hole punching, the backing file is actually sparse. Optionally, checksumming can be enabled to verify that data read from the cache is consistent with what was inserted, and that multiple attempted insertions with the same cache key have the same content. Checksumming is enabled by default for debug builds. To probe for cached data, the interface Lookup() is used; to insert data into the cache, the interface Store() is used. Please note that eviction currently happens inline during Store(). This patch also adds two startup flags for start-impala-cluster.py: '--data_cache_dir' specifies the base directory in which each Impalad creates its caching directory; '--data_cache_size' specifies the capacity string for each cache directory. Testing done: a new BE test was added; core tests with the cache enabled. Perf: a 16-stream TPC-DS run at 3TB on a 20-node S3 cluster shows about a 30% improvement over runs without the cache, with a cache size of 150GB per node. The performance is at parity with an HDFS cluster using EBS as the storage. 
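As a sketch of the configuration format above: the string "/data/0,/data/1:150GB" splits into a directory list and a single capacity suffix that applies to every listed directory. The parser below is illustrative only — the function and struct names are assumptions, not Impala's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Illustrative sketch (not Impala's implementation) of parsing a config
// string such as "/data/0,/data/1:150GB" into (directory, capacity)
// pairs, where the single capacity applies to every listed directory.
struct CachePartitionConfig {
  std::string dir;
  int64_t capacity_bytes;
};

inline int64_t ParseCapacity(const std::string& s) {
  // Supports a few binary suffixes for illustration.
  size_t pos = 0;
  int64_t value = std::stoll(s, &pos);
  std::string suffix = s.substr(pos);
  if (suffix == "KB") return value << 10;
  if (suffix == "MB") return value << 20;
  if (suffix == "GB") return value << 30;
  if (suffix == "TB") return value << 40;
  return value;  // plain bytes
}

inline std::vector<CachePartitionConfig> ParseDataCacheConfig(
    const std::string& config) {
  // Split "<dir>,<dir>,...:<capacity>" on the last ':'.
  size_t colon = config.rfind(':');
  int64_t capacity = ParseCapacity(config.substr(colon + 1));
  std::vector<CachePartitionConfig> partitions;
  std::stringstream dirs(config.substr(0, colon));
  std::string dir;
  while (std::getline(dirs, dir, ',')) {
    partitions.push_back({dir, capacity});
  }
  return partitions;
}
```

For "/data/0,/data/1:150GB" this yields two partitions of 150GB each, matching the 300GB total described in the commit message.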
Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/io/CMakeLists.txt A be/src/runtime/io/data-cache-test.cc A be/src/runtime/io/data-cache.cc A be/src/runtime/io/data-cache.h M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/runtime/io/hdfs-file-reader.h M be/src/runtime/io/request-context.h M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/impalad-metrics.cc M be/src/util/impalad-metrics.h M bin/start-impala-cluster.py M common/thrift/metrics.json 18 files changed, 1,706 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/3 -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 3 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipco
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 3: (8 comments) http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@88 PS2, Line 88: consolidatin > consolidating Done http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@89 PS2, Line 89: sparse > sparse Done http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@166 PS2, Line 166: : /// Utility function to verify that all partitions' consumption don't exceed their : /// quotas. Retur > clang-tidy failure: struct instead of class. Done http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@184 PS2, Line 184: > empty Done http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@296 PS2, Line 296: meta_cache_->Erase(key); : return true; : } : } > TODO: Add a metric for this. 
Done http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@343 PS2, Line 343: > insertion_offset + bytes_written Done http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py File bin/start-impala-cluster.py: http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@117 PS2, Line 117: > flake8: E703 statement ends with a semicolon Done http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@120 PS2, Line 120: > flake8: E703 statement ends with a semicolon Done -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 3 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 19 Apr 2019 02:20:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@343 PS2, Line 343: insertion_offset)); insertion_offset + bytes_written -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 2 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 17 Apr 2019 18:35:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@88 PS2, Line 88: cosolidating consolidating http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@89 PS2, Line 89: sprase sparse http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@184 PS2, Line 184: emtpy empty http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc File be/src/runtime/io/hdfs-file-reader.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@193 PS1, Line 193: return Status(TErrorCode::DISK_IO_ERROR, GetBackendString(), > Addressed the concern with larger entry in the new patch set. To answer the Two cases I can think of: * If a scan was not able to get its "ideal" reservation and is scanning with a smaller buffer size. This is hopefully rare. * With the new parquet page index support if different subset of pages are selected. I'm ok with deferring this but I think it is worth understanding if there's a bad interaction between this and page indexes. 
-- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 2 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 17 Apr 2019 06:58:00 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@147 PS1, Line 147: if (charge_len > capacity_) return false; > Given that we won't actually insert into the cache until after trying to wr This actually seems to be a problematic behavior. We may temporarily exceed the capacity limit as a result of this. -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 1 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 17 Apr 2019 05:57:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@296 PS2, Line 296: if (pending_insert_set_.size() >= FLAGS_data_cache_write_concurrency || : pending_insert_set_.find(key.ToString()) != pending_insert_set_.end()) { : return false; : } TODO: Add a metric for this. -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 2 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 16 Apr 2019 19:47:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 2: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/2802/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 2 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 16 Apr 2019 20:26:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 2: TODO: - add a test to rotate the files in data-cache-test.cc - add a test for the changes in filesystem-util.cc -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 2 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 16 Apr 2019 19:46:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/12987 to look at the new patch set (#2). Change subject: IMPALA-8341: Data cache for remote reads .. IMPALA-8341: Data cache for remote reads This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/ This change implements an IO data cache backed by local storage. It implicitly relies on the OS page cache to shuffle data between memory and the storage device. This is useful for caching data read from remote filesystems (e.g. remote HDFS data nodes, S3, ABFS, ADLS). A data cache is divided into one or more partitions based on the configuration string, which is a list of directories, separated by commas, followed by the storage capacity per directory. An example configuration string looks like the following: --data_cache_config=/data/0,/data/1:150GB In the configuration above, the cache may use up to 300GB of storage space, with a 150GB maximum for each of /data/0 and /data/1. Each partition has a metadata cache which tracks the mappings of cache keys to the locations of the cached data. A cache key is a tuple of (file name, file modification time, file offset) and a cache entry is a tuple of (backing file, offset in the backing file, length of the cached data, optional checksum). Note that the cache currently doesn't support overlapping ranges. In other words, if the cache contains an entry for a file's range [m, m+4MB), a lookup for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen this as a problem, but it may require further evaluation in the future. Each partition stores its cached data in backing files created on local storage. When inserting new data into the cache, the data is appended to the current backing file in use. 
The storage consumption of each cache entry counts towards the quota of that partition. When a partition reaches its capacity, the least recently used (LRU) data in that partition is evicted. Evicted data is removed from the underlying storage by punching holes in the backing file it's stored in. Once a backing file reaches a certain size (4TB by default), new data stops being appended to it and a new file is created instead. Note that due to hole punching, the backing file is actually sparse. Optionally, checksumming can be enabled to verify that data read from the cache is consistent with what was inserted, and that multiple attempted insertions with the same cache key have the same content. Checksumming is enabled by default for debug builds. To probe for cached data, the interface Lookup() is used; to insert data into the cache, the interface Store() is used. Please note that eviction currently happens inline during Store(). This patch also adds two startup flags for start-impala-cluster.py: '--data_cache_dir' specifies the base directory in which each Impalad creates its caching directory; '--data_cache_size' specifies the capacity string for each cache directory. Testing done: a new BE test was added; core tests with the cache enabled. Perf: a 16-stream TPC-DS run at 3TB on a 20-node S3 cluster shows about a 30% improvement over runs without the cache, with a cache size of 150GB per node. The performance is at parity with an HDFS cluster using EBS as the storage. 
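The per-partition quota accounting and LRU eviction described above can be modeled in a few dozen lines. This is a simplified, single-threaded sketch for illustration — no backing files, hole punching, or checksums — not Impala's actual implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

// Simplified model of one cache partition with a byte quota and LRU
// eviction. The real cache also tracks (backing file, offset) per entry
// and reclaims space by punching holes; here entries just carry a length.
class LruPartitionModel {
 public:
  explicit LruPartitionModel(int64_t capacity) : capacity_(capacity) {}

  // Insert an entry of `len` bytes, evicting LRU entries until it fits.
  // Returns false if the entry alone exceeds the whole quota.
  bool Store(const std::string& key, int64_t len) {
    if (len > capacity_) return false;
    auto it = entries_.find(key);
    if (it != entries_.end()) Erase(it);
    while (consumption_ + len > capacity_) {
      Erase(entries_.find(lru_.back()));  // evict least recently used
    }
    lru_.push_front(key);
    entries_[key] = {lru_.begin(), len};
    consumption_ += len;
    return true;
  }

  // Returns the entry length on a hit (moving it to the MRU end), -1 on
  // a miss.
  int64_t Lookup(const std::string& key) {
    auto it = entries_.find(key);
    if (it == entries_.end()) return -1;
    lru_.splice(lru_.begin(), lru_, it->second.pos);  // mark as MRU
    return it->second.len;
  }

  int64_t consumption() const { return consumption_; }

 private:
  struct Entry {
    std::list<std::string>::iterator pos;
    int64_t len;
  };
  void Erase(std::unordered_map<std::string, Entry>::iterator it) {
    consumption_ -= it->second.len;
    lru_.erase(it->second.pos);
    entries_.erase(it);
  }
  int64_t capacity_;
  int64_t consumption_ = 0;
  std::list<std::string> lru_;  // front = most recently used
  std::unordered_map<std::string, Entry> entries_;
};
```

Note how eviction happens inline in Store(), as the commit message describes for the real cache.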
Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/io/CMakeLists.txt A be/src/runtime/io/data-cache-test.cc A be/src/runtime/io/data-cache.cc A be/src/runtime/io/data-cache.h M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/runtime/io/hdfs-file-reader.h M be/src/runtime/io/request-context.h M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/impalad-metrics.cc M be/src/util/impalad-metrics.h M bin/start-impala-cluster.py M common/thrift/metrics.json 17 files changed, 1,467 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/2 -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 2 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: (71 comments) This new patch changes the configuration string to use a uniform capacity quota for all directories. http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG@24 PS1, Line 24: is a tuple of (file's name, file's modification time, file offset) > So this means that you can get a partial cache hit if the offsets match but Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc@948 PS1, Line 948: data_cache_miss_bytes_->Set(reader_context_->data_cache_miss_bytes()); > I think for the above counters we figured that this pattern of copying the Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc File be/src/runtime/io/data-cache-test.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@52 PS1, Line 52: // The callback is invoked by each thread in the multi-threaded tests below. > callback reads like it's called when something is done, how about ThreadFn? Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@157 PS1, Line 157: // Read the same same range inserted previously and they should still all in the cache. > nit: be Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@196 PS1, Line 196: // Create a buffer way larger than the cache. > Why does it need to be larger? Typo. Fixed. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@201 PS1, Line 201: ASSERT_TRUE(cache.Store("foobar", 12345, 0, large_buffer.get(), cache_size)); > Use variables instead of inline constants? 
Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@220 PS1, Line 220: const int64_t cache_size = 1 << 22; > I find these easier to read in the form of 4 * 1024 * 1024 Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@248 PS1, Line 248: differen > typo Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@301 PS1, Line 301: ootprint) { > Can you add a comment what this test does? Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@330 PS1, Line 330: int main(int argc, char **argv) { > This is not needed anymore if you add your test to the unified backend test Prefer to keep this test as stand-alone to make development easier. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@18 PS1, Line 18: #ifndef IMPALA_RUNTIME_IO_DATA_CACHE_H > Could use #pragma once instead of include guards Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@53 PS1, Line 53: /// punching holes in the backing file it's stored in. As a backing file reaches a certain > It's actually a TODO item to retire older files and copy what's left in the Added a check for filesystem support for hole punching in the new patch. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@55 PS1, Line 55: /// created instead. Note that due to hole punching, the backing file is actually sparse. > This might be a case where a simple ASCII diagram would illustrate the conc Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@56 PS1, Line 56: /// > It's by design that we don't support overlapping range for the first implem Added some explanation in the header comment. 
http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@63 PS1, Line 63: /// happens inline currently during Store(). > It's worth documenting the PAGE_SIZE behaviour since it implies that there' Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@76 PS1, Line 76: class DataCache{ > style nit: missing space before { Done http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@81 PS1, Line 81: DataCache(const std::string& config) : config_(config) { } > totally doesn't matter here, but general best practice in C++11 is to have Thanks for the reminder. Almost every time, I have to look up in the internet for the advantage of pass-by-value-then-move over pass-by-reference. It's subtle but it makes sense. We should probably start sticking to this idiom more widely in Impala code base. May be a clang-tidy rule will help ?! Also totally irrelevant but that's an area where I find passing by pointer in C is sometimes easier to use or understand than C++.
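For reference, the pass-by-value-then-move idiom discussed above looks like this; DataCacheConfig is an illustrative stand-in, not the actual DataCache constructor from the patch:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Illustrative only: the constructor takes its string by value and moves
// it into the member. A caller passing an lvalue pays one copy; a caller
// passing a temporary pays only moves. A const-reference parameter would
// force a copy inside the constructor in both cases.
class DataCacheConfig {
 public:
  explicit DataCacheConfig(std::string config) : config_(std::move(config)) {}
  const std::string& config() const { return config_; }

 private:
  std::string config_;
};
```

With `DataCacheConfig c(BuildConfigString());` the temporary is moved all the way into `config_`, which is the subtle advantage over `const std::string&` mentioned in the comment above.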
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py File bin/start-impala-cluster.py: http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@117 PS2, Line 117: ; flake8: E703 statement ends with a semicolon http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@120 PS2, Line 120: ; flake8: E703 statement ends with a semicolon -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 2 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 16 Apr 2019 19:41:39 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: > (1 comment) > > > I think the _IO_ apportioned relative to capacity may be fine as > that's a reasonable expectation if a user specifies more capacity > on a slower device. One thing to consider may be to use multiple > backing files for larger partition to avoid per-file lock problem. > > I think, unfortunately, the lower capacity device is most likely to > be the faster one (eg a small SSD plus big HDDs). > > Maybe we can simplify this for now by just requiring that all > partitions of the cache be allocated the same amount of space? The > most immediate scenario I can see is on Amazon instances like > r5d.4xlarge (two local SSDs with the same size) where you'd want to > give the same amount of capacity to each cache drive anyway. > > Put another way, I think we should constrain the configuration > space in such a way that only good configs are possible, rather > than hope that users don't do something like: > /ssd/:100G,/hdd1:1TB,/hdd2:1TB,/hdd3:1TB > and find that their SSD's fast IO performance is basically ignored. Is it important that the caches fill up at the same ratio (e.g. all disks at 20%)? Or can we pick each disk with the probability of its throughput to optimize for combined bandwidth? Then the SSD would fill up first with 6x more data on the SSD as on each of the HDD (for 6x throughput ratio). Then we could either adjust that by a constant factor through experiments, or add an expert level option to allow admins to specify the throughput ratio between the disks. 
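The throughput-proportional selection sketched in the question above could look like the following. The function is hypothetical and was not part of the patch; the uniform random draw is passed in as a parameter so the choice is deterministic and testable:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pick a partition index with probability proportional to its weight
// (e.g. measured or configured disk throughput). `r` is a uniform random
// number in [0, 1). Hypothetical sketch; not part of the actual patch.
inline size_t WeightedPick(const std::vector<double>& weights, double r) {
  double total = 0;
  for (double w : weights) total += w;
  double target = r * total;
  double cumulative = 0;
  for (size_t i = 0; i < weights.size(); ++i) {
    cumulative += weights[i];
    if (target < cumulative) return i;
  }
  return weights.size() - 1;  // guard against rounding when r is near 1.0
}
```

With weights {6, 1, 1} (one SSD with ~6x the throughput of each of two HDDs), the SSD receives roughly 75% of insertions, matching the 6x data ratio described above.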
-- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 1 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 12 Apr 2019 16:10:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: (1 comment) > I think the _IO_ apportioned relative to capacity may be fine as that's a > reasonable expectation if a user specifies more capacity on a slower device. > One thing to consider may be to use multiple backing files for larger > partition to avoid per-file lock problem. I think, unfortunately, the lower capacity device is most likely to be the faster one (eg a small SSD plus big HDDs). Maybe we can simplify this for now by just requiring that all partitions of the cache be allocated the same amount of space? The most immediate scenario I can see is on Amazon instances like r5d.4xlarge (two local SSDs with the same size) where you'd want to give the same amount of capacity to each cache drive anyway. Put another way, I think we should constrain the configuration space in such a way that only good configs are possible, rather than hope that users don't do something like: /ssd/:100G,/hdd1:1TB,/hdd2:1TB,/hdd3:1TB and find that their SSD's fast IO performance is basically ignored. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79 PS1, Line 79: // Unlink the file so the disk space is recycled once the process exits. > Thanks for the feedback. I also considered the approach of cleaning up the In terms of preventing the "wiping" of a directory if a previous Impala's already running, we can always use advisory locks. kudu::Env::Default()->LockFile() can do this for you pretty easily. One question I don't know the answer to: there might be an advantage to operating on a deleted inode in performance. It may be that XFS or ext4 has optimizations that kick in when a file is unlinked. 
For example, normally, every write to a file will update the file's mtime, which requires writing to the filesystem journal, etc. (i've often seen stacks blocked in file_update_time() waiting on a jbd2 lock in the kernel under heavy IO). I'm not sure if this is optimized or not but you could certainly imagine that, for an unlinked file, journaling would be skipped, etc, since the goal is that on a crash we don't need to restore that file. May also be that this optimization isn't really implemented in practice :) I looked through the kernel source for a bit to see if I could find such an optimization but wasn't one obviously present. -- To view, visit http://gerrit.cloudera.org:8080/12987 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc Gerrit-Change-Number: 12987 Gerrit-PatchSet: 1 Gerrit-Owner: Michael Ho Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 12 Apr 2019 16:00:21 + Gerrit-HasComments: Yes
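The advisory-lock idea mentioned above (kudu::Env::Default()->LockFile() being the Kudu utility) can be sketched with plain flock(2); this standalone version is illustrative and POSIX-only, not the code the patch would use:

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Sketch of guarding a cache directory with an advisory lock so a second
// impalad can't wipe files a running one is still using. Uses raw
// flock(2) for illustration; Kudu's Env::LockFile() wraps similar
// functionality. Returns the lock fd on success, -1 if another open file
// description already holds the lock.
inline int TryLockCacheDir(const char* lock_path) {
  int fd = open(lock_path, O_RDWR | O_CREAT, 0644);
  if (fd < 0) return -1;
  // Non-blocking exclusive lock: fails immediately if already held.
  if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
    close(fd);
    return -1;  // another process (or descriptor) owns the directory
  }
  return fd;  // keep the fd open for the process lifetime to hold the lock
}
```

The lock is released automatically when the fd is closed or the process exits, so a crashed impalad never leaves the directory permanently locked.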
David Rorke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: > (3 comments) > > > Another option would be to do a global hashtable and have the > > entries directly identify their partitions (in fact I think they > > already do via the CacheFile pointer, right?). Eviction still > needs > > to be per-partition based on that partition's capacity, but > > insertion can try to more smartly allocate across drives. > > > > Yes, I had a similar question when thinking over the patch over the > weekend. I tried the approach of weighted sampling + a global LRU > cache but stopped short of coming up with a way to maintain the > quota per partition with a global LRU list so I abandoned it and > kind of punted on this problem for now. May be it's okay to do LRU > per partition. Should have documented my thought there in the code. > > I think the _IO_ apportioned relative to capacity may be fine as > that's a reasonable expectation if a user specifies more capacity > on a slower device. One thing to consider may be to use multiple > backing files for larger partition to avoid per-file lock problem. > > > Last thought: since there are some tricky design decisions above, > > maybe we can simplify this and just support a single partition > for > > now, and come back to supporting multi-partition? > > It may be reasonable to punt on it for now. Let me see how much > work it is to switch over to the global hash-table + per-partition > LRU approach. My main concern with a single partition in the short term is whether we can get enough IOPS from a single spinning disk per node. For cloud deployments we should definitely try to use SSD for the cache, but if we expect this to be used for on-prem compute clusters it's likely that many of those will have no SSD but multiple spinning disks. 
-- Gerrit-Comment-Date: Thu, 11 Apr 2019 19:14:32
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: (3 comments) > Another option would be to do a global hashtable and have the > entries directly identify their partitions (in fact I think they > already do via the CacheFile pointer, right?). Eviction still needs > to be per-partition based on that partition's capacity, but > insertion can try to more smartly allocate across drives. > Yes, I had a similar question when thinking over the patch over the weekend. I tried the approach of weighted sampling + a global LRU cache but stopped short of coming up with a way to maintain the quota per partition with a global LRU list, so I abandoned it and kind of punted on this problem for now. Maybe it's okay to do LRU per partition. I should have documented my thoughts there in the code. I think the _IO_ apportioned relative to capacity may be fine as that's a reasonable expectation if a user specifies more capacity on a slower device. One thing to consider may be to use multiple backing files for larger partitions to avoid the per-file lock problem. > Last thought: since there are some tricky design decisions above, > maybe we can simplify this and just support a single partition for > now, and come back to supporting multi-partition? It may be reasonable to punt on it for now. Let me see how much work it is to switch over to the global hash-table + per-partition LRU approach. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@53 PS1, Line 53: /// punching holes in the backing file it's stored in. As a backing file reaches a certain > This does mean that the number of backing files could grow unbounded, right It's actually a TODO item to retire older files and copy what's left in them to the current file. 
I didn't get around to doing that in this first version. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@56 PS1, Line 56: /// > We should mention in the comment how lookup works, in particular what happe It's by design that we don't support overlapping ranges for the first implementation. It seems to fare pretty well with the TPC-DS workload and parquet file format we are using. One of the TODOs for the future is to measure the overlapping case and consider handling it if it's actually common. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79 PS1, Line 79: // Unlink the file so the disk space is recycled once the process exits. > I'm not sure how I feel about this - it could cause a lot of confusion havi Thanks for the feedback. I also considered the approach of cleaning up the directory on startup, but this seems to make certain assumptions about whether the directory specified by the user for caching is exclusively used by the Impalad process. While this is not a common scenario right now, I can see there may be configurations in the future where one may run multiple Impala containers on the same host and these containers may belong to the same cluster. So, wiping out a particular directory on restart isn't exactly safe. Another alternative would be to include the hostname / IP address + port number as a unique name for the file so we will automatically truncate the file on restart. However, there is no guarantee that an Impalad will restart with the same IP address / hostname after a crash, especially in a public cloud setting. That said, one can also argue that this is a configuration issue and those Impala containers shouldn't be sharing directories etc. 
The feedback so far suggests that the "hidden" disk usage seems even more undesirable, so maybe wiping out the directory at startup is not as bad, and we can just forbid sharing of caching directories by multiple Impalads on the same host. -- Gerrit-Comment-Date: Thu, 11 Apr 2019 17:13:23
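For readers following the design back-and-forth: the "global hash table + per-partition LRU" shape being weighed above can be sketched roughly as follows. This is an illustrative toy, not Impala's actual DataCache; ShardedLruCache and all member names are made up, and it assumes keys are not re-inserted (a real cache would handle that):

```cpp
// One global map serves all lookups; each entry remembers its partition,
// and quota/eviction are enforced per partition, so one partition's churn
// never evicts another partition's data.
#include <cstdint>
#include <functional>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

class ShardedLruCache {
 public:
  explicit ShardedLruCache(std::vector<int64_t> capacities)
      : partitions_(capacities.size()) {
    for (size_t i = 0; i < capacities.size(); ++i) {
      partitions_[i].capacity = capacities[i];
    }
  }

  // Picks the partition by hash, then evicts that partition's LRU entries
  // until 'charge' bytes fit. Assumes 'key' is not already present.
  void Insert(const std::string& key, int64_t charge) {
    size_t p = std::hash<std::string>()(key) % partitions_.size();
    Partition& part = partitions_[p];
    while (part.used + charge > part.capacity && !part.lru.empty()) {
      map_.erase(part.lru.back().key);
      part.used -= part.lru.back().charge;
      part.lru.pop_back();
    }
    part.lru.push_front({key, charge});
    part.used += charge;
    map_[key] = {p, part.lru.begin()};
  }

  // Global lookup; on a hit, moves the entry to the front of its
  // partition's LRU list ("touch").
  bool Lookup(const std::string& key) {
    auto it = map_.find(key);
    if (it == map_.end()) return false;
    Partition& part = partitions_[it->second.partition];
    part.lru.splice(part.lru.begin(), part.lru, it->second.pos);
    return true;
  }

 private:
  struct Entry { std::string key; int64_t charge; };
  struct Partition { int64_t capacity = 0; int64_t used = 0; std::list<Entry> lru; };
  struct Handle { size_t partition; std::list<Entry>::iterator pos; };
  std::unordered_map<std::string, Handle> map_;
  std::vector<Partition> partitions_;
};
```

The point of the structure is visible in the types: the hash table is global, but capacity accounting (`used` vs. `capacity`) and the LRU list both live inside `Partition`.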
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: After sleeping on this patch, had a few more higher-level thoughts: 1- we should probably periodically check disk space on the allocated partition and if there is a limited amount remaining we should stop using that partition for cache. In the future maybe we would dynamically evict stuff if we're about to run out of space but I think a simple "stop adding to the problem" is a good starting point. We should also handle ENOSPC gracefully - not sure if we do today. 2- currently we hash the cache keys to pick a partition, and then each partition is separately LRU. That should work OK if the partitions are allocated equal sizes. In the case that they aren't, however, this will have a negative effect on hit ratio. For example, consider the limit where one partition has only 10MB of capacity whereas a second partition has 1TB. If we assume that 10MB of cache provides a very low hit rate, then half of our cache queries will hit the low-hit-rate cache and we'll be limited at close to 50% hit rate for the combined cache. A few different solutions come to mind: (a) instead of hash-mod to pick a partition, we could use the hash code to place a token in a hash space sized based on capacity. e.g. in the above example, the first partition owns hash values 0-10MB and the second partition owns hash values 10MB-(1TB+10MB). Upon hashing a key, we mod by the total capacity and pick a partition with weight relative to capacity. One downside of this approach is that the _IO_ usage will end up apportioned relative to capacity. That may be especially bad considering smaller devices are likely to be faster (a well-meaning user might allocate /data/ssd:100GB,/data/hdd:1TB but now they're getting 10x the IO pushed on their HDD instead of the SSD). 
Another option would be to do a global hashtable and have the entries directly identify their partitions (in fact I think they already do via the CacheFile pointer, right?). Eviction still needs to be per-partition based on that partition's capacity, but insertion can try to more smartly allocate across drives. Last thought: since there are some tricky design decisions above, maybe we can simplify this and just support a single partition for now, and come back to supporting multi-partition? -- Gerrit-Comment-Date: Thu, 11 Apr 2019 16:20:55
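The capacity-weighted placement in option (a) above can be sketched as follows. PickPartition is an illustrative name and the linear scan over partitions is a simplification (real code would precompute cumulative ranges); the idea is exactly the one described: hash into [0, total capacity) and let each partition own a slice proportional to its size:

```cpp
// Hash a key into the combined capacity space; a partition receives keys
// in proportion to its share of the total capacity. Illustrative only.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

int PickPartition(const std::string& key, const std::vector<int64_t>& capacities) {
  int64_t total = 0;
  for (int64_t c : capacities) total += c;
  // Point in [0, total); walk the cumulative ranges to find its owner.
  int64_t point =
      static_cast<int64_t>(std::hash<std::string>()(key) % static_cast<uint64_t>(total));
  for (size_t i = 0; i < capacities.size(); ++i) {
    if (point < capacities[i]) return static_cast<int>(i);
    point -= capacities[i];
  }
  return static_cast<int>(capacities.size()) - 1;  // not reached
}
```

This makes Todd's downside concrete: with capacities {100GB SSD, 1TB HDD}, roughly 10 of every 11 keys (and therefore that share of the IO) land on the HDD.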
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc File be/src/runtime/io/disk-io-mgr.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@240 PS1, Line 240: get > I'm on your side on this one but some people actually disagreed with me and yea, I guess I go this way because clang-tidy has this check which we use in Kudu, so I'm used to it getting flagged: https://clang.llvm.org/extra/clang-tidy/checks/readability-redundant-smartptr-get.html -- Gerrit-Comment-Date: Wed, 10 Apr 2019 23:50:17
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc File be/src/runtime/io/disk-io-mgr.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@240 PS1, Line 240: get > redundant get() != nullptr -- unique_ptr implicitly casts to bool in the wa I'm on your side on this one but some people actually disagreed with me and I think we effectively called a truce in the great "whether to use .get() in null checks" debate. Similar to American vs Commonwealth English. It would be nice to have a consensus going forward - maybe people's feelings have changed. -- Gerrit-Comment-Date: Wed, 10 Apr 2019 21:51:47
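For readers outside the truce: the two styles under debate are behaviorally identical, because std::unique_ptr defines an explicit operator bool (and comparison with nullptr); the .get() call is what the clang-tidy check flags as redundant. A tiny self-contained illustration:

```cpp
// The two equivalent null-check styles from the review comment above.
#include <memory>

bool HasValueWithGet(const std::unique_ptr<int>& p) {
  return p.get() != nullptr;  // style flagged by readability-redundant-smartptr-get
}

bool HasValueImplicit(const std::unique_ptr<int>& p) {
  return p != nullptr;  // or simply: if (p) { ... } via explicit operator bool
}
```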
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: (37 comments) Didn't look through the tests yet, but here are some comments on the code. Overall looking pretty good! Like the thorough docs. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@76 PS1, Line 76: class DataCache{ style nit: missing space before { http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@81 PS1, Line 81: DataCache(const std::string& config) : config_(config) { } totally doesn't matter here, but general best practice in C++11 is to have such parameters be passed by value, and then assign in the initializer list using std::move: DataCache(std::string config) : config_(std::move(config)) {} since it avoids a copy in the case that you pass a temporary or other rvalue reference into the parameter. This is more important with big heavy structures and things called in hot paths of course... I don't really care too much here, but figured I'd note it. In the Kudu codebase our clang-tidy bot catches and complains about these. On a separate note: this should be marked explicit. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@108 PS1, Line 108: /// Inserts a new cache entry by copying the content in 'buffer' into the cache. Can you specify whether this is synchronous or not? i.e. should the caller expect this might take a long time (and thus not want to be holding locks, etc)? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110 PS1, Line 110: /// is installed successfully. 
worth documenting under what cases the data wouldn't be stored successfully http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@127 PS1, Line 127: class Partition : public kudu::Cache::EvictionCallback { This could be defined in the .cc file and just forward-declared here, right? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@162 PS1, Line 162: struct CacheFile { Consider making a destructor here responsible for closing fd if it's been set (and make it non-copyable so that's safe) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@208 PS1, Line 208: std::vector<CacheFile> cache_files_; It seems like CacheEntry objects point to CacheFile objects, but here the CacheFiles can be reallocated when the vector resizes. Would need to make this a vector<std::unique_ptr<CacheFile>> to fix. I'm guessing the test isn't covering creation of enough CacheFiles to trigger the vector reallocation. Might be worth such a stress test. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@236 PS1, Line 236: /// 'read' is true iff this function is called from Lookup(). This function is : /// also called from Store(), in which case, 'read' is false. nit: I'm not a big fan of parameters that get documented like this. From reading this I have no idea what it actually does, or why I'd set it to true or false. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@239 PS1, Line 239: /// Please note that this function will CHECK fail if there is any checksum mismatch. Can we make checksum errors get treated as cache misses and remove the entry after logging? It seems like we shouldn't be fatal if the disk has some flipped bit. (maybe DCHECK for debug builds to catch bugs) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@254 PS1, Line 254: std::unique_ptr<faststring>* key); why not just take a 'faststring*' here directly, since it's mutable? 
http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@44 PS1, Line 44: using namespace impala; : using namespace impala::io; why do you need these? The class is already in impala::io so seems like the using directives aren't really necessary (and in general I think style guide discourages them) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@57 PS1, Line 57: "(Advanced) Enable checksumming for the cached buffer."); do you have any sense of the overhead here? Checksums with CRC32C are almost 8 bytes/cycle (so <500us to checksum an 8MB read at 2.5GHz). I think we should probably default them to on for safety's sake http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@62 PS1, Line 62: static const int64_t PAGE_SIZE = 1L << 12; any reason for this to be configurable? e.g. on SSD I think erase blocks are typically 64KB so maybe there's a benefit to writing 64KB-aligned? not sure. http://gerrit.cloudera
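The reallocation hazard flagged above (data-cache.h line 208) is easy to demonstrate in isolation: raw pointers into a std::vector&lt;T&gt; are invalidated when the vector grows, while the suggested extra unique_ptr indirection keeps each element's address stable because the heap object itself never moves. A minimal sketch (the CacheFile struct here is a stand-in, not Impala's):

```cpp
// Stable element addresses via vector<unique_ptr<T>>: the vector's own
// buffer may still reallocate, but it only moves the unique_ptrs; the
// pointed-to objects stay put, so outstanding CacheFile* stay valid.
#include <memory>
#include <vector>

struct CacheFile { int fd = -1; };

// With std::vector<CacheFile>, returning &files[i] would dangle after any
// push_back that reallocates. With the unique_ptr layer it is safe.
CacheFile* StableElement(std::vector<std::unique_ptr<CacheFile>>& files, size_t i) {
  return files[i].get();
}
```

This is why Todd also suggests a stress test that creates enough CacheFiles to force at least one reallocation.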
David Rorke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79 PS1, Line 79: // Unlink the file so the disk space is recycled once the process exits. > I'm not sure how I feel about this - it could cause a lot of confusion havi Agree and I also think there are some legitimate reasons that people might need to be able to access the files by name including capturing the file for debugging in support cases. -- Gerrit-Comment-Date: Wed, 10 Apr 2019 21:02:39
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: (13 comments) I did a quick pass just trying to understand the big picture. Looks like Lars beat me to some things. http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG@24 PS1, Line 24: is a tuple of (file's name, file's modification time, file offset) So this means that you can get a partial cache hit if the offsets match but lengths differ, but not if the offsets differ. I guess this is a consequence of the cache being keyed this way (one alternative would be to key by name/mod and then have some kind of sorted map of cache entries per file). It seems like this implementation simplifies things and probably the benefit of supporting more partial cache hits would be minimal. Might be worth briefly mentioning this fact and any reasoning behind it, for posterity. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc@948 PS1, Line 948: data_cache_miss_bytes_->Set(reader_context_->data_cache_miss_bytes()); I think for the above counters we figured that this pattern of copying the value out of the counter at the end was bad. Instead we could just pass in the counter to reader_context_ and have it updated live. E.g. see reader_context_->set_bytes_read_counter() http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@18 PS1, Line 18: #ifndef IMPALA_RUNTIME_IO_DATA_CACHE_H Could use #pragma once instead of include guards http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@53 PS1, Line 53: /// punching holes in the backing file it's stored in. 
As a backing file reaches a certain This does mean that the number of backing files could grow unbounded, right? E.g. if you end up with one or more long-lived cache entries per file. This doesn't seem likely to be a problem in practice since we should be able to have a reasonable number of open file descriptors without issue but may be worth noting. I suspect trying to compact the files would be overengineering things but it may be worth thinking about whether this could be a problem. Also, this requires that the filesystem supports hole punching, right? - IIRC there are some that don't. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@55 PS1, Line 55: /// created instead. Note that due to hole punching, the backing file is actually sparse. This might be a case where a simple ASCII diagram would illustrate the concept nicely (although the text is pretty clear on its own). http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@63 PS1, Line 63: /// happens inline currently during Store(). It's worth documenting the PAGE_SIZE behaviour since it implies that there's a minimum granularity of data that clients of the cache should be caching. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@105 PS1, Line 105: space http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110 PS1, Line 110: /// is installed successfully. Is it interesting to document reasons why the content may not be installed? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79 PS1, Line 79: // Unlink the file so the disk space is recycled once the process exits. I'm not sure how I feel about this - it could cause a lot of confusion having all this "hidden" disk usage. 
It seems like TmpFileMgr's approach of deleting the existing scratch on startup has worked out ok so might be worth considering. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@158 PS1, Line 158: // TODO: handle cases in which 'store_len' is longer than existing entry; It's probably worth documenting this behaviour in the header - it's a little surprising. I guess we're really hoping that we don't have these collisions too often? Since we could end up in a situation where we repeatedly miss the cache for the latter part of a data range and never actually store the data. Or does the partial read handling in the I/O manager avoid this? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc File be/src/runtime/io/disk-io-mgr.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@295 PS1, Line 295: LOG(ERROR) << "Failed to initialize data cache."; > Have you considered making this fatal? Should
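The hole-punching eviction discussed in the data-cache.h comments maps onto Linux's fallocate(2) with FALLOC_FL_PUNCH_HOLE. A minimal sketch (PunchHole is an illustrative wrapper, not Impala's code; as Tim notes, not every filesystem supports this and the call can fail with EOPNOTSUPP):

```cpp
// Releases the blocks backing [offset, offset+len) of an open file
// without changing its logical size, leaving the file sparse. Reads of
// the punched range return zeros. Linux-specific.
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

bool PunchHole(int fd, off_t offset, off_t len) {
  // KEEP_SIZE is required with PUNCH_HOLE: the file length is unchanged,
  // only the underlying allocation (st_blocks) shrinks.
  return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len) == 0;
}
```

This also explains the "sparse file" caveat in the header comment: after many evictions, `ls -l` (logical size) and `du` (allocated blocks) of a backing file diverge, which is worth knowing when debugging disk usage.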
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: (23 comments) I think it would be good to have some tests that make sure that data gets freed up on the disk by the hole punching mechanism as expected. Additionally, maybe a test to make sure that mtime and other parts of the key are properly considered would be nice to have. http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc File be/src/runtime/io/data-cache-test.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@52 PS1, Line 52: // The callback is invoked by each thread in the multi-threaded tests below. callback reads like it's called when something is done, how about ThreadFn? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@157 PS1, Line 157: // Read the same same range inserted previously and they should still all in the cache. nit: be http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@196 PS1, Line 196: // Create a buffer way larger than the cache. Why does it need to be larger? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@201 PS1, Line 201: ASSERT_TRUE(cache.Store("foobar", 12345, 0, large_buffer.get(), cache_size)); Use variables instead of inline constants? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@220 PS1, Line 220: const int64_t cache_size = 1 << 22; I find these easier to read in the form of 4 * 1024 * 1024 http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@248 PS1, Line 248: differen typo http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@301 PS1, Line 301: ootprint) { Can you add a comment what this test does? 
http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@330 PS1, Line 330: int main(int argc, char **argv) { This is not needed anymore if you add your test to the unified backend tests (but better confirm with Joe). http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h File be/src/runtime/io/data-cache.h: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@56 PS1, Line 56: /// We should mention in the comment how lookup works, in particular what happens when we look up data that is a sub-piece of something in the cache and the lookup by offset wouldn't work (i.e. state that only exact matches are implemented) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110 PS1, Line 110: /// is installed successfully. Maybe point out explicitly that it returns false for errors and if the content was already in the cache? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@129 PS1, Line 129: /// Creates a partition at the given directory 'path' with quota 'capacity'. Mention that capacity is bytes? Maybe even add a suffix that indicates it's bytes (and not number of entries, here and elsewhere) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@157 PS1, Line 157: void EvictedEntry(kudu::Slice key, kudu::Slice value) override; Can you add the virtual and override keywords to all virtual methods to make it obvious which ones are part of the interface? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@226 PS1, Line 226: /// failure. Please note that the backing file is unlinked right after it's created How will these files look to an administrator? Will this make it harder to debug out-of-space issues? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@239 PS1, Line 239: /// Please note that this function will CHECK fail if there is any checksum mismatch. 
Does that mean "crash a release build"? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@254 PS1, Line 254: std::unique_ptr<faststring>* key); You could return the pointer by value http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@66 PS1, Line 66: DRAM_CACHE Should we add DISK_CACHE here? Otherwise, can you add a comment to explain why we use DRAM_CACHE? http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@193 PS1, Line 193: ssize_t bytes_written = 0; declare right before usage http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@222 PS1, Line 222: done: You could use a scoped cleanup lambda and returns instead of goto http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@262 PS1, Line 262: for (const string& cache_config : all_cache_configs) { gutil has Split
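The "scoped cleanup lambda" Lars suggests (in place of the `done:` goto label at data-cache.cc line 222) is a small RAII guard that runs a lambda on every exit path. Impala/Kudu have their own helpers for this; the standalone version below is just for illustration, with made-up names (ScopeGuard, WriteEntry):

```cpp
// RAII scope guard: the lambda runs when the guard goes out of scope,
// whichever return path is taken, so no goto/label is needed.
#include <functional>
#include <utility>

class ScopeGuard {
 public:
  explicit ScopeGuard(std::function<void()> f) : f_(std::move(f)) {}
  ~ScopeGuard() { if (f_) f_(); }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;
 private:
  std::function<void()> f_;
};

// Hypothetical function mirroring the goto-based cleanup pattern under
// review: every early return still runs the cleanup.
bool WriteEntry(int fd, bool fail_early, int* cleanups_run) {
  ScopeGuard guard([&] { ++*cleanups_run; /* e.g. release buffers/locks */ });
  if (fail_early) return false;  // cleanup runs here
  (void)fd;
  return true;                   // ...and here
}
```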
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/2722/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- Gerrit-Comment-Date: Wed, 10 Apr 2019 17:46:06
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 ) Change subject: IMPALA-8341: Data cache for remote reads .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/request-context.h File be/src/runtime/io/request-context.h: http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/request-context.h@161 PS1, Line 161: int data_cache_partial_hit_count() const { return data_cache_partial_hit_count_.Load(); } line too long (91 > 90) -- Gerrit-Comment-Date: Wed, 10 Apr 2019 17:13:26
Michael Ho has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12987 Change subject: IMPALA-8341: Data cache for remote reads .. IMPALA-8341: Data cache for remote reads This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/ This change implements an IO data cache which is backed by local storage. It implicitly relies on the OS page cache management to shuffle data between memory and the storage device. This is useful for caching data read from remote filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS). A data cache is divided into one or more partitions based on the configuration string, which is a list of <directory>:<capacity> pairs separated by commas. An example configuration string: --data_cache_config=/data/0:150GB,/data/1:150GB Each partition has a meta-data cache which tracks the mappings of cache keys to the locations of the cached data. A cache key is a tuple of (file's name, file's modification time, file offset) and a cache entry is a tuple of (backing file, offset in the backing file, length of the cached data, optional checksum). Each partition stores its set of cached data in backing files created on local storage. When inserting new data into the cache, the data is appended to the current backing file in use. The storage consumption of each cache entry counts towards the quota of that partition. When a partition reaches its capacity, the least recently used (LRU) data in that partition is evicted. Evicted data is removed from the underlying storage by punching holes in the backing file it's stored in. As a backing file reaches a certain size (by default 4TB), new data will stop being appended to it and a new file will be created instead. Note that due to hole punching, the backing file is actually sparse. Optionally, checksumming can be enabled to verify that data read from the cache is consistent with what was inserted, and to verify that multiple attempted insertions with the same cache key have the same cache content. 
To probe for data in the cache, the interface Lookup() is used; to insert data into the cache, the interface Store() is used. Please note that eviction currently happens inline during Store(). Testing done: a new BE test was added; core test with cache enabled. Perf: - 16-stream TPC-DS at 3TB on a 20-node S3 cluster shows about 30% improvement over runs without the cache. Each node has a cache size of 150GB. Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/io/CMakeLists.txt A be/src/runtime/io/data-cache-test.cc A be/src/runtime/io/data-cache.cc A be/src/runtime/io/data-cache.h M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/runtime/io/hdfs-file-reader.cc M be/src/runtime/io/hdfs-file-reader.h M be/src/runtime/io/request-context.h M be/src/util/impalad-metrics.cc M be/src/util/impalad-metrics.h M common/thrift/metrics.json 14 files changed, 1,115 insertions(+), 9 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/1
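As a closing editorial note: the --data_cache_config format in this patch set ("directory:capacity" pairs separated by commas) is simple to parse. A hedged sketch, not Impala's actual parser; ParseDataCacheConfig is a made-up name and the suffix handling (GB/MB only, binary units) is a simplification:

```cpp
// Parses "/data/0:150GB,/data/1:150GB" into (path, capacity_bytes) pairs.
// Illustrative only; a real parser would validate paths and reject
// malformed entries instead of skipping them.
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, int64_t>> ParseDataCacheConfig(
    const std::string& config) {
  std::vector<std::pair<std::string, int64_t>> partitions;
  std::stringstream ss(config);
  std::string part;
  while (std::getline(ss, part, ',')) {
    size_t colon = part.rfind(':');
    if (colon == std::string::npos) continue;  // malformed; real code errors out
    std::string path = part.substr(0, colon);
    std::string cap = part.substr(colon + 1);
    int64_t mult = 1;
    if (cap.size() > 2 && cap.substr(cap.size() - 2) == "GB") {
      mult = 1LL << 30;
      cap.resize(cap.size() - 2);
    } else if (cap.size() > 2 && cap.substr(cap.size() - 2) == "MB") {
      mult = 1LL << 20;
      cap.resize(cap.size() - 2);
    }
    partitions.emplace_back(path, std::stoll(cap) * mult);
  }
  return partitions;
}
```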