[email protected] has posted comments on this change. ( http://gerrit.cloudera.org:8080/19532 )
Change subject: IMPALA-11904: Data cache support dumping for reloading ...................................................................... Patch Set 4: (4 comments) Thanks for your comment! Actually I have considered and tested what if the parameter changes before loading, and here are some explanations: 1. If data_cache changes, that may include add/remove partitions and increase/reduce partition capacity, since dump & load are granular to partitions level and each partition has an independent dump file located in the partition directory, so add/remove partitions has little effect, just adds new empty partitions or remove some partition with data. Increase partition capacity is no problem, that allows fully load all the dumped data and cache more later. If reduce partition capacity, then it will trigger cache entries evictions during loaded entries insertion process so that after loading the cache size has be reduced to less than the new capacity,and extra cache entries evicted. 2. If data_cache_checksum changes, there will be no effect from enabling to disabling, but from disabling to enabling will gradually cause all the data that has been loaded from dump to become invalid and evicted, because they cannot pass the VerifyChecksum. 3. If data_cache_file_max_size_bytes changes, it does not affect the existing cache files, but it will affect the files created later (and may also affect the newest existing cache file). 4. Other parameter changes, no matter. http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/runtime/io/data-cache.cc File be/src/runtime/io/data-cache.cc: http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/runtime/io/data-cache.cc@993 PS4, Line 993: std::ofstream file; : file.open(JoinPathSegments(path_, DUMP_FILE_NAME)); > What happens if the file already exists? For example, if we dump to > the file, then start back up, load it, and then we are shutting > down again. Should we be truncating any existing file? (e.g. > std::ofstream::trunc) > > Also, do we need to open the file in binary mode? (std::ofstream::binary) It doesn't matter if the file exists. The default open mode for ofstream is std::ostream::out which also implicitly includes std::ofstream::trunc, so if the file already exists it will be opened and truncated as if it were a new empty file. I have not found any difference yet between adding std::ofstream::binary or not, they both work, but I will add it later anyway just to be on the safe side. http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/runtime/io/data-cache.cc@1010 PS4, Line 1010: std::ifstream file; : file.open(JoinPathSegments(path_, DUMP_FILE_NAME)); > Do we need to open in binary mode? (std::ofstream::binary) Be consistent with above, open in binary mode too. http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/util/cache/cache-internal.h File be/src/util/cache/cache-internal.h: http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/util/cache/cache-internal.h@316 PS4, Line 316: handles.emplace_back( : reinterpret_cast<Cache::Handle*>(handle), Cache::HandleDeleter(this)); > I haven't tried this, but I think we should be able to move the > UniqueHandle into the vector. i.e. > > handles.push_back(std::move(handle)); > > This should also bring along the custom deleter. I tried but couldn't compile. Am I missing something? http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/util/cache/lirs-cache.cc File be/src/util/cache/lirs-cache.cc: http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/util/cache/lirs-cache.cc@1088 PS4, Line 1088: The UNPROTECTED : // entry can be in a recency list and a unprotected list at the same time, so a : // temporary list is required to remove duplicating unprotected entries when collecting. : std::list<LIRSHandle*> unprotecteds; > There are two ways to avoid constructing this list: > > 1) When iterating the recency_list_, don't do anything for > UNPROTECTED entries. Then, all the UNPROTECTED will be handled by > walking the unprotected list. > > 2) When iterating the recency_list_, put both PROTECTED and > UNPROTECTED into the handles vector. Then, when walking the > unprotected list, check whether it is linked on the recency_list_ > (i.e. e->recency_list_hook_.is_linked()) and only add it to the > handles vector if it is not. Thanks for the suggestion, I will fix it later. -- To view, visit http://gerrit.cloudera.org:8080/19532 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Id867f4fc7343898e4906332c3caa40eb57a03101 Gerrit-Change-Number: 19532 Gerrit-PatchSet: 4 Gerrit-Owner: Anonymous Coward <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Comment-Date: Thu, 02 Mar 2023 10:37:04 +0000 Gerrit-HasComments: Yes
