[email protected] has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19532 )

Change subject: IMPALA-11904: Data cache support dumping for reloading
......................................................................


Patch Set 4:

(4 comments)

Thanks for your comment! Actually I have considered and tested what if the 
parameter changes before loading, and here are some explanations:

1. If data_cache changes, that may include add/remove partitions and 
increase/reduce  partition capacity, since dump & load are granular to 
partitions level and each partition has an independent dump file located in the 
partition directory, so add/remove partitions has little effect, just adds new 
empty partitions or remove some partition with data. Increase partition 
capacity is no problem, that allows fully load all the dumped data and cache 
more later. If reduce partition capacity, then it will trigger cache entries 
evictions during loaded entries insertion process so that after loading the 
cache size has be reduced to less than the new capacity,and extra cache entries 
evicted.

2. If data_cache_checksum changes, there will be no effect from enabling to 
disabling, but from disabling to enabling will gradually cause all the data 
that has been loaded from dump to become invalid and evicted, because they 
cannot pass the VerifyChecksum.

3. If data_cache_file_max_size_bytes changes, it does not affect the existing 
cache files, but it will affect the files created later (and may also affect 
the newest existing cache file).

4. Other parameter changes, no matter.

http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/runtime/io/data-cache.cc@993
PS4, Line 993:   std::ofstream file;
             :   file.open(JoinPathSegments(path_, DUMP_FILE_NAME));
> What happens if the file already exists? For example, if we dump to
 > the file, then start back up, load it, and then we are shutting
 > down again. Should we be truncating any existing file? (e.g.
 > std::ofstream::trunc)
 >
 > Also, do we need to open the file in binary mode? (std::ofstream::binary)

It doesn't matter if the file exists.  The default open mode for ofstream is 
std::ostream::out which also implicitly includes std::ofstream::trunc, so if 
the file already exists it will be opened and truncated as if it were a new 
empty file.

I have not found any difference yet between adding std::ofstream::binary or 
not, they both work, but I will add it later anyway just to be on the safe side.


http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/runtime/io/data-cache.cc@1010
PS4, Line 1010:   std::ifstream file;
              :   file.open(JoinPathSegments(path_, DUMP_FILE_NAME));
> Do we need to open in binary mode? (std::ofstream::binary)

Be consistent with above, open in binary mode too.


http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/util/cache/cache-internal.h
File be/src/util/cache/cache-internal.h:

http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/util/cache/cache-internal.h@316
PS4, Line 316:         handles.emplace_back(
             :             reinterpret_cast<Cache::Handle*>(handle), 
Cache::HandleDeleter(this));
> I haven't tried this, but I think we should be able to move the
 > UniqueHandle into the vector. i.e.
 >
 > handles.push_back(std::move(handle));
 >
 > This should also bring along the custom deleter.

I tried but couldn't compile. Am I missing something?


http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/util/cache/lirs-cache.cc
File be/src/util/cache/lirs-cache.cc:

http://gerrit.cloudera.org:8080/#/c/19532/4/be/src/util/cache/lirs-cache.cc@1088
PS4, Line 1088:  The UNPROTECTED
              :   // entry can be in a recency list and a unprotected list at 
the same time, so a
              :   // temporary list is required to remove duplicating 
unprotected entries when collecting.
              :   std::list<LIRSHandle*> unprotecteds;
> There are two ways to avoid constructing this list:
 >
 > 1) When iterating the recency_list_, don't do anything for
 > UNPROTECTED entries. Then, all the UNPROTECTED will be handled by
 > walking the unprotected list.
 >
 > 2) When iterating the recency_list_, put both PROTECTED and
 > UNPROTECTED into the handles vector. Then, when walking the
 > unprotected list, check whether it is linked on the recency_list_
 > (i.e. e->recency_list_hook_.is_linked()) and only add it to the
 > handles vector if it is not.

Thanks for the suggestion, I will fix it later.



--
To view, visit http://gerrit.cloudera.org:8080/19532
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id867f4fc7343898e4906332c3caa40eb57a03101
Gerrit-Change-Number: 19532
Gerrit-PatchSet: 4
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Comment-Date: Thu, 02 Mar 2023 10:37:04 +0000
Gerrit-HasComments: Yes

Reply via email to