[ 
https://issues.apache.org/jira/browse/IMPALA-12908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885964#comment-17885964
 ] 

ASF subversion and git services commented on IMPALA-12908:
----------------------------------------------------------

Commit f11172a4a27e366908057c04842b91be90a573e1 in impala's branch 
refs/heads/master from Yida Wu
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f11172a4a ]

IMPALA-12908: Add correctness check for tuple cache

The patch adds an automated correctness check for the tuple
cache. The feature verifies the correctness of the tuple cache
by comparing caches with the same key across different queries.

The feature consists of two main components: cache dumping and
runtime correctness validation.

During the cache dumping phase, if a tuple cache is detected,
we retrieve the cache from the global cache and dump it to a
subdirectory as a reference file within the specified debug
dumping directory. The subdirectory is named after the cache
key. Additionally, data from the child is read and dumped to a
separate file in the same directory. We expect
these two files to be identical, assuming the results are
deterministic. Non-deterministic cases such as TOP-N may later
be detected and excluded from dumping.
Furthermore, the cache data will be transformed into a
human-readable text format on a row-by-row basis before dumping.
This approach allows for easier investigation and later analysis.
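
The dumping layout described above can be sketched roughly as
follows. This is only an illustration in Python, not the Impala
backend code: the function name dump_rows, the file names, and
the comma-separated row format are all assumptions made for the
example.

```python
import os


def dump_rows(dump_dir, cache_key, file_name, rows):
    """Write rows, one text line each, to <dump_dir>/<cache_key>/<file_name>."""
    # The subdirectory is named after the cache key.
    subdir = os.path.join(dump_dir, cache_key)
    os.makedirs(subdir, exist_ok=True)
    path = os.path.join(subdir, file_name)
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # Hypothetical human-readable format: comma-separated columns.
            f.write(",".join(str(col) for col in row) + "\n")
    return path
```

Calling dump_rows twice with the same cache key, once for the
cached rows and once for the rows read from the child, yields
two files in one subdirectory that should be identical when the
results are deterministic.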

The verification process starts by comparing the entire content
of the files sharing the same key. If the content matches, the
verification is considered successful. However, if the content
doesn't match, we enter a slower mode where we compare all the
rows individually. In the slow mode, we build a hash map from
the reference cache file, then iterate over the current cache
file row by row and check that every row exists in the hash map.
Additionally, a counter is integrated into the hash map to
handle scenarios involving duplicated rows. Once verification is
complete, if no discrepancies are found, both files will be removed.
If discrepancies are detected, the files are kept and renamed
with a '.bad' suffix.
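
A minimal sketch of the two-stage comparison, written in Python
for illustration rather than taken from the patch. The name
verify_dumps is hypothetical, rows are assumed to be one per
line, and the final check for rows left over in the reference
file is an assumption of this sketch, not something the commit
message states.

```python
import filecmp
import os
from collections import Counter


def verify_dumps(reference_path, current_path):
    """Return True if both dumps hold the same rows, ignoring row order."""
    # Fast path: byte-for-byte comparison of the two files.
    matched = filecmp.cmp(reference_path, current_path, shallow=False)
    if not matched:
        # Slow path: count every row of the reference file, then consume
        # one count for each row of the current file.
        with open(reference_path, encoding="utf-8") as f:
            remaining = Counter(line.rstrip("\n") for line in f)
        matched = True
        with open(current_path, encoding="utf-8") as f:
            for line in f:
                row = line.rstrip("\n")
                if remaining[row] == 0:
                    # Row is absent from the reference or duplicated too often.
                    matched = False
                    break
                remaining[row] -= 1
        # Assumption: rows left over in the reference also count as a mismatch.
        if matched and any(count > 0 for count in remaining.values()):
            matched = False
    for path in (reference_path, current_path):
        if matched:
            os.remove(path)
        else:
            os.rename(path, path + ".bad")
    return matched
```

The per-row counter is what lets the slow path accept files that
differ only in row order while still rejecting files whose
duplicate counts differ.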

New startup flags:
Added a startup flag tuple_cache_debug_dump_dir for specifying
the directory for dumping the result caches. If
tuple_cache_debug_dump_dir is empty, the feature is disabled.

Added a query option enable_tuple_cache_verification to enable
or disable the tuple cache verification. Default is true. Only
valid when tuple_cache_debug_dump_dir is specified.

Tests:
Ran the testcase test_tuple_cache_tpc_queries and caught known
inconsistencies.

Change-Id: Ied074e274ebf99fb57e3ee41a13148725775b77c
Reviewed-on: http://gerrit.cloudera.org:8080/21754
Reviewed-by: Michael Smith <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Joe McDonnell <[email protected]>


> Add a correctness verification mode for tuple caching
> -----------------------------------------------------
>
>                 Key: IMPALA-12908
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12908
>             Project: IMPALA
>          Issue Type: Task
>          Components: Backend
>    Affects Versions: Impala 4.4.0
>            Reporter: Joe McDonnell
>            Assignee: Yida Wu
>            Priority: Major
>
> To get more coverage of tuple caching correctness, it would be useful to have 
> automated correctness checking for tuple caching. In this mode, the tuple 
> cache node would fetch results from its child, persist the new results to 
> disk, then compare the new results to the cache contents at the end. The goal 
> is to be able to run a variety of queries, including various end-to-end tests 
> and verify that there is no variability in the results stored to the cache.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
