Yida Wu has posted comments on this change. ( http://gerrit.cloudera.org:8080/24282 )
Change subject: IMPALA-14961: Query Profile Redaction ...................................................................... Patch Set 4: (4 comments) http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction-test.cc File be/src/service/query-profile-redaction-test.cc: http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction-test.cc@79 PS4, Line 79: TEST(QueryProfileRedactionTest, RedactedProfileMatchesGoldenForTpcds72) { It seems we should add more tests here. One kind of test case I think we need is to use some mixed edge-case names for original tables or hosts (like test_table_001 or test_host_002) to have unit tests comparing the actual output http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc File be/src/service/query-profile-redaction.cc: http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@129 PS4, Line 129: std::sort(entries.begin(), entries.end(), : [](const std::pair<string, string>& a, const std::pair<string, string>& b) { : return a.first.size() > b.first.size(); : }); : for (const auto& entry : entries) { : if (entry.first.empty()) continue; : boost::algorithm::replace_all(text, entry.first, entry.second); : } The logic here seems vulnerable. For example, if the original tables (decoy_table_002, original_target) are aliased to (table_001, table_002), during the unredaction phase, if we replace table_001 with decoy_table_002 first, the text now contains the literal string 'table_002'. In the next pass, when the loop searches for 'table_002' to replace it with 'original_target', it will overwrite the substring inside 'decoy_table_002' by mistake. Can you confirm this? http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@387 PS4, Line 387: const string profile_text_copy(profile_text); : Document source_json; : source_json.Parse(profile_text_copy.c_str()); : if (source_json.HasParseError() || !source_json.IsObject()) { : return Status("Query profile input must be a valid JSON object"); : } We will call RedactQueryProfileWithAliases() later, but RedactQueryProfileWithAliases() also have "static string RedactQueryProfileWithAliases(const string_view& profile_text...) { ... const string profile_text_copy(profile_text); Document source_json; source_json.Parse(profile_text_copy.c_str()); if (source_json.HasParseError() || !source_json.IsObject()) return redacted;", it seems that we parse the same 'profile_text' twice. I think maybe we can remove this entirely and return the Status in RedactQueryProfileWithAliases(), or pass the source_json reference to the RedactQueryProfileWithAliases as the input. http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@393 PS4, Line 393: unordered_map<string, string> alias_to_original; : redacted_profile_text_ = RedactQueryProfileWithAliases(profile_text, alias_to_original); : alias_to_original_ = map<string, string>( : alias_to_original.cbegin(), alias_to_original.cend()); Is there a specific reason alias_to_original_ needs to be an ordered std::map? Changing alias_to_original_ to an std::unordered_map seems more clear and allow us to avoid extra copy step -- To view, visit http://gerrit.cloudera.org:8080/24282 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If0c5b4911a64888f319f212155df6e08c1800b32 Gerrit-Change-Number: 24282 Gerrit-PatchSet: 4 Gerrit-Owner: Gokul Kolady <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Gokul Kolady <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Jason Fehr <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Thu, 14 May 2026 05:30:06 +0000 Gerrit-HasComments: Yes
