Yida Wu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/24282 )

Change subject: IMPALA-14961: Query Profile Redaction
......................................................................


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction-test.cc
File be/src/service/query-profile-redaction-test.cc:

http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction-test.cc@79
PS4, Line 79: TEST(QueryProfileRedactionTest, 
RedactedProfileMatchesGoldenForTpcds72) {
It seems we should add more tests here. One kind of test case I think we need 
is to use some mixed edge-case names for original tables or hosts (like 
test_table_001 or test_host_002) to have unit tests comparing the actual output


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc
File be/src/service/query-profile-redaction.cc:

http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@129
PS4, Line 129:   std::sort(entries.begin(), entries.end(),
             :       [](const std::pair<string, string>& a, const 
std::pair<string, string>& b) {
             :         return a.first.size() > b.first.size();
             :       });
             :   for (const auto& entry : entries) {
             :     if (entry.first.empty()) continue;
             :     boost::algorithm::replace_all(text, entry.first, 
entry.second);
             :   }
The logic here seems vulnerable. For example, if the original tables 
(decoy_table_002, original_target) are aliased to (table_001, table_002), 
during the unredaction phase, if we replace table_001 with decoy_table_002 
first, the text now contains the literal string 'table_002'. In the next pass, 
when the loop searches for 'table_002' to replace it with 'original_target', it 
will overwrite the substring inside 'decoy_table_002' by mistake. Can you 
confirm this?


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@387
PS4, Line 387:   const string profile_text_copy(profile_text);
             :   Document source_json;
             :   source_json.Parse(profile_text_copy.c_str());
             :   if (source_json.HasParseError() || !source_json.IsObject()) {
             :     return Status("Query profile input must be a valid JSON 
object");
             :   }
We will call RedactQueryProfileWithAliases() later, but 
RedactQueryProfileWithAliases() also have
"static string RedactQueryProfileWithAliases(const string_view& 
profile_text...) {
  ...
  const string profile_text_copy(profile_text);
  Document source_json;
  source_json.Parse(profile_text_copy.c_str());
  if (source_json.HasParseError() || !source_json.IsObject()) return 
redacted;", it seems that we parse the same 'profile_text' twice. I think maybe 
we can remove this entirely and return the Status in 
RedactQueryProfileWithAliases(), or pass the source_json reference to the 
RedactQueryProfileWithAliases as the input.


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@393
PS4, Line 393:   unordered_map<string, string> alias_to_original;
             :   redacted_profile_text_ = 
RedactQueryProfileWithAliases(profile_text, alias_to_original);
             :   alias_to_original_ = map<string, string>(
             :       alias_to_original.cbegin(), alias_to_original.cend());
Is there a specific reason alias_to_original_ needs to be an ordered std::map? 
Changing alias_to_original_ to an std::unordered_map seems more clear and allow 
us to avoid extra copy step



--
To view, visit http://gerrit.cloudera.org:8080/24282
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If0c5b4911a64888f319f212155df6e08c1800b32
Gerrit-Change-Number: 24282
Gerrit-PatchSet: 4
Gerrit-Owner: Gokul Kolady <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Gokul Kolady <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jason Fehr <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Thu, 14 May 2026 05:30:06 +0000
Gerrit-HasComments: Yes

Reply via email to