Gokul Kolady has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/24282 )

Change subject: IMPALA-14961: Query Profile Redaction
......................................................................


Patch Set 6:

(34 comments)

http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction-test.cc
File be/src/service/query-profile-redaction-test.cc:

http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction-test.cc@79
PS4, Line 79: TEST(QueryProfileRedactionTest, 
RedactedProfileMatchesGoldenForTpcds72) {
> It seems we should add more tests here. One kind of test case I think we ne
Added a unit test for this edge case


http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction-test.cc
File be/src/service/query-profile-redaction-test.cc:

http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction-test.cc@124
PS5, Line 124:       "value": "select decoy_table_002.id from decoy_table_002 "
> line too long (127 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/24282/3/be/src/service/query-profile-redaction.cc
File be/src/service/query-profile-redaction.cc:

http://gerrit.cloudera.org:8080/#/c/24282/3/be/src/service/query-profile-redaction.cc@138
PS3, Line 138:
> Pass entries by reference.
This method has been removed as a result of CR changes


http://gerrit.cloudera.org:8080/#/c/24282/3/be/src/service/query-profile-redaction.cc@148
PS3, Line 148:         if (a.first.size() != b.first.size()) return 
a.first.size() > b.first.size();
> Pass parameters by reference.
This method has been removed as a result of CR changes


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc
File be/src/service/query-profile-redaction.cc:

http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@67
PS4, Line 67: static const std::regex EMAIL_RE(
> Since alias_to_original basically points to entries in original_to_alias we
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@74
PS4, Line 74: static const std::regex IP_RE(
            :     R"(\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}"
> No need to have a separate declaration for static functions.
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@80
PS4, Line 80: static string JsonEscapeString(const string_view& input) {
> use unordered_set<string_view> since ordering is not required and is more e
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@95
PS4, Line 95:   unordered_map<string_view, string_view> alias_to_original;
> We could use one of these sections, maybe "Per Host Min Memory Reservation"
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@96
PS4, Line 96:   alias_to_original.reserve(original_to_alias.size());
            :   for (const auto& entry : original_to_alias) {
> Declare all regexs after line 49 like this:
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@99
PS4, Line 99:   }
> We probably don't need an ordered set here so maybe use unordered_set.
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@115
PS4, Line 115:   vector<string> results;
> We don't need a temporary set here. We can simply iterate through the input
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@129
PS4, Line 129:   for (const string& token : tokens) {
             :     if (token.empty() || seen_tokens.find(token) != 
seen_tokens.end()) continue;
             :     seen_tokens.insert(token);
             :     std::stringstream alias;
             :     alias << prefix << "_" << std::setfill('0') << std::setw(3) 
<< (++idx);
             :     alias_map[token] = alias.str();
             :   }
             :   r
> The logic here seems vulnerable. For example, if the original tables (decoy
Updated this logic to do two passes. The first pass unredacts aliases into 
temporary unique names that won't get confused for other aliases, then the 
second pass unredacts these temporary unique names into the original strings.


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@143
PS4, Line 143:   for (const auto& entry : replacements) {
> Not sure why we need a temporary vector here ? We can pass the map by refer
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@154
PS4, Line 154: static bool ReplacementTokenCollides(const string& token, const 
string& text,
> Pass alias_to_original by reference to BuildAliasReverseMap to avoid temp c
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@154
PS4, Line 154: static bool ReplacementTokenCollides(const string& token, const 
string& text,
> Actually, this looks like copy elision so no change needed.
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@164
PS4, Line 164: static string BuildReplacementToken(size_t idx, const string& 
text,
> avoid local copy here, either pass string by reference to this function or
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@263
PS4, Line 263:         contexts.end(), 
std::make_move_iterator(child_contexts.begin()),
             :         std::make_move_it
> use unordered_set.
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@266
PS4, Line 266:   return contexts;
> This function is non trivial. Maybe add comments explaining what each of th
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@328
PS4, Line 328:   if (source_json.HasParseError() || !source_json.IsObject()) 
return {};
> We're making 2 copies here and in the following line. We need one copy for
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@334
PS4, Line 334:     const vector<string>& contexts) {
> This is also using copy elision so we are good.
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@334
PS4, Line 334:     const vector<string>& contexts) {
> Either pass sql_statements by reference to CollectInfoStringValuesByKeys or
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@336
PS4, Line 336:   unordered_set<string_view> column_set;
> Do we ever have multiple "Sql Statement" in the profile?
Updated logic to only use the first SQL statement occurrence.


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@339
PS4, Line 339:   for (const string& context : contexts) {
> This seems unnecessary since we've already redacted above ?
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@377
PS4, Line 377:       if (match.length(1) == 0) continue;
> We can use an unordered_map unless we really need ordered replacements in L
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@387
PS4, Line 387:   table_tokens.reserve(table_set.size());
> This seems redundant? We can directly pass profile_text.data() and profile_
Done


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@387
PS4, Line 387:   table_tokens.reserve(table_set.size());
             :   for (const string_view token : table_set) 
table_tokens.emplace_back(token);
             :   vector<string> column_tokens;
             :   column_tokens.reserve(column_set.size());
             :   for (const string_view token : column_set) 
column_tokens.emplace_back(token);
             :   s
> We will call RedactQueryProfileWithAliases() later, but RedactQueryProfileW
Removed redundant redaction


http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@393
PS4, Line 393:   std::sort(column_tokens.begin(), column_tokens.end());
             :   return {table_tokens, column_tokens};
             : }
             :
> Is there a specific reason alias_to_original_ needs to be an ordered std::m
Done


http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc
File be/src/service/query-profile-redaction.cc:

http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@75
PS5, Line 75:     R"(\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}"
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@303
PS5, Line 303: static vector<string> ExtractHostTokensFromPerHostSections(const 
Value& source_json) {
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@309
PS5, Line 309:       CollectInfoStringValuesByKeys(source_json, 
HOST_SECTION_KEYS);
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@344
PS5, Line 344:       if (match.length(1) == 0) continue;
> line too long (96 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@414
PS5, Line 414:   for (const string& value : user_values) 
username_tokens.insert(value);
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@421
PS5, Line 421:   redacted = BuildApplyAndTrackAliases(
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@422
PS5, Line 422:       username_tokens, "user", redacted, alias_to_original);
> line too long (93 > 90)
Done



--
To view, visit http://gerrit.cloudera.org:8080/24282
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If0c5b4911a64888f319f212155df6e08c1800b32
Gerrit-Change-Number: 24282
Gerrit-PatchSet: 6
Gerrit-Owner: Gokul Kolady <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Gokul Kolady <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jason Fehr <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Thu, 14 May 2026 20:40:20 +0000
Gerrit-HasComments: Yes

Reply via email to