Gokul Kolady has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/24282 )

Change subject: IMPALA-14961: Query Profile Redaction
......................................................................


Patch Set 7:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction-test.cc
File be/src/service/query-profile-redaction-test.cc:

http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction-test.cc@79
PS4, Line 79: TEST(QueryProfileRedactionTest, 
RedactedProfileMatchesGoldenForTpcds72) {
> Thanks. I think we can also add some negative test cases to verify it is no
Done


http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc
File be/src/service/query-profile-redaction.cc:

http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@51
PS5, Line 51: // Matches the analyzed query subsection embedded in the textual 
plan.
            : static const regex ANALYZED_RE(
            :     R"(Analyzed query:\s*([\s\S]*?)\n\nF\d+:PLAN FRAGMENT)", 
regex::optimize);
            : // Matches fully-qualified identifiers with at least one dot 
(e.g. db.tbl.col).
            : static const regex QUALIFIED_ID_RE(
            :     
R"(\b([A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)+)\b)", 
regex::optimize);
            : // Matches table tokens following FROM/JOIN clauses.
            : static const regex FROM_JOIN_TABLE_RE(
            :     R"(\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_\.]*)\b)",
            :     regex::optimize | regex::icase);
            : // Matches snake_case identifiers that are candidates for column 
tokens.
            : static const regex SNAKE_CASE_ID_RE(
            :     R"(\b([A-Za-z_][A-Za-z0-9_]*_[A-Za-z0-9_]*)\b)", 
regex::optimize);
            : // Matches e-mail addresses.
            : static const regex EMAIL_RE(
            :     R"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})",
            :     regex::optimize | regex::nosubs);
            : // Matches user/userid key-value pairs like user=alice.
            : static const regex USER_KV_RE(
            :     R"(\b(?:user|uid)=([A-Za-z0-9._@-]+)\b)", regex::optimize | 
regex::icase);
            : // Matches IPv4 addresses.
            : static const regex IPV4_RE(
            :     R"(\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3})"
            :     R"((?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b)",
            :     regex::optimize | regex::nosubs);
> Don't use the std:: prefix unless absolutely necessary to avoid compiler co
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc
File be/src/service/query-profile-redaction.cc:

http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@74
PS6, Line 74:     R"((?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b)",
> We should probably redact IPV6 addresses also.
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@75
PS6, Line 75: regex::optimize | regex::nosubs);
> This regex seems broken, it seems to be R"(\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@50
PS6, Line 50:     
R"(\b([A-Za-z][A-Za-z0-9-]*(?:\.[A-Za-z0-9-]+)*)\:(\d{2,5})\b)", 
regex::optimize);
            : // Matches the analyzed query subsection embedded in the textual 
plan.
            : static const regex ANALYZED_RE(
            :     R"(Analyzed query:\s*([\s\S]*?)\n\nF\d+:PLAN FRAGMENT)", 
regex::optimize);
            : // Matches fully-qualified identifiers with at least one dot 
(e.g. db.tbl.col).
            : static const regex QUALIFIED_ID_RE(
            :     
R"(\b([A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)+)\b)", 
regex::optimize);
            : // Matches table tokens following FROM/JOIN clauses.
            : static const regex FROM_JOIN_TABLE_RE(
            :     R"(\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_\.]*)\b)",
            :     regex::optimize | regex::icase);
            : // Matches snake_case identifiers that are candidates for column 
tokens.
            : static const regex SNAKE_CASE_ID_RE(
            :     R"(\b([A-Za-z_][A-Za-z0-9_]*_[A-Za-z0-9_]*)\b)", 
regex::optimize);
            : // Matches e-mail addresses.
            : static const regex EMAIL_RE(
            :     R"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})",
            :     regex::optimize | regex::nosubs);
            : // Matches user/userid key-value pairs like user=alice.
            : static const regex USER_KV_RE(
            :     R"(\b(?:user|uid)=([A-Za-z0-9._@-]+)\b)", regex::optimize | 
regex::icase);
            : // Matches IPv4 addresses.
            : static const regex IPV4_RE(
            :     R"(\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3})"
            :     R"((?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b)",
            :     regex::optimize | regex::nosubs);
            : // Matches IPv6 addresses, including compressed forms.
            : static const regex IPV6_RE(
> unit tests for theses regexes would be good.
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@98
PS6, Line 98:     const unordered_map<string, string>& original_to_alias) {
> we guarantee aliases are unique?
Yes


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@106
PS6, Line 106:
> I'm not sure what the purpose is of this intermediate unordered_set.  Why n
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@118
PS6, Line 118:     if (!seen_matches.insert(token).second) continue;
> I think I missed why we need to sort the matched results?
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@118
PS6, Line 118:     if (!seen_matches.insert(token).second) continue;
> According to the comment at https://gerrit.cloudera.org/c/24282/4..5/be/src
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@126
PS6, Line 126: static unordered_map<string, string> BuildAliasMap(
> seen_tokens is unecessary as you can look up in the alias_map.
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@132
PS6, Line 132: string alias(prefix);
             :     alias.push_back('_');
> We could avoid stringstream altogether and do string manipulation using app
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@134
PS6, Line 134:     string idx_string = to_string(++idx);
> when possible use emplace.
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@166
PS6, Line 166:
> We should probably add a boundary limit here (maybe 10,000) to avoid a pote
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@197
PS6, Line 197:     if (from.empty()) continue;
> nit. The comment is better to sit above the code, we can move it above the
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@244
PS6, Line 244:   }
             :   return {entry["key"].GetString(), entry["value"].GetString()};
             : }
             :
             : // Recursively collects profile text contexts that contain query 
identifiers.
             : static vector<string> 
CollectIdentifierContextsFromJsonProfile(const Value& node) {
             :   vector<string> contexts;
             :   if (node.IsArray()) {
             :     for (const auto& item : node.GetArray()) {
             :       auto child_contexts = 
CollectIdentifierContextsFromJsonProfile(item);
             :       contexts.insert(
             :           contexts.end(), 
make_move_iterator(child_contexts.begin()),
             :           make_move_iterator(child_contexts.end()));
             :     }
             :     return contexts;
             :   }
             :   if (!node.IsObject()) return contexts;
             :   for (auto it = node.MemberBegin(); it != node.MemberEnd(); 
++it) {
             :     if (strcmp(it->name.GetString(), "info_strings") == 0) {
             :       if (!it->value.IsArray()) continue;
             :       for (const auto& entry : it->value.GetArray()) {
             :
> It seems we process the info_strings node twice here. Also, calling HasMemb
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@307
PS6, Line 307:       if (*value != '\0') values.push_back(value);
> Now that BuildApplyAndTrackAliases is a templatized function, there is no n
Done


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@392
PS6, Line 392:     for (sregex_iterator it(context.cbegin(), context.cend(), 
SNAKE_CASE_ID_RE), end;
             :          it != end; ++it) {
> Is sorting required?
Comment no longer applicable


http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@405
PS6, Line 405:
             : // Redacts sensitive profile values and optionally records 
alias-to-original mappings.
             : static Status RedactQueryProfileWithAliases(const string_view& 
profile_text,
             :     const Value& source_json,
             :
> Do we expect all the 'Sql Statement' values to be exactly the same? I think
We used to iterate over all matches but based on earlier review comments 
decided to switch to only the first match because the SQL Statement section 
should only show up once in the profile (this is intended to find the original 
raw SQL statement because it may contain non-fully-qualified table/column 
names). As for other recreations of the SQL statement in the profile such as in 
the Analyzed Query section, we're okay with skipping those because we will 
catch any fully qualified table/column names in those sections during 
table/column redaction.



--
To view, visit http://gerrit.cloudera.org:8080/24282
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If0c5b4911a64888f319f212155df6e08c1800b32
Gerrit-Change-Number: 24282
Gerrit-PatchSet: 7
Gerrit-Owner: Gokul Kolady <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Gokul Kolady <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jason Fehr <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Thu, 14 May 2026 23:50:49 +0000
Gerrit-HasComments: Yes

Reply via email to