Abhishek Rawat has posted comments on this change. ( http://gerrit.cloudera.org:8080/24282 )
Change subject: IMPALA-14961: Query Profile Redaction ...................................................................... Patch Set 6: (7 comments) http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc File be/src/service/query-profile-redaction.cc: http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@74 PS6, Line 74: static const std::regex IP_RE( We should probably redact IPV6 addresses also. http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@50 PS6, Line 50: // Matches hostnames followed by a port (e.g. coordinator.example.com:22000). : static const std::regex HOST_WITH_PORT_RE( : R"(\b([A-Za-z][A-Za-z0-9-]*(?:\.[A-Za-z0-9-]+)*)\:(\d{2,5})\b)", regex::optimize); : // Matches the analyzed query subsection embedded in the textual plan. : static const std::regex ANALYZED_RE( : R"(Analyzed query:\s*([\s\S]*?)\n\nF\d+:PLAN FRAGMENT)", regex::optimize); : // Matches fully-qualified identifiers with at least one dot (e.g. db.tbl.col). : static const std::regex QUALIFIED_ID_RE( : R"(\b([A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)+)\b)", regex::optimize); : // Matches table tokens following FROM/JOIN clauses. : static const std::regex FROM_JOIN_TABLE_RE( : R"(\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_\.]*)\b)", : regex::optimize | regex::icase); : // Matches snake_case identifiers that are candidates for column tokens. : static const std::regex SNAKE_CASE_ID_RE( : R"(\b([A-Za-z_][A-Za-z0-9_]*_[A-Za-z0-9_]*)\b)", regex::optimize); : // Matches e-mail addresses. : static const std::regex EMAIL_RE( : R"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})", : regex::optimize | regex::nosubs); : // Matches user/userid key-value pairs like user=alice. : static const std::regex USER_KV_RE( : R"(\b(?:user|uid)=([A-Za-z0-9._@-]+)\b)", regex::optimize | regex::icase); : // Matches IPv4 addresses. : static const std::regex IP_RE( : R"(\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}" : R"((?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b)", : regex::optimize | regex::nosubs); unit tests for theses regexes would be good. http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@98 PS6, Line 98: alias_to_original[entry.second] = entry.first; we guarantee aliases are unique? http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@118 PS6, Line 118: std::sort(results.begin(), results.end()); I think I missed why we need to sort the matched results? http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@126 PS6, Line 126: unordered_set<string> seen_tokens; seen_tokens is unecessary as you can look up in the alias_map. http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@132 PS6, Line 132: std::stringstream alias; : alias << prefix << "_" << std::setfill('0') << std::setw(3) << (++idx); We could avoid stringstream altogether and do string manipulation using append() ? http://gerrit.cloudera.org:8080/#/c/24282/6/be/src/service/query-profile-redaction.cc@134 PS6, Line 134: alias_map[token] = alias.str(); when possible use emplace. -- To view, visit http://gerrit.cloudera.org:8080/24282 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If0c5b4911a64888f319f212155df6e08c1800b32 Gerrit-Change-Number: 24282 Gerrit-PatchSet: 6 Gerrit-Owner: Gokul Kolady <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Gokul Kolady <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Jason Fehr <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Thu, 14 May 2026 22:25:08 +0000 Gerrit-HasComments: Yes
