Gokul Kolady has posted comments on this change. ( http://gerrit.cloudera.org:8080/24282 )
Change subject: IMPALA-14961: Query Profile Redaction ...................................................................... Patch Set 6: (34 comments) http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction-test.cc File be/src/service/query-profile-redaction-test.cc: http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction-test.cc@79 PS4, Line 79: TEST(QueryProfileRedactionTest, RedactedProfileMatchesGoldenForTpcds72) { > It seems we should add more tests here. One kind of test case I think we ne Added a unit test for this edge case http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction-test.cc File be/src/service/query-profile-redaction-test.cc: http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction-test.cc@124 PS5, Line 124: "value": "select decoy_table_002.id from decoy_table_002 " > line too long (127 > 90) Done http://gerrit.cloudera.org:8080/#/c/24282/3/be/src/service/query-profile-redaction.cc File be/src/service/query-profile-redaction.cc: http://gerrit.cloudera.org:8080/#/c/24282/3/be/src/service/query-profile-redaction.cc@138 PS3, Line 138: > Pass entries by reference. This method has been removed as a result of CR changes http://gerrit.cloudera.org:8080/#/c/24282/3/be/src/service/query-profile-redaction.cc@148 PS3, Line 148: if (a.first.size() != b.first.size()) return a.first.size() > b.first.size(); > Pass parameters by reference. This method has been removed as a result of CR changes http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc File be/src/service/query-profile-redaction.cc: http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@67 PS4, Line 67: static const std::regex EMAIL_RE( > Since alias_to_original basically points to entries in original_to_alias we Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@74 PS4, Line 74: static const std::regex IP_RE( : R"(\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}" > No need to have a separate declaration for static functions. Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@80 PS4, Line 80: static string JsonEscapeString(const string_view& input) { > use unordered_set<string_view> since ordering is not required and is more e Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@95 PS4, Line 95: unordered_map<string_view, string_view> alias_to_original; > We could use one of these sections, maybe "Per Host Min Memory Reservation" Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@96 PS4, Line 96: alias_to_original.reserve(original_to_alias.size()); : for (const auto& entry : original_to_alias) { > Declare all regexs after line 49 like this: Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@99 PS4, Line 99: } > We probably don't need an ordered set here so maybe use unordered_set. Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@115 PS4, Line 115: vector<string> results; > We don't need a temporary set here. We can simply iterate through the input Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@129 PS4, Line 129: for (const string& token : tokens) { : if (token.empty() || seen_tokens.find(token) != seen_tokens.end()) continue; : seen_tokens.insert(token); : std::stringstream alias; : alias << prefix << "_" << std::setfill('0') << std::setw(3) << (++idx); : alias_map[token] = alias.str(); : } : r > The logic here seems vulnerable. For example, if the original tables (decoy Updated this logic to do two passes. The first pass unredacts aliases into temporary unique names that won't get confused for other aliases, then the second pass unredacts these temporary unique names into the original strings. http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@143 PS4, Line 143: for (const auto& entry : replacements) { > Not sure why we need a temporary vector here ? We can pass the map by refer Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@154 PS4, Line 154: static bool ReplacementTokenCollides(const string& token, const string& text, > Pass alias_to_original by reference to BuildAliasReverseMap to avoid temp c Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@154 PS4, Line 154: static bool ReplacementTokenCollides(const string& token, const string& text, > Actually, this looks like copy elision so no change needed. Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@164 PS4, Line 164: static string BuildReplacementToken(size_t idx, const string& text, > avoid local copy here, either pass string by reference to this function or Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@263 PS4, Line 263: contexts.end(), std::make_move_iterator(child_contexts.begin()), : std::make_move_it > use unordered_set. Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@266 PS4, Line 266: return contexts; > This function is non trivial. Maybe add comments explaining what each of th Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@328 PS4, Line 328: if (source_json.HasParseError() || !source_json.IsObject()) return {}; > We're making 2 copies here and in the following line. We need one copy for Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@334 PS4, Line 334: const vector<string>& contexts) { > This is also using copy elision so we are good. Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@334 PS4, Line 334: const vector<string>& contexts) { > Either pass sql_statements by reference to CollectInfoStringValuesByKeys or Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@336 PS4, Line 336: unordered_set<string_view> column_set; > Do we ever have multiple "Sql Statement" in the profile? Updated logic to only use the first SQL statement occurrence. http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@339 PS4, Line 339: for (const string& context : contexts) { > This seems unnecessary since we've already redacted above ? Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@377 PS4, Line 377: if (match.length(1) == 0) continue; > We can use an unordered_map unless we really need ordered replacements in L Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@387 PS4, Line 387: table_tokens.reserve(table_set.size()); > This seems redundant? We can directly pass profile_text.data() and profile_ Done http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@387 PS4, Line 387: table_tokens.reserve(table_set.size()); : for (const string_view token : table_set) table_tokens.emplace_back(token); : vector<string> column_tokens; : column_tokens.reserve(column_set.size()); : for (const string_view token : column_set) column_tokens.emplace_back(token); : s > We will call RedactQueryProfileWithAliases() later, but RedactQueryProfileW Removed redundant redaction http://gerrit.cloudera.org:8080/#/c/24282/4/be/src/service/query-profile-redaction.cc@393 PS4, Line 393: std::sort(column_tokens.begin(), column_tokens.end()); : return {table_tokens, column_tokens}; : } : > Is there a specific reason alias_to_original_ needs to be an ordered std::m Done http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc File be/src/service/query-profile-redaction.cc: http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@75 PS5, Line 75: R"(\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}" > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@303 PS5, Line 303: static vector<string> ExtractHostTokensFromPerHostSections(const Value& source_json) { > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@309 PS5, Line 309: CollectInfoStringValuesByKeys(source_json, HOST_SECTION_KEYS); > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@344 PS5, Line 344: if (match.length(1) == 0) continue; > line too long (96 > 90) Done http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@414 PS5, Line 414: for (const string& value : user_values) username_tokens.insert(value); > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@421 PS5, Line 421: redacted = BuildApplyAndTrackAliases( > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/24282/5/be/src/service/query-profile-redaction.cc@422 PS5, Line 422: username_tokens, "user", redacted, alias_to_original); > line too long (93 > 90) Done -- To view, visit http://gerrit.cloudera.org:8080/24282 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If0c5b4911a64888f319f212155df6e08c1800b32 Gerrit-Change-Number: 24282 Gerrit-PatchSet: 6 Gerrit-Owner: Gokul Kolady <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Gokul Kolady <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Jason Fehr <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Thu, 14 May 2026 20:40:20 +0000 Gerrit-HasComments: Yes
