Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/14894 )
Change subject: IMPALA-9009: Core support for Ranger column masking ...................................................................... Patch Set 12: (7 comments) Thanks for your detailed reviews, guys! Will fix the bug found. About our column masking behaviors, there're more details in * experiment doc: https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing * Mailing-list thread: https://lists.apache.org/thread.html/93c4a6a7f20c88e9472067074c9746b99f0a47880aaed1417b9771ac%40%3Cdev.impala.apache.org%3E * design doc: https://docs.google.com/document/d/1GC7au6F5Snp8zQisRopOhKSjKsI1XPPg8S2foQxfJrA/edit# http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/main/java/org/apache/impala/analysis/TableRef.java File fe/src/main/java/org/apache/impala/analysis/TableRef.java: http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/main/java/org/apache/impala/analysis/TableRef.java@27 PS12, Line 27: import java.util.UUID; > Is this needed? Don't see it getting used anywhere in the diffs. Done http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java File fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java: http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java@283 PS12, Line 283: maskType > Do we need to add Preconditions here to check if maskType and maskTypeDef a Sure. Will add the Preconditions check for maskType. Actually if maskType is null, accessResult.isMaskEnabled() will return false. But we do need to verify it. For maskTypeDef, it can be null and we have dealt with it at line 296. http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java@290 PS12, Line 290: mask > Curious to understand: Do these masks work as UDFs? How are they loaded? Yes, these builtin functions are being added in another patch: https://gerrit.cloudera.org/c/14963/ http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/test/java/org/apache/impala/authorization/AuthorizationStmtTest.java File fe/src/test/java/org/apache/impala/authorization/AuthorizationStmtTest.java: http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/test/java/org/apache/impala/authorization/AuthorizationStmtTest.java@a2923 PS12, Line 2923: > As I recall, the second argument to the function substr() starts from 1 ins Sure. It's added long time before in other patches and this doesn't trigger analysis exceptions. Will change 0 to 1 if needed. http://gerrit.cloudera.org:8080/#/c/14894/12/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test File testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test: http://gerrit.cloudera.org:8080/#/c/14894/12/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test@227 PS12, Line 227: 100 > If my understanding is correct, both the table 'alltypestiny' and the table Changing 100 to 1234 then the results is empty. Actually if Impala incorrectly performs column masking on the local view 'alltypes', 100 will be masked into 10000 (id => id * 100) and the results is empty. So I think 100 is better than 1234. http://gerrit.cloudera.org:8080/#/c/14894/12/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test@240 PS12, Line 240: 0aaaaaa > > I don't understand this value - the mask on alltypestiny.string_col should > > only add 'aaa', not 'aaaaaa'. Or I misunderstood something? > It looks like the 3rd column masking policy (the one concatenating the > content of {col} with 'aaa') is applied twice. One in "select id, bool_col, > string_col from alltypestiny" and the other in "select iv1.*, iv2.*, > v.string_col from iv1, iv2, alltypes_view v". I am not very sure if my > understanding is correct. Sorry that this is a bug... We should not apply the masking policy on alltypestiny.string_col twice. Didn't check the copy-pasted result carefully... > Intuitively, when column masking is enabled, the number of rows returned > should still be the same. No. As discussed in the mailing-list: https://lists.apache.org/thread.html/93c4a6a7f20c88e9472067074c9746b99f0a47880aaed1417b9771ac%40%3Cdev.impala.apache.org%3E, we choose to be consistent with Hive's behavior. So all expressions using a column with a masking policy will perform on its *masked* value. The number of rows could change due to this. > After some thoughts, I have another related question. According to the tests > provided in this file, it seems that the predicates involved in a join are > evaluated on the masked/transformed column values if there exist some > corresponding column masking policies. Yes, this is Hive's behavior. > If this is the case, I was wondering whether or not the order of tables in > the from clause would affect the results of the join. To be more specific, is > it guaranteed that the following two SQL statements produce the same set of > result given "any" set of column masking policies and "any" set of source > tables? Thanks! > > Query 1: select a.id, s.bool_col, t.bool_col, s.string_col, t.string_col from functional.alltypes a, functional.alltypestiny t, functional.alltypessmall s where a.id = t.id and t.id = s.id; > > Query 2: select a.id, s.bool_col, t.bool_col, s.string_col, t.string_col from functional.alltypes a, functional.alltypessmall s, functional.alltypestiny t where a.id = t.id and t.id = s.id; I think for INNER join, it's guaranteed since the join order doesn't affect the result. You can image 'alltypestiny' as a new table with masked values. And the same for 'alltypes' that also has column masking policies. http://gerrit.cloudera.org:8080/#/c/14894/12/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test@261 PS12, Line 261: Should not mask the columns when used in sql generations > I am a bit confused by this comment. Does that mean after creating the view Sorry to make you misunderstand the comment. "sql generation" here means generating the query string for CreateView or AlterView statements. For such kinds of statements, Impala will analyze their AST (e.g. for syntax check or some rewrites). Then regenerate the QueryStmt part by using toSql() of the AST node ( https://github.com/apache/impala/blob/497a17d/fe/src/main/java/org/apache/impala/analysis/CreateOrAlterViewStmtBase.java#L150) and use the generated sql to create/alter view in Hive. The comment means although we have column masking policies on table 'alltypestiny', we don't rewrite any queries on it used in CreateView or AlterView. So the result of ShowCreateView won't be something like CREATE VIEW $UNIQUE_DB.masked_view AS SELECT * FROM ( select id * 100 as id, bool_col, tinyint_col, ... from functional.alltypestiny) t -- To view, visit http://gerrit.cloudera.org:8080/14894 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4cad60e0e69ea573b7ecfc011b142c46ef52ed61 Gerrit-Change-Number: 14894 Gerrit-PatchSet: 12 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Fang-Yu Rao <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Kurt Deschler <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Vihang Karajgaonkar <[email protected]> Gerrit-Comment-Date: Tue, 07 Jan 2020 03:36:31 +0000 Gerrit-HasComments: Yes
