Thanks for your feedback, guys! We have decided to implement the same behavior as Hive first. The Epic for Column Masking is here: https://issues.apache.org/jira/browse/IMPALA-8981

We'll start with custom masking types, which don't depend on any builtin masking functions: https://issues.apache.org/jira/browse/IMPALA-9009
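
For reference, here is a minimal sketch of the two behaviors, (a) and (b), from the original mail quoted below. It uses SQLite through Python's `sqlite3` module as a stand-in engine, so it is not Impala or Hive code. The `mask_last4` function is a hypothetical mask registered only for this illustration, and `||` replaces `concat` because older SQLite builds lack `concat`. Behavior (a) is modelled by substituting the table reference with a subquery over masked columns. Behavior (b) is modelled by masking only in the outermost select list.

```python
import sqlite3


def mask_last4(value):
    """Hypothetical mask: replace all but the last 4 characters with 'x'."""
    if value is None:
        return None
    return "x" * max(len(value) - 4, 0) + value[-4:]


conn = sqlite3.connect(":memory:")
conn.create_function("mask_last4", 1, mask_last4)
conn.execute("CREATE TABLE customer (name TEXT, phone TEXT)")
conn.execute("INSERT INTO customer VALUES ('Bob', '123456789')")

# Behavior (a): the table reference is replaced by a subquery that already
# exposes masked columns, so every predicate sees the masked value.
MASKED_TABLE = "(SELECT name, mask_last4(phone) AS phone FROM customer)"


def run_behavior_a(phone_literal):
    sql = (
        "SELECT name || phone FROM ("
        f"  SELECT name, phone FROM {MASKED_TABLE} AS customer WHERE phone = ?"
        ") t"
    )
    return [row[0] for row in conn.execute(sql, (phone_literal,))]


def run_behavior_b(phone_literal):
    # Behavior (b): predicates see the real value; masking happens only in
    # the outermost select list.
    sql = (
        "SELECT name || mask_last4(phone) FROM ("
        "  SELECT name, phone FROM customer WHERE phone = ?"
        ") t"
    )
    return [row[0] for row in conn.execute(sql, (phone_literal,))]


for literal in ("123456789", "xxxxx6789"):
    print(literal, "(a):", run_behavior_a(literal), "(b):", run_behavior_b(literal))
```

With the single row for Bob, behavior (a) returns a row only for the masked literal, and behavior (b) returns a row only for the real one.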
Cheers,
Quanlong

On Tue, Nov 19, 2019 at 11:53 AM Kurt Deschler <kdesc...@cloudera.com> wrote:

> I got a little info from Guther on this. Apparently the masking behavior was
> being driven by specific customer(s) at the time, and masking was applied to
> all column references due to concerns about leaking data. Regardless of the
> reasoning, we have to follow the semantics that Hive has at this point. We
> could always provide the other [top-level select list only] mode later if
> that is requested.
>
> On Thu, Nov 14, 2019 at 8:17 PM Shant Hovsepian <sh...@arcadiadata.com> wrote:
>
> > Any sense of what the consumers and end users have asked for regarding
> > behavior?
> >
> > On Tue, Nov 12, 2019, 1:57 PM Todd Lipcon <t...@cloudera.com> wrote:
> >
> > > I'd agree that applying it at the innermost column ref makes the most
> > > sense from a security perspective. Otherwise it's trivial to "binary
> > > search" your way to the value of a masked column, even if the masking is
> > > completely "xed" out.
> > >
> > > I'm surprised to hear that DB2 implements it otherwise, though quick
> > > googling agrees with that. Perhaps the assumption there is that anyone
> > > who is binary-searching to expose data will be caught by audit or other
> > > security features.
> > >
> > > -Todd
> > >
> > > On Tue, Nov 12, 2019 at 10:15 AM Tim Armstrong <tarmstr...@cloudera.com> wrote:
> > >
> > > > I think compatibility with Hive is pretty important - the default
> > > > expectation will be that Ranger policies behave consistently across
> > > > SQL engines. I think it would be hard to argue for differing default
> > > > behaviour if it's in some sense less secure.
> > > >
> > > > On Tue, Nov 12, 2019 at 12:03 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
> > > >
> > > > > Hey Quanlong,
> > > > >
> > > > > To me it seems more important not to leak confidential information,
> > > > > so I'd vote for (a). I wonder what others think.
> > > > >
> > > > > Gabor
> > > > >
> > > > > On Mon, Nov 11, 2019 at 1:04 PM Quanlong Huang <huangquanl...@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We are adding support for Ranger column masking and need to reach
> > > > > > a consensus on the behavior design.
> > > > > >
> > > > > > A column masking policy is something like "only show the last 4
> > > > > > chars of the phone column to user X". When user X reads the phone
> > > > > > column, the value would be something like "xxxxx6789" instead of
> > > > > > the real value "123456789".
> > > > > >
> > > > > > The behavior is clear when the query is simple. However, there are
> > > > > > two possible behaviors when the query contains subqueries. The key
> > > > > > question is where we should perform the masking: in the outermost
> > > > > > select list, or in the select list of the innermost subquery.
> > > > > >
> > > > > > To be specific, consider these two queries:
> > > > > > (1) subquery contains a predicate on the unmasked value
> > > > > > SELECT concat(name, phone) FROM (
> > > > > >   SELECT name, phone FROM customer WHERE phone = '123456789'
> > > > > > ) t;
> > > > > > (2) subquery contains a predicate on the masked value
> > > > > > SELECT concat(name, phone) FROM (
> > > > > >   SELECT name, phone FROM customer WHERE phone = 'xxxxx6789'
> > > > > > ) t;
> > > > > >
> > > > > > Let's say there's actually one row in table 'customer' satisfying
> > > > > > phone = '123456789'. When user X runs the queries, the two
> > > > > > different behaviors are:
> > > > > > (a) Query1 returns nothing. Query2 returns one result: "Bobxxxxx6789".
> > > > > > (b) Query1 returns one result: "Bobxxxxx6789". Query2 returns nothing.
> > > > > >
> > > > > > Hive implements behavior (a): it performs table masking by
> > > > > > replacing the TableRef with a subquery containing the masked
> > > > > > columns. See the code for details:
> > > > > > https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155
> > > > > > and some experiments I did:
> > > > > > https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing
> > > > > >
> > > > > > Kurt mentions that traditional databases like DB2 implement
> > > > > > behavior (b). I think we need to decide which behavior we'd like
> > > > > > to support. The advantage of behavior (a) is that there is no
> > > > > > security leak, because user X can't guess whether there are any
> > > > > > customers with phone number '123456789'. The advantage of behavior
> > > > > > (b) is that users don't need to rewrite their existing queries
> > > > > > after an admin applies column masking policies.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Thanks,
> > > > > > Quanlong
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
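
To make the "binary search" concern raised in the thread concrete, here is a rough sketch under stated assumptions. It again uses SQLite via Python's `sqlite3` as a stand-in engine. The phone is assumed to be a fixed-width 9-digit string, the table holds the single row from the example, and `mask_last4` and `probe` are hypothetical names introduced only for this illustration. The attacker never sees the unmasked column. They only observe whether a range predicate inside the subquery lets a row through.

```python
import sqlite3


def mask_last4(value):
    """Hypothetical mask: replace all but the last 4 characters with 'x'."""
    if value is None:
        return None
    return "x" * max(len(value) - 4, 0) + value[-4:]


conn = sqlite3.connect(":memory:")
conn.create_function("mask_last4", 1, mask_last4)
conn.execute("CREATE TABLE customer (name TEXT, phone TEXT)")
conn.execute("INSERT INTO customer VALUES ('Bob', '123456789')")

# Behavior (b): predicates are evaluated against the raw table.
UNMASKED = "customer"
# Behavior (a): predicates are evaluated against already-masked columns.
MASKED = "(SELECT name, mask_last4(phone) AS phone FROM customer)"


def row_visible(table_expr, lower_bound):
    """Return True if any row passes the inner predicate phone >= lower_bound."""
    sql = f"SELECT 1 FROM (SELECT phone FROM {table_expr} AS c WHERE phone >= ?) t"
    return conn.execute(sql, (lower_bound,)).fetchone() is not None


def probe(table_expr):
    """Binary-search the largest 9-digit bound for which a row is still visible."""
    lo, hi = 0, 10**9 - 1  # invariant: a row is visible for bound lo
    queries = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        queries += 1
        # Zero-padded strings of equal width compare like the numbers they encode.
        if row_visible(table_expr, f"{mid:09d}"):
            lo = mid
        else:
            hi = mid - 1
    return f"{lo:09d}", queries


print("behavior (b):", probe(UNMASKED))
print("behavior (a):", probe(MASKED))
```

In this sketch the probe against the raw table recovers the real phone number in at most 30 queries. The same probe against the masked subquery only ever compares with "xxxxx6789", so it reveals nothing beyond what the mask already shows.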