Thanks for your feedback, guys!

We have finally decided to implement the same behavior as Hive first. The
Epic for Column Masking is here:
https://issues.apache.org/jira/browse/IMPALA-8981
We'll start with custom masking types, which don't depend on any builtin
masking functions: https://issues.apache.org/jira/browse/IMPALA-9009

Cheers,
Quanlong

On Tue, Nov 19, 2019 at 11:53 AM Kurt Deschler <kdesc...@cloudera.com>
wrote:

> I got a little info from Guther on this. Apparently the masking behavior was
> being driven by specific customer(s) at the time and was applied to all
> column references due to concerns about leaking data. Regardless of the
> reasoning, we have to follow the semantics that Hive has at this point. We
> could always provide the other [top-level select list only] mode later if
> that is requested.
>
> On Thu, Nov 14, 2019 at 8:17 PM Shant Hovsepian <sh...@arcadiadata.com>
> wrote:
>
> > Any sense of what the consumers and end users have asked for regarding
> > behavior?
> >
> > On Tue, Nov 12, 2019, 1:57 PM Todd Lipcon <t...@cloudera.com> wrote:
> >
> > > I'd agree that applying it at the innermost column ref makes the most
> > > sense from a security perspective. Otherwise it's trivial to "binary
> > > search" your way to the value of a masked column, even if the masking is
> > > completely "xed" out.
> > >
> > > I'm surprised to hear that DB2 implements it otherwise, though quick
> > > googling agrees with that. Perhaps the assumption there is that anyone
> > > who is binary-searching to expose data will be caught by audit or other
> > > security features.
> > >
> > > -Todd
> > >
> > > On Tue, Nov 12, 2019 at 10:15 AM Tim Armstrong <tarmstr...@cloudera.com>
> > > wrote:
> > >
> > > > I think compatibility with Hive is pretty important - the default
> > > > expectation will be that Ranger policies behave consistently across SQL
> > > > engines. I think it would be hard to argue for differing default
> > > > behaviour if it's in some sense less secure.
> > > >
> > > > On Tue, Nov 12, 2019 at 12:03 AM Gabor Kaszab <gaborkas...@apache.org>
> > > > wrote:
> > > >
> > > > > Hey Quanlong,
> > > > >
> > > > > For me it seems more important not to leak confidential information,
> > > > > so I'd vote for (a). I wonder what others think.
> > > > >
> > > > > Gabor
> > > > >
> > > > > On Mon, Nov 11, 2019 at 1:04 PM Quanlong Huang <huangquanl...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We are adding support for Ranger column masking and need to reach a
> > > > > > consensus on the behavior design.
> > > > > >
> > > > > > A column masking policy is something like "only show last 4 chars of
> > > > > > phone column to user X". When user X reads the phone column, the value
> > > > > > would be something like "xxxxx6789" instead of the real value
> > > > > > "123456789".
> > > > > >
> > > > > > The behavior is clear when the query is simple. However, there are two
> > > > > > different behaviors when the query contains subqueries. The key question
> > > > > > is where we should perform the masking: in the outermost select list, or
> > > > > > in the select list of the innermost subquery.
> > > > > >
> > > > > > To be specific, consider these two queries:
> > > > > > (1) subquery contains predicates on unmasked value
> > > > > >   SELECT concat(name, phone) FROM (
> > > > > >     SELECT name, phone FROM customer WHERE phone = '123456789'
> > > > > >   ) t;
> > > > > > (2) subquery contains predicates on masked value
> > > > > >   SELECT concat(name, phone) FROM (
> > > > > >     SELECT name, phone FROM customer WHERE phone = 'xxxxx6789'
> > > > > >   ) t;
> > > > > >
> > > > > > Let's say there's actually one row in table 'customer' satisfying
> > > > > > phone = '123456789'. When user X runs the queries, the two different
> > > > > > behaviors are:
> > > > > > (a) Query1 returns nothing. Query2 returns one result: "Bobxxxxx6789".
> > > > > > (b) Query1 returns one result: "Bobxxxxx6789". Query2 returns nothing.
> > > > > >
> > > > > > Hive follows behavior (a), since it does table masking that replaces
> > > > > > the TableRef with a subquery containing masked columns. See the code:
> > > > > > https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155
> > > > > > and some experiments I did:
> > > > > > https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing
> > > > > >
> > > > > > Kurt mentions that traditional DBs like DB2 follow behavior (b). I
> > > > > > think we need to decide which behavior we'd like to support. The
> > > > > > advantage of behavior (a) is that there is no security leak: user X
> > > > > > can't guess whether there are customers with phone number '123456789'.
> > > > > > The advantage of behavior (b) is that users don't need to rewrite their
> > > > > > existing queries after an admin applies column masking policies.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Thanks,
> > > > > > Quanlong
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
>
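
For readers following the thread, the two behaviors can be reproduced with a
small sketch. This is not Hive's or Impala's implementation; it uses Python's
built-in sqlite3 module as a stand-in engine. The helper names
(mask_show_last_4, run_behavior_a, run_behavior_b) are hypothetical, and
SQLite's || operator is used in place of concat(). Behavior (a) is modeled by
replacing the table reference with a subquery over masked columns, as the
original message describes for Hive; behavior (b) is modeled by masking only
in the outermost select list.

```python
import sqlite3


def mask_show_last_4(value):
    """Hypothetical mask: keep the last 4 characters, 'x' out the rest."""
    if value is None:
        return None
    keep = value[-4:]
    return "x" * (len(value) - len(keep)) + keep


conn = sqlite3.connect(":memory:")
conn.create_function("mask_show_last_4", 1, mask_show_last_4)
conn.execute("CREATE TABLE customer (name TEXT, phone TEXT)")
conn.execute("INSERT INTO customer VALUES ('Bob', '123456789')")

# Behavior (a): the table reference itself is swapped for a subquery that
# exposes only masked columns, so every predicate sees masked values.
MASKED_CUSTOMER = (
    "(SELECT name, mask_show_last_4(phone) AS phone FROM customer) AS customer"
)


def run_behavior_a(phone_literal):
    sql = (
        "SELECT name || phone FROM ("
        f"  SELECT name, phone FROM {MASKED_CUSTOMER} WHERE phone = ?"
        ") t"
    )
    return [row[0] for row in conn.execute(sql, (phone_literal,))]


# Behavior (b): predicates see the real values; masking is applied only
# where the column is referenced in the outermost select list.
def run_behavior_b(phone_literal):
    sql = (
        "SELECT name || mask_show_last_4(phone) FROM ("
        "  SELECT name, phone FROM customer WHERE phone = ?"
        ") t"
    )
    return [row[0] for row in conn.execute(sql, (phone_literal,))]


for literal in ("123456789", "xxxxx6789"):
    print(literal, "(a):", run_behavior_a(literal), "(b):", run_behavior_b(literal))
```

Under (b), a user who can only see masked output can still confirm a guessed
phone number by checking whether the query returns a row, which is the
"binary search" concern raised earlier in the thread.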
