[ 
https://issues.apache.org/jira/browse/IMPALA-13667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

woosuk.ro updated IMPALA-13667:
-------------------------------
    Description: 
h3. *Description*

When a Ranger {{mask_hash}} policy is applied to a column in Impala, each view 
that references the column adds another {{mask_hash}} call. As a result, the 
column is hashed once per layer, producing nested {{mask_hash}} operations 
instead of a single masking step.
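
To show why the nesting matters, here is a minimal Python sketch. It assumes {{mask_hash}} behaves like a SHA-256 digest rendered as lowercase hex; that is an assumption made for illustration, not something verified against the Impala build in this report. The function name and the sample value are hypothetical.

```python
import hashlib


def mask_hash_stand_in(value: str) -> str:
    """Hypothetical stand-in for mask_hash: SHA-256 of the UTF-8 bytes as lowercase hex."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


account_number = "1234-5678"  # made-up sample value

once = mask_hash_stand_in(account_number)   # what a single masking step would return
twice = mask_hash_stand_in(once)            # what a table plus one view layer returns

# Hashing the hash gives a different value, so the result of a query through
# a view no longer matches the singly masked value of the same column.
print(once == twice)          # False
print(len(once), len(twice))  # 64 64
```

Under this assumption, the masked value a user sees depends on how many view layers the query passes through, so the same {{account_number}} cannot be compared or joined across layers.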
h3. *Steps to Reproduce*

1. In Ranger, apply a {{mask_hash}} policy to a column (e.g., 
{{account_number}}) across a database.

2. Create a Base Table:
{code:java}
CREATE TABLE private_db.base_table (
    account_number STRING,
    other_column STRING
);{code}
3. Create a View Referencing the Base Table:
{code:java}
CREATE VIEW private_db.base_view AS
SELECT * FROM private_db.base_table;{code}
4. Query the View:
{code:java}
SELECT * FROM private_db.base_view;{code}
5. Inspect the query plan or the Ranger audit logs: the {{mask_hash}} calls 
are nested.
h3. *Expected Behavior*

{{mask_hash}} should apply once per column per query, regardless of view layers.
----
h3. *Actual Behavior*

{{mask_hash}} is invoked multiple times (once for each view layer), causing 
repeated hashing.

*Ranger Audit Logs:*

*1. private_db.base_view account_number column masking*
{code:java}
{
    "access": "mask_hash",
    "resource": "private_db/base_view/account_number",
    "resType": "@column",
    "reqData": "SELECT * FROM temp_db.secondary_view"
}{code}
*2. private_db.base_table account_number column masking*
{code:java}
{
    "access": "mask_hash",
    "resource": "private_db/base_table/account_number",
    "resType": "@column",
    "reqData": "SELECT * FROM temp_db.secondary_view"
}{code}
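
The audit entries above are consistent with the policy being matched once per relation while the view is expanded. The Python sketch below is a toy model of that behavior, not Impala's actual rewrite code: {{Relation}}, {{masked_expr}} and {{masked_dbs}} are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Relation:
    """A table or view; `source` is the relation a view selects from (None for a table)."""
    db: str
    name: str
    source: Optional["Relation"] = None


def masked_expr(rel: Relation, column: str, masked_dbs: Set[str]) -> str:
    """Build the output expression for `column`, masking at every matching layer."""
    # Start from the underlying relation's expression, or the bare column for a table.
    inner = column if rel.source is None else masked_expr(rel.source, column, masked_dbs)
    # The policy is matched per relation, so each layer in a masked database wraps again.
    return f"mask_hash({inner})" if rel.db in masked_dbs else inner


base_table = Relation("private_db", "base_table")
base_view = Relation("private_db", "base_view", source=base_table)

print(masked_expr(base_view, "account_number", {"private_db"}))
# mask_hash(mask_hash(account_number))
```

In this model, the expected behavior would correspond to wrapping the column only at the outermost matching relation, or only at the base table, rather than at every layer.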
*Environment*
 - Impala: 4.4.0
 - Ranger: 2.3.0
 
 

  was:
h3. *Description*

When using Impala with Ranger for data masking, applying a {{mask_hash}} policy 
to columns in both tables and views results in the {{mask_hash}} function being 
nested multiple times. This behavior leads to redundant hashing operations. Is 
this intended behavior?
h3. *Steps to Reproduce*

*1. Apply Masking Policies:*
 * Apply a {{mask_hash}} policy to a specific column (e.g., 
{{account_number}}) across all tables in two databases, {{temp_db}} and 
{{private_db}}.

*2. Create a Base Table:*
{code:java}
CREATE TABLE private_db.base_table (
    account_number STRING,
    other_column STRING
);{code}
*3. Create a View Referencing the Base Table:*
{code:java}
CREATE VIEW private_db.base_view AS
SELECT * FROM private_db.base_table;{code}
*4. Create Another View Referencing the First View:*
{code:java}
CREATE VIEW temp_db.secondary_view AS
SELECT * FROM private_db.base_view;{code}
*5. Execute a Query on the Second View:*
{code:java}
SELECT * FROM temp_db.secondary_view;{code}
h3. *Expected Behavior*

The {{mask_hash}} function should be applied *once* to the {{account_number}} 
column, regardless of the number of view layers referencing the masked table or 
view.
----
h3. *Actual Behavior*

The {{mask_hash}} function is applied *three times* to the {{account_number}} 
column due to nested view references. This results in multiple layers of 
hashing, as observed in both the query execution plan and Ranger audit logs.

*Example Query Execution Plan:*
{code:java}
WARNING: The following tables are missing relevant table and/or column statistics.
private_db.base_table
Analyzed query: SELECT * FROM (SELECT mask_hash(account_number) account_number,
my_account_number FROM temp_db.secondary_view)

F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1
PLAN-ROOT SINK
  output exprs: *mask_hash(mask_hash(mask_hash(account_number)))*, my_account_number
  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0{code}
*Ranger Audit Logs:*

*1. temp_db.secondary_view account_number column masking*
{code:java}
{
    "access": "mask_hash",
    "resource": "temp_db/secondary_view/account_number",
    "resType": "@column",
    "reqData": "SELECT * FROM temp_db.secondary_view"
}{code}
*2. private_db.base_view account_number column masking*
{code:java}
{
    "access": "mask_hash",
    "resource": "private_db/base_view/account_number",
    "resType": "@column",
    "reqData": "SELECT * FROM temp_db.secondary_view"
}{code}
*3. private_db.base_table account_number column masking*
{code:java}
{
    "access": "mask_hash",
    "resource": "private_db/base_table/account_number",
    "resType": "@column",
    "reqData": "SELECT * FROM temp_db.secondary_view"
}{code}
*Environment*
 - Impala: 4.4.0
 - Ranger: 2.3.0
 
 

        Summary: Nested mask_hash Calls with Ranger Data Masking in Impala  
(was: Unexpected Nested mask_hash Functions When Using Views in Impala with 
Ranger)

> Nested mask_hash Calls with Ranger Data Masking in Impala
> ---------------------------------------------------------
>
>                 Key: IMPALA-13667
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13667
>             Project: IMPALA
>          Issue Type: Question
>          Components: Frontend
>            Reporter: woosuk.ro
>            Priority: Minor
>


