huangzhir commented on PR #4643:
URL: https://github.com/apache/kyuubi/pull/4643#issuecomment-1492853202

   Let me summarize how this issue came about and how Hive, Spark, and Trino 
handle it.
   
   Hive's data masking is implemented using the functions mask({col}), 
mask_show_last_n({col}, 4, 'x', 'x', 'x', -1, '1'), and 
mask_show_first_n({col}, 4, 'x', 'x', 'x', -1, '1') (see 
https://github.com/apache/ranger/blob/7f5b82bff2df72f20f5c41ba095406d354f8acf0/agents-common/src/main/resources/service-defs/ranger-servicedef-hive.json#L387).
 
   ```json
                        {
                                "itemId": 3,
                                "name": "MASK_SHOW_FIRST_4",
                                "label": "Partial mask: show first 4",
                                "description": "Show first 4 characters; replace rest with 'x'",
                                "transformer": "mask_show_first_n({col}, 4, 'x', 'x', 'x', -1, '1')"
                        }
   ```
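   However, the implementation only masks uppercase letters, lowercase 
letters, and digits. Every other character, including non-English characters, 
falls through to the default branch, and because the Ranger definition passes 
-1 (unmasked) for the "other" character, the original data is returned 
unchanged (see 
https://github.com/apache/hive/blob/7b3ecf617a6d46f48a3b6f77e0339fd4ad95a420/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMask.java#L262).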
   ```java 
         default:
           if(maskedOtherChar != UNMASKED_VAL) {
             return maskedOtherChar;
           }
           break;
       }
   ```
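
   To make the effect concrete, here is a minimal, self-contained sketch of 
the per-character logic described above. It is not the Hive source: the class 
and method names are hypothetical, and the value -1 for "leave unmasked" is 
assumed from the -1 in the Ranger transformer. It only illustrates why 
characters outside the upper/lower/digit categories survive masking.

```java
// Hypothetical sketch, not the actual GenericUDFMask implementation.
final class HiveStyleMaskSketch {
    // Assumption: -1 means "do not mask this category", as in the Ranger transformer.
    static final int UNMASKED_VAL = -1;

    // Replace one code point according to its Unicode general category.
    static int transformChar(int c, int upper, int lower, int digit, int other) {
        switch (Character.getType(c)) {
            case Character.UPPERCASE_LETTER:
                return upper != UNMASKED_VAL ? upper : c;
            case Character.LOWERCASE_LETTER:
                return lower != UNMASKED_VAL ? lower : c;
            case Character.DECIMAL_DIGIT_NUMBER:
                return digit != UNMASKED_VAL ? digit : c;
            default:
                // CJK ideographs (category OTHER_LETTER) end up here.
                return other != UNMASKED_VAL ? other : c;
        }
    }

    // Keep the first n code points and transform the rest.
    static String maskShowFirstN(String s, int n, int upper, int lower, int digit, int other) {
        int[] cps = s.codePoints().toArray();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cps.length; i++) {
            sb.appendCodePoint(i < n ? cps[i] : transformChar(cps[i], upper, lower, digit, other));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII input: everything after the first 4 characters is masked.
        System.out.println(maskShowFirstN("abcd1234EF", 4, 'x', 'x', 'x', UNMASKED_VAL));
        // Six CJK characters (written as escapes): the "other" category is unmasked,
        // so the string comes back unchanged.
        System.out.println(maskShowFirstN("\u5f20\u4e09\u674e\u56db\u738b\u4e94", 4, 'x', 'x', 'x', UNMASKED_VAL));
    }
}
```

   Passing a real replacement character instead of -1 for the "other" 
category would mask those characters as well, at the cost of also masking 
punctuation and whitespace.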
   
   As for Spark, there was a JIRA ticket 
(https://issues.apache.org/jira/browse/SPARK-23901) to add mask-related 
functions, but they were later judged not to be general-purpose enough, so the 
change was reverted. Spark currently has no implementation of mask-related 
functions.
   
   Trino does not implement these mask-related functions either. Instead, 
Ranger's Trino service definition masks data with regexp_replace, relying on 
the fact that Trino's regexp_replace accepts a lambda expression as the 
replacement (see 
https://github.com/apache/ranger/blob/a0224b2fdef999b3e23e2374080df94bf38557a4/agents-common/src/main/resources/service-defs/ranger-servicedef-trino.json#L393).
   ```json
   {
           "itemId": 2,
           "name": "MASK_SHOW_LAST_4",
           "label": "Partial mask: show last 4",
           "description": "Show last 4 characters; replace rest with 'X'",
           "transformer": "cast(regexp_replace({col}, '(.*)(.{4}$)', x -> regexp_replace(x[1], '.', 'X') || x[2]) as {type})"
   }
   ```
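
   The same idea can be sketched in plain Java to show what the lambda does: 
the outer pattern splits the value into "everything before the last 4 
characters" and "the last 4 characters", and the callback masks only the first 
group. The class and method names below are hypothetical, and Java's regex 
engine is not the one Trino uses, so treat this as an illustration of the 
approach rather than a statement about Trino's exact behavior. It requires 
Java 9 or later for Matcher.replaceAll with a function argument.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the regexp_replace-with-lambda masking approach.
final class RegexMaskSketch {
    // Group 1: everything before the last 4 characters; group 2: the last 4.
    private static final Pattern SPLIT = Pattern.compile("(.*)(.{4}$)");
    private static final Pattern ANY_CHAR = Pattern.compile(".");

    static String maskShowLast4(String s) {
        return SPLIT.matcher(s).replaceAll(m -> {
            String masked = ANY_CHAR.matcher(m.group(1)).replaceAll("X");
            // quoteReplacement keeps '$' and '\' in the data from being
            // interpreted as group references in the replacement string.
            return Matcher.quoteReplacement(masked + m.group(2));
        });
    }

    public static void main(String[] args) {
        System.out.println(maskShowLast4("1234567890"));
        // Six CJK characters written as escapes: '.' matches them like any other character.
        System.out.println(maskShowLast4("\u5f20\u4e09\u674e\u56db\u738b\u4e94"));
    }
}
```

   Because '.' matches any character regardless of its Unicode category, this 
approach masks non-English text in the same way as ASCII text. Values shorter 
than 4 characters do not match the outer pattern and are returned unchanged.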

