huangzhir commented on PR #4643:
URL: https://github.com/apache/kyuubi/pull/4643#issuecomment-1492853202
Let me summarize how this issue came about and how Hive, Spark, and Trino
handle it.
For Hive, Ranger's data masking is implemented with the UDFs mask({col}),
mask_show_last_n({col}, 4, 'x', 'x', 'x', -1, '1'), and
mask_show_first_n({col}, 4, 'x', 'x', 'x', -1, '1'), as configured in the
Ranger service definition (see
https://github.com/apache/ranger/blob/7f5b82bff2df72f20f5c41ba095406d354f8acf0/agents-common/src/main/resources/service-defs/ranger-servicedef-hive.json#L387).
```json
{
  "itemId": 3,
  "name": "MASK_SHOW_FIRST_4",
  "label": "Partial mask: show first 4",
  "description": "Show first 4 characters; replace rest with 'x'",
  "transformer": "mask_show_first_n({col}, 4, 'x', 'x', 'x', -1, '1')"
}
```
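However, the UDF implementation only distinguishes uppercase letters,
lowercase letters, and digits. Every other character, including non-English
characters such as Chinese, falls into the default branch. Because the
transformer passes -1 (UNMASKED_VAL) as the replacement for "other"
characters, that branch returns them unmasked, so the original data leaks
through (see
https://github.com/apache/hive/blob/7b3ecf617a6d46f48a3b6f77e0339fd4ad95a420/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMask.java#L262).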
```java
      default:
        if(maskedOtherChar != UNMASKED_VAL) {
          return maskedOtherChar;
        }
        break;
    }
```
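To make the effect concrete, here is a minimal standalone sketch of that
character classification. It is not the Hive code itself: the class
MaskSketch and the method maskShowFirstN are hypothetical names, and the
per-code-point loop is an assumption about how the UDF walks the string. It
only illustrates why a value made of Chinese characters comes back unchanged
when the "other" replacement is -1.

```java
// Hypothetical sketch, not the Hive implementation.
class MaskSketch {
    // Sentinel meaning "do not mask this character class" (the -1 in the transformer).
    static final int UNMASKED_VAL = -1;

    // Same switch structure as the excerpt above: only uppercase letters,
    // lowercase letters, and decimal digits have dedicated branches.
    static int transformChar(int c, int upper, int lower, int digit, int other) {
        switch (Character.getType(c)) {
            case Character.UPPERCASE_LETTER:
                if (upper != UNMASKED_VAL) return upper;
                break;
            case Character.LOWERCASE_LETTER:
                if (lower != UNMASKED_VAL) return lower;
                break;
            case Character.DECIMAL_DIGIT_NUMBER:
                if (digit != UNMASKED_VAL) return digit;
                break;
            default:
                // CJK characters are OTHER_LETTER, so they end up here.
                if (other != UNMASKED_VAL) return other;
                break;
        }
        return c; // unmasked: the original character is returned
    }

    // Keeps the first n code points and masks the rest (assumed behaviour).
    static String maskShowFirstN(String s, int n, int upper, int lower, int digit, int other) {
        int[] cps = s.codePoints().toArray();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cps.length; i++) {
            sb.appendCodePoint(i < n ? cps[i] : transformChar(cps[i], upper, lower, digit, other));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Six Chinese characters, written as escapes to avoid source-encoding issues.
        String cjk = "\u5f20\u4e09\u674e\u56db\u738b\u4e94";
        System.out.println(maskShowFirstN("abcd1234", 4, 'x', 'x', 'x', -1)); // abcdxxxx
        System.out.println(maskShowFirstN(cjk, 4, 'x', 'x', 'x', -1).equals(cjk)); // true
    }
}
```

Passing a real replacement character instead of -1 for the "other" class
would mask these characters, but it would also mask punctuation and
whitespace, since they share the same default branch.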
As for Spark, there was a JIRA ticket
(https://issues.apache.org/jira/browse/SPARK-23901) to add mask-related
functions, but it was later decided that they were not general enough, so
the change was reverted. Spark therefore does not currently provide any
mask-related functions.
Trino does not implement these mask-related functions either; its Ranger
service definition uses regexp_replace for data masking instead. Note that
this relies on Trino's regexp_replace accepting a lambda expression as the
replacement (see
https://github.com/apache/ranger/blob/a0224b2fdef999b3e23e2374080df94bf38557a4/agents-common/src/main/resources/service-defs/ranger-servicedef-trino.json#L393).
```json
{
  "itemId": 2,
  "name": "MASK_SHOW_LAST_4",
  "label": "Partial mask: show last 4",
  "description": "Show last 4 characters; replace rest with 'X'",
  "transformer": "cast(regexp_replace({col}, '(.*)(.{4}$)', x -> regexp_replace(x[1], '.', 'X') || x[2]) as {type})"
}
```
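As a rough illustration of what that transformer computes, the sketch below
reproduces the same two-group regex and per-match callback with
java.util.regex. It is an approximation only: the class RegexMaskSketch and
the method maskShowLast4 are hypothetical names, and Java's regex engine is
not the one Trino uses, so edge cases may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the Trino transformer's logic, not Trino code.
class RegexMaskSketch {
    // Group 1: everything before the last four characters; group 2: the last four.
    private static final Pattern LAST4 = Pattern.compile("(.*)(.{4}$)");

    static String maskShowLast4(String value) {
        Matcher m = LAST4.matcher(value);
        // Callback form of replaceAll (Java 9+), standing in for the SQL lambda
        // x -> regexp_replace(x[1], '.', 'X') || x[2].
        return m.replaceAll(r ->
            Matcher.quoteReplacement(r.group(1).replaceAll(".", "X") + r.group(2)));
    }

    public static void main(String[] args) {
        System.out.println(maskShowLast4("1234567890")); // XXXXXX7890
        // Six Chinese characters: the first two are masked as well.
        String cjk = "\u5f20\u4e09\u674e\u56db\u738b\u4e94";
        System.out.println(maskShowLast4(cjk).equals("XX\u674e\u56db\u738b\u4e94")); // true
    }
}
```

Because '.' matches any character regardless of its Unicode category, this
style of masking does not let non-English text through the way the
category-based switch does. Values shorter than four characters do not match
the pattern and are returned as-is.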