[jira] [Commented] (PARQUET-2223) Parquet Data Masking for Column Encryption

ASF GitHub Bot (Jira) Thu, 05 Jan 2023 03:17:06 -0800


    [ https://issues.apache.org/jira/browse/PARQUET-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654916#comment-17654916 ]


ASF GitHub Bot commented on PARQUET-2223:
-----------------------------------------

wgtmac commented on code in PR #1016:
URL: https://github.com/apache/parquet-mr/pull/1016#discussion_r1062349914


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/DataMaskingUtil.java:
##########
@@ -0,0 +1,95 @@
+/*
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing,
+ *  software distributed under the License is distributed on an
+ *  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ *  KIND, either express or implied.  See the License for the
+ *  specific language governing permissions and limitations
+ *  under the License.
+ */
+package org.apache.parquet.hadoop.util;
+
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+public class DataMaskingUtil
+{

Review Comment:
   The opening curly brace should be moved to the end of the preceding line.



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/example/GroupReadSupport.java:
##########
@@ -35,8 +37,21 @@ public org.apache.parquet.hadoop.api.ReadSupport.ReadContext init(
       Configuration configuration, Map<String, String> keyValueMetaData,
       MessageType fileSchema) {
     String partialSchemaString = configuration.get(ReadSupport.PARQUET_READ_SCHEMA);
-    MessageType requestedProjection = getSchemaForRead(fileSchema, partialSchemaString);
-    return new ReadContext(requestedProjection);
+    String removeColumns = configuration.get(DataMaskingUtil.DATA_MASKING_COLUMNS);

Review Comment:
   It seems that only the example `ReadSupport` implementation 
(`GroupReadSupport`) skips masked columns. Does this mean that other 
`ReadSupport` implementations must apply the same approach? If so, does the 
`AvroReadSupport` class need this change, too?



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/DataMaskingUtil.java:
##########
@@ -0,0 +1,95 @@
+package org.apache.parquet.hadoop.util;
+
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+public class DataMaskingUtil
+{
+  public final static String DATA_MASKING_COLUMNS = "parquet.data.masking.columns";
+  public final static String DELIMITER = ",";
+
+  public static MessageType removeColumnsInSchema(MessageType schema, String removeColumns)
+  {
+    if (removeColumns == null || removeColumns.isEmpty()) {
+      return schema;
+    }
+
+    Set<ColumnPath> paths = getColumnPaths(removeColumns);
+    List<String> currentPath = new ArrayList<>();
+    List<Type> prunedFields = removeColumnsInFields(schema.getFields(), currentPath, paths);
+    return new MessageType(schema.getName(), prunedFields);

Review Comment:
   What if all columns have been pruned and an empty schema is returned?
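
   One possible way to handle that case is to fail fast when nothing is left 
after pruning. The sketch below is only an illustration: it uses plain column 
names instead of the Parquet schema classes, and `removeMasked` is a 
hypothetical helper, not part of this PR. Whether throwing, or returning an 
empty schema, is the right behavior for parquet-mr is still an open question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EmptyProjectionGuardSketch {

  /** Removes masked top-level column names and rejects an empty result. */
  static List<String> removeMasked(List<String> columns, Set<String> masked) {
    List<String> kept = new ArrayList<>();
    for (String column : columns) {
      if (!masked.contains(column)) {
        kept.add(column);
      }
    }
    if (kept.isEmpty()) {
      // Assumption: an empty projection is treated as a configuration error.
      throw new IllegalArgumentException("All requested columns are masked: " + columns);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> columns = Arrays.asList("id", "name", "ssn");
    Set<String> masked = new HashSet<>(Arrays.asList("ssn"));
    System.out.println(removeMasked(columns, masked));
  }
}
```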



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/DataMaskingUtil.java:
##########
@@ -0,0 +1,95 @@
+package org.apache.parquet.hadoop.util;
+
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+public class DataMaskingUtil
+{
+  public final static String DATA_MASKING_COLUMNS = "parquet.data.masking.columns";

Review Comment:
   Should we update the [doc](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md#class-readsupport) to reflect the new config?



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/DataMaskingUtil.java:
##########
@@ -0,0 +1,95 @@
+package org.apache.parquet.hadoop.util;
+
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+public class DataMaskingUtil
+{
+  public final static String DATA_MASKING_COLUMNS = "parquet.data.masking.columns";
+  public final static String DELIMITER = ",";
+
+  public static MessageType removeColumnsInSchema(MessageType schema, String removeColumns)
+  {

Review Comment:
   Ditto: the opening curly brace should be moved to the end of the preceding line.



##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/example/GroupReadSupport.java:
##########
@@ -35,8 +37,21 @@ public org.apache.parquet.hadoop.api.ReadSupport.ReadContext init(
       Configuration configuration, Map<String, String> keyValueMetaData,
       MessageType fileSchema) {
     String partialSchemaString = configuration.get(ReadSupport.PARQUET_READ_SCHEMA);
-    MessageType requestedProjection = getSchemaForRead(fileSchema, partialSchemaString);
-    return new ReadContext(requestedProjection);
+    String removeColumns = configuration.get(DataMaskingUtil.DATA_MASKING_COLUMNS);
+
+    if (removeColumns == null) {
+      MessageType requestedProjection = getSchemaForRead(fileSchema, partialSchemaString);
+      return new ReadContext(requestedProjection);
+    }
+
+    if (partialSchemaString == null) {
+      return new ReadContext(DataMaskingUtil.removeColumnsInSchema(fileSchema, removeColumns));
+    } else {
+      MessageType updatedSchema = DataMaskingUtil.removeColumnsInSchema(
+        MessageTypeParser.parseMessageType(partialSchemaString), removeColumns);
+      MessageType requestedProjection = getSchemaForRead(fileSchema, updatedSchema.toString());

Review Comment:
   Why not call `MessageType requestedProjection = 
getSchemaForRead(fileSchema, updatedSchema);` directly, without the 
`toString()` round trip?





> Parquet Data Masking for Column Encryption
> ------------------------------------------
>
>                 Key: PARQUET-2223
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2223
>             Project: Parquet
>          Issue Type: Task
>            Reporter: Jiashen Zhang
>            Priority: Minor
>
> h1. Background
> h2. What is Data Masking?
> Data masking is the process of obfuscating sensitive data. Instead of 
> revealing PII data, masking allows us to return NULLs, hashes or redacted 
> data in its place. With data masking, users who are in the correct permission 
> groups can retrieve the original data and users without permissions will 
> receive masked data.
> h2. Why do we need it?
>  * Fine-Grained Access Control
> h2. Why do we want to enhance data masking?
>  
> Users might not have permissions for all columns, and the existing code does 
> not let us skip columns that users lack permission to access. This 
> enhancement adds that support, so that users can choose to skip some columns 
> and avoid decryption errors.
> h1. Design Requirements
>  # Users can skip some columns with a configuration
> h1. Proposed solution
> The key idea is to modify the requested schema by removing the skipped 
> columns from it.
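
The proposed solution can be illustrated with a small, self-contained sketch. 
It is not the code from the PR: `Field`, `parsePaths` and `prune` are 
hypothetical stand-ins for the Parquet schema classes and the 
`DataMaskingUtil` helpers, and dropping a group that ends up with no children 
is an assumption made here for illustration. Columns to skip are given as a 
comma-separated list of dotted paths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SchemaPruningSketch {

  /** Hypothetical stand-in for a schema field; a leaf has no children. */
  static final class Field {
    final String name;
    final List<Field> children;

    Field(String name, Field... children) {
      this.name = name;
      this.children = Arrays.asList(children);
    }

    @Override
    public String toString() {
      return children.isEmpty() ? name : name + children;
    }
  }

  /** Splits a comma-separated list of dotted paths, e.g. "ssn,address.zip". */
  static Set<String> parsePaths(String removeColumns) {
    Set<String> paths = new HashSet<>();
    for (String path : removeColumns.split(",")) {
      if (!path.trim().isEmpty()) {
        paths.add(path.trim());
      }
    }
    return paths;
  }

  /** Returns a copy of fields without any field whose dotted path is listed. */
  static List<Field> prune(List<Field> fields, String prefix, Set<String> paths) {
    List<Field> kept = new ArrayList<>();
    for (Field field : fields) {
      String path = prefix.isEmpty() ? field.name : prefix + "." + field.name;
      if (paths.contains(path)) {
        continue; // skipped column (or whole group): leave it out
      }
      if (field.children.isEmpty()) {
        kept.add(field);
        continue;
      }
      List<Field> prunedChildren = prune(field.children, path, paths);
      if (!prunedChildren.isEmpty()) { // assumption: drop groups left empty
        kept.add(new Field(field.name, prunedChildren.toArray(new Field[0])));
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Field> schema = Arrays.asList(
        new Field("id"),
        new Field("ssn"),
        new Field("address", new Field("street"), new Field("zip")));
    // Keeps "id" and "address.street"; removes "ssn" and "address.zip".
    System.out.println(prune(schema, "", parsePaths("ssn,address.zip")));
  }
}
```

The reader then only requests the columns that remain, so the skipped columns 
never need to be decrypted.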



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
