[
https://issues.apache.org/jira/browse/PARQUET-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654916#comment-17654916
]
ASF GitHub Bot commented on PARQUET-2223:
-----------------------------------------
wgtmac commented on code in PR #1016:
URL: https://github.com/apache/parquet-mr/pull/1016#discussion_r1062349914
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/DataMaskingUtil.java:
##########
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.util;
+
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+public class DataMaskingUtil
+{
Review Comment:
The left curly brace should be moved to the end of the above line.
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/example/GroupReadSupport.java:
##########
@@ -35,8 +37,21 @@ public org.apache.parquet.hadoop.api.ReadSupport.ReadContext init(
Configuration configuration, Map<String, String> keyValueMetaData,
MessageType fileSchema) {
String partialSchemaString = configuration.get(ReadSupport.PARQUET_READ_SCHEMA);
- MessageType requestedProjection = getSchemaForRead(fileSchema, partialSchemaString);
- return new ReadContext(requestedProjection);
+ String removeColumns = configuration.get(DataMaskingUtil.DATA_MASKING_COLUMNS);
Review Comment:
It seems that only the example `ReadSupport` has enabled skipping masked
columns. Does it mean that other ReadSupport implementations are required to
apply the same approach? If yes, does the `AvroReadSupport` class require the
change, too?
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/DataMaskingUtil.java:
##########
@@ -0,0 +1,95 @@
+package org.apache.parquet.hadoop.util;
+
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+public class DataMaskingUtil
+{
+ public final static String DATA_MASKING_COLUMNS = "parquet.data.masking.columns";
+ public final static String DELIMITER = ",";
+
+ public static MessageType removeColumnsInSchema(MessageType schema, String removeColumns)
+ {
+   if (removeColumns == null || removeColumns.isEmpty()) {
+     return schema;
+   }
+
+   Set<ColumnPath> paths = getColumnPaths(removeColumns);
+   List<String> currentPath = new ArrayList<>();
+   List<Type> prunedFields = removeColumnsInFields(schema.getFields(), currentPath, paths);
+   return new MessageType(schema.getName(), prunedFields);
Review Comment:
What if all columns have been pruned and an empty schema is returned?
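To make the concern concrete, here is a minimal, self-contained sketch of the pruning idea with one possible answer to the empty-schema case. It is not the PR's code: `Field`, `SchemaPruningSketch`, `prune`, and `pruneOrFail` are hypothetical stand-ins for the parquet-mr types and helpers, column paths are assumed to be dot-separated strings, and failing fast is only one policy the author might choose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a schema node; not a parquet-mr class. */
final class Field {
  final String name;
  final List<Field> children; // empty for a primitive (leaf) column

  Field(String name, Field... children) {
    this.name = name;
    this.children = Arrays.asList(children);
  }

  boolean isGroup() {
    return !children.isEmpty();
  }
}

final class SchemaPruningSketch {
  /** Returns the fields that survive after dropping every dotted path in {@code masked}. */
  static List<Field> prune(List<Field> fields, String parentPath, Set<String> masked) {
    List<Field> kept = new ArrayList<>();
    for (Field field : fields) {
      String path = parentPath.isEmpty() ? field.name : parentPath + "." + field.name;
      if (masked.contains(path)) {
        continue; // masked column (or whole masked group): drop it
      }
      if (!field.isGroup()) {
        kept.add(field);
        continue;
      }
      List<Field> keptChildren = prune(field.children, path, masked);
      if (!keptChildren.isEmpty()) { // a group with no surviving children is dropped too
        kept.add(new Field(field.name, keptChildren.toArray(new Field[0])));
      }
    }
    return kept;
  }

  /** Assumed policy: fail fast rather than hand back a schema with no columns. */
  static List<Field> pruneOrFail(List<Field> fields, Set<String> masked) {
    List<Field> kept = prune(fields, "", masked);
    if (kept.isEmpty()) {
      throw new IllegalStateException("All requested columns are masked; nothing left to read");
    }
    return kept;
  }
}
```

With a schema of `id` plus a group `user { name, ssn }`, masking `user.ssn` leaves `id` and `user { name }`, while masking both `id` and `user` hits the guard. Whether the real code should throw, return an empty schema, or fall back to something else is the open question here.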
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/DataMaskingUtil.java:
##########
@@ -0,0 +1,95 @@
+package org.apache.parquet.hadoop.util;
+
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+public class DataMaskingUtil
+{
+ public final static String DATA_MASKING_COLUMNS = "parquet.data.masking.columns";
Review Comment:
Should we update the
[doc](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md#class-readsupport)
to reflect the new config?
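If a doc entry is added, it would help to spell out the value format. The sketch below shows one way the value could be interpreted, under stated assumptions: the key and the `,` delimiter come from this PR, but `MaskingConfigSketch` and `parseColumnPaths` are hypothetical names, and the dot-separated nesting, whitespace trimming, and skipping of empty entries are my guesses rather than confirmed behavior of `getColumnPaths`.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class MaskingConfigSketch {
  // Key and delimiter as they appear in the PR.
  static final String DATA_MASKING_COLUMNS = "parquet.data.masking.columns";
  static final String DELIMITER = ",";

  /** Splits a configured value into column paths, one list of name segments per column. */
  static Set<List<String>> parseColumnPaths(String value) {
    Set<List<String>> paths = new LinkedHashSet<>();
    if (value == null) {
      return paths; // property not set: nothing to mask
    }
    for (String column : value.split(DELIMITER)) {
      String trimmed = column.trim(); // assumption: surrounding whitespace is ignored
      if (!trimmed.isEmpty()) {       // assumption: empty entries are skipped
        paths.add(Arrays.asList(trimmed.split("\\."))); // assumption: dots separate nesting levels
      }
    }
    return paths;
  }
}
```

Under these assumptions, a value of `id, user.ssn` yields two paths, `[id]` and `[user, ssn]`. The doc entry should state whichever rules the actual implementation follows.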
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/DataMaskingUtil.java:
##########
@@ -0,0 +1,95 @@
+package org.apache.parquet.hadoop.util;
+
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+public class DataMaskingUtil
+{
+ public final static String DATA_MASKING_COLUMNS = "parquet.data.masking.columns";
+ public final static String DELIMITER = ",";
+
+ public static MessageType removeColumnsInSchema(MessageType schema, String removeColumns)
+ {
Review Comment:
Ditto: the opening curly brace should be moved to the end of the line above.
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/example/GroupReadSupport.java:
##########
@@ -35,8 +37,21 @@ public org.apache.parquet.hadoop.api.ReadSupport.ReadContext init(
Configuration configuration, Map<String, String> keyValueMetaData,
MessageType fileSchema) {
String partialSchemaString = configuration.get(ReadSupport.PARQUET_READ_SCHEMA);
- MessageType requestedProjection = getSchemaForRead(fileSchema, partialSchemaString);
- return new ReadContext(requestedProjection);
+ String removeColumns = configuration.get(DataMaskingUtil.DATA_MASKING_COLUMNS);
+
+ if (removeColumns == null) {
+   MessageType requestedProjection = getSchemaForRead(fileSchema, partialSchemaString);
+   return new ReadContext(requestedProjection);
+ }
+
+ if (partialSchemaString == null) {
+   return new ReadContext(DataMaskingUtil.removeColumnsInSchema(fileSchema, removeColumns));
+ } else {
+   MessageType updatedSchema = DataMaskingUtil.removeColumnsInSchema(
+     MessageTypeParser.parseMessageType(partialSchemaString), removeColumns);
+   MessageType requestedProjection = getSchemaForRead(fileSchema, updatedSchema.toString());
Review Comment:
Why not call `MessageType requestedProjection =
getSchemaForRead(fileSchema, updatedSchema);` directly? That would avoid
converting the schema to a string only to have it parsed again.
> Parquet Data Masking for Column Encryption
> ------------------------------------------
>
> Key: PARQUET-2223
> URL: https://issues.apache.org/jira/browse/PARQUET-2223
> Project: Parquet
> Issue Type: Task
> Reporter: Jiashen Zhang
> Priority: Minor
>
> h1. Background
> h2. What is Data Masking?
> Data masking is the process of obfuscating sensitive data. Instead of
> revealing PII data, masking allows us to return NULLs, hashes or redacted
> data in its place. With data masking, users who are in the correct permission
> groups can retrieve the original data and users without permissions will
> receive masked data.
> h2. Why do we need it?
> * Fine-Grained Access Control
> h2. Why do we want to enhance data masking?
>
> Users might not have permission to read every column, and the existing code
> offers no way to skip the columns they cannot access. This enhancement adds
> that support, so users can choose to skip such columns and avoid decryption
> errors.
> h1. Design Requirements
> # Users can skip some columns with a configuration
> h1. Proposed solution
> The key idea is to modify the requested schema by removing the skipped
> columns from it.