[jira] [Updated] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2021-07-01 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-1792:
-
Fix Version/s: 1.12.0

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.12.0
> Reporter: Xinli Shang
> Assignee: Xinli Shang
> Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being pruned
> (PARQUET-1791). We need a tool that replaces the raw data columns with
> masked values. The masked value could be a hash, null, a redacted value,
> etc. Unchanged columns should be moved as a whole, as the 'merge' and
> 'prune' commands in parquet-tools do.
>
> Implementing this feature at the file-format level is 10X faster than
> rewriting the table data in the query engine.
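
The masking modes named in the description (hash, null, redact) can be illustrated with a small sketch. This is not the parquet-mr implementation: the names `mask_value` and `mask_columns`, the mode strings, the use of SHA-256, and the `***` placeholder are all assumptions made for illustration. It also works row by row on plain dictionaries, whereas the proposed tool would move unchanged columns as a whole.

```python
import hashlib
from typing import Any, Dict, List, Optional

REDACTED = "***"  # hypothetical placeholder for the "redact" mode


def mask_value(value: Any, mode: str) -> Optional[str]:
    """Return the masked form of a single value for the given mode."""
    if mode == "null":
        return None  # drop the value entirely
    if value is None:
        return None  # nothing to mask
    if mode == "hash":
        # SHA-256 is an assumption; the issue only says "hash".
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    if mode == "redact":
        return REDACTED
    raise ValueError(f"unknown mask mode: {mode}")


def mask_columns(rows: List[Dict[str, Any]], modes: Dict[str, str]) -> List[Dict[str, Any]]:
    """Mask the columns listed in `modes`; copy every other column unchanged."""
    return [
        {col: mask_value(val, modes[col]) if col in modes else val
         for col, val in row.items()}
        for row in rows
    ]


rows = [{"id": 1, "email": "a@example.com", "ssn": "123-45-6789"}]
masked = mask_columns(rows, {"email": "hash", "ssn": "null"})
print(masked[0]["id"], masked[0]["ssn"])  # 1 None
```

Columns that are not listed in `modes` (here `id`) pass through untouched, which mirrors the requirement that unchanged columns are not rewritten.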



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2021-07-01 Thread Sudhakar J Pyndi (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373000#comment-17373000
 ] 

Sudhakar J Pyndi commented on PARQUET-1792:
---

Could you please confirm the target version for this feature? Thank you.



[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2021-07-01 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372862#comment-17372862
 ] 

Xinli Shang commented on PARQUET-1681:
--

We chose to revert to the 1.8.1 behavior. It has run fine for a year or so.
We will port the changes soon.

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -
>
> Key: PARQUET-1681
> URL: https://issues.apache.org/jira/browse/PARQUET-1681
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-avro
> Affects Versions: 1.10.0, 1.9.1, 1.11.0
> Reporter: Xinli Shang
> Assignee: Xinli Shang
> Priority: Critical
>
> When the Avro schema below is used to write a Parquet (1.8.1) file and the
> file is then read back with Parquet 1.10.1 without passing any schema, the
> read throws the exception "XXX is not a group". Reading with Parquet 1.8.1
> works fine.
> {
>   "name": "phones",
>   "type": [
>     "null",
>     {
>       "type": "array",
>       "items": {
>         "type": "record",
>         "name": "phones_items",
>         "fields": [
>           {
>             "name": "phone_number",
>             "type": ["null", "string"],
>             "default": null
>           }
>         ]
>       }
>     }
>   ],
>   "default": null
> }
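> The code used to read the file is as follows:
> import org.apache.hadoop.conf.Configuration
> import org.apache.parquet.avro.AvroParquetReader
>
> val reader = AvroParquetReader.builder[SomeRecordType](parquetPath)
>   .withConf(new Configuration)
>   .build()
> reader.read()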
> PARQUET-651 changed the method isElementType() to rely on Avro's
> checkReaderWriterCompatibility() for the compatibility check. However,
> checkReaderWriterCompatibility() considers the Parquet schema and the Avro
> schema (converted from the file schema) incompatible, because the name is
> 'phones_items' in the Avro schema but 'array' in the Parquet schema. As a
> result, isElementType() returns false, and the "phone_number" field in the
> above schema is treated as a group type, which it is not. The exception is
> then thrown at .asGroupType().
> I have not tried whether writing with Parquet 1.10.1 reproduces the same
> problem. It could, because the translation of the Avro schema to the Parquet
> schema has not changed (not verified yet).
> I hesitate to revert PARQUET-651 because it solved several problems. I would
> like to hear the community's thoughts on it.
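
The name mismatch described above can be modelled in a few lines. This is a deliberately simplified stand-in for the record-name check, not Avro's checkReaderWriterCompatibility(): the function `records_compatible` and the dict-based schemas are hypothetical, and the rule that a differing record name alone makes two record schemas incompatible is taken from the issue description rather than from the Avro source.

```python
def records_compatible(reader: dict, writer: dict) -> bool:
    """Simplified model: records match only if their names are equal
    and every reader field also exists in the writer."""
    if reader.get("type") != "record" or writer.get("type") != "record":
        return False
    if reader["name"] != writer["name"]:
        return False  # the case hit in this issue
    writer_fields = {f["name"] for f in writer["fields"]}
    return all(f["name"] in writer_fields for f in reader["fields"])


phone_field = {"name": "phone_number", "type": ["null", "string"], "default": None}

# Record name on the Avro side, per the issue description.
avro_side = {"type": "record", "name": "phones_items", "fields": [phone_field]}
# Record name on the Parquet side, per the issue description.
parquet_side = {"type": "record", "name": "array", "fields": [phone_field]}

print(records_compatible(avro_side, parquet_side))  # False
print(records_compatible(avro_side, avro_side))     # True
```

The two records carry identical fields, so under this model the only reason for the incompatible result is the differing name, which is the situation the description attributes to isElementType() returning false.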


