[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

GitBox Fri, 31 Jul 2020 13:33:44 -0700


shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-667344642



   > I think, directly using `ExtType` is not really user friendly. As the 
"metadata" for a type is currently used only for encryption I would not expose 
it directly to the user. There are a couple of ways for creating a parquet 
schema (e.g. converting from an avro schema, using the type builders, parsing 
it from a string etc.). So the user needs an easy way to mark the columns to be 
encrypted and set the related encryption key.
   > I would suggest having a utility method that allows to set only the 
required data for an addressed field in a schema. For example: 
`XXXUtility.encryptColumn(MessageType schema, ColumnPath column, String 
encKey)`. The logic behind this utility can be hidden from the user.
   > 
   > I don't really like the decorator pattern. First, it requires a lot of 
boilerplate code to have all the methods being delegated. Second, if the super 
class is modified by having a new method it will not generate any 
errors/warnings if the decorator would not be updated but may cause serious 
issues at runtime.
   > Why do you not want to extend `Type` (or more `PrimitiveType`) directly? 
If you not want to bother the users with this metadata that is not tightly 
connected to the schema you might define it _package private_ and let only the 
mentioned utility reach it.
   > BTW, do we need this _metadata_ for the primitive types or it is enough to 
have a specific object that contains only the required values for encryption?
   
   Thanks for the comments! The idea is to piggyback the existing 
WriteSupport's schema conversion from other type’s schema e,g avro to Parquet 
schema to pass down the crypto settings. By doing this way, when we build  
FileEncryptionProperties object, we don't need RPC calls anymore to get it 
somewhere.  Here is an example of Apache Hudi's WriteSupport example 
https://github.com/shangxinli/parquet-write-supports/blob/master/src/main/java/com/uber/hoodie/avro/CryptoHoodieAvroWriteSupport.java.
  It extends HoodieAvroWriteSupport to translate the crypto settings from avro 
schema's metadata to Parquet schema’s metadata. The CryptoPropertiesFactory 
objects can use it like this 
https://github.com/apache/parquet-mr/pull/808/commits/9adb75a2356147a204d76c82eb39f43e9ee72b58#diff-886dda84b09a5aaf0bf7c1e718c1be02R61
   
   We have built the extended WriteSupport classes for each system like Hudi, 
Spark, Hive into a library along with the CryptoPropertiesFactory 
implementation classes into a library. That library is added at cluster-wide 
and all the jobs can have it without developers changing their code. The table 
owner only need to change schema settings in metastore like HMS or other 
centralized metastore by tagging a column is to be encrypted. For a company 
that has a lot of existing tables to be onboarded to this feature, it would be 
much easier. We also got an agreement with ORC on defining the format of the 
encryption settings https://issues.apache.org/jira/browse/HIVE-21848. 
   
   What we need from the Parquet Schema is a place to keep the metadata. A 
schema definition usually has a place to do so like Avro schema, Spark 
Schema(StructField), but I don't find Parquet schema has it. So it could be 
used for other purposes. 
   
   I agreed with you on the opinion of the decoration pattern. But I think 
adding metadata to the class Type directly is risky. I am afraid of breaking 
existing functionality as it is used so widely. Your concern of changing the 
class Type and forgetting to change ExtType class is valid, but for this 
particular case, we didn’t just add or reference existing methods without 
changing. If the methods of Type is changed, ExtType shouldn’t need to be 
changed. If there is adding or deletion of methods in Type, there will be build 
error so that people know to fix it. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [parquet-mr] shangxinli commented on pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

Reply via email to