shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-670094258


   Hi @ggershinsky, thanks for your comments!
   
   I think the current difference is in how the crypto settings are transported from 
the source to the destination (the EncryptionPropertiesFactory implementation). What 
you suggest is via 'Configuration', while our approach is hybrid: 'Configuration' for 
global settings and 'Schema' for db/table/column settings. The concern with using 
'Configuration' for all the settings is that it is global, so we would need a 
well-designed namespace and protocol (string format) to locate a db/table/column 
(possibly nested), along with other table-wide settings, as map keys. That requires 
a lot of string manipulation and is error-prone. The hybrid solution is more compact 
because the metadata is attached to the table/column as a structured field. Consider 
the case where I want to encrypt the footer for table1 but not for table2, and use 
AES-GCM for table3 but AES-CTR for table4. With 'Configuration' this is doable, but 
look at how complex the string format becomes as the protocol to convey these 
settings. And sometimes a job needs to deal with different metastores, which would 
require a prefix before 'db'. 
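   To make the contrast concrete, here is a sketch of what a pure-Configuration 
encoding of these per-table settings might look like. The key names and layout are 
hypothetical (not an existing parquet-mr convention), and a plain HashMap stands in 
for Hadoop's Configuration:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigKeySketch {
    public static void main(String[] args) {
        // Hypothetical flat key format: every setting must encode its full
        // path (metastore prefix, db, table, optional nested column) in the key.
        Map<String, String> conf = new HashMap<>();
        conf.put("parquet.crypto.hms1.db1.table1.footer.encrypt", "true");
        conf.put("parquet.crypto.hms1.db1.table2.footer.encrypt", "false");
        conf.put("parquet.crypto.hms1.db1.table3.algorithm", "AES_GCM_V1");
        conf.put("parquet.crypto.hms1.db1.table4.algorithm", "AES_GCM_CTR_V1");
        // Nested column names containing dots force an escaping protocol
        // on top of the key format ("addr\.zip" here).
        conf.put("parquet.crypto.hms1.db1.table3.columns.addr\\.zip.keyid", "k1");

        // The consuming side has to split keys back apart with string
        // operations to recover the db/table/column they refer to.
        String key = "parquet.crypto.hms1.db1.table3.algorithm";
        String[] parts = key.split("\\.");
        System.out.println(parts[4] + " -> " + conf.get(key));
    }
}
```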
   
   Regarding changes to existing code, I understand that if we set Configuration, 
no code after that point needs to change. But somewhere the settings have to be put 
into Configuration, and that will mean code changes in some cases. Consider the case 
where somebody (who might not be an engineer) marks a table/column to be encrypted 
in HMS or another centralized metastore. That setting then needs to be converted 
into Configuration settings, so a code change is unavoidable, correct? We prefer 
those changes to live in a plugin owned by a dedicated team instead of in the many 
feature teams' codebases. That is why we extend ParquetWriteSupport and build it as 
a plugin. 
   
   There is an option where, in the extended ParquetWriteSupport, we put the 
settings into Configuration instead of into the Parquet schema. As mentioned above, 
we think schema settings in this case are more compact and less error-prone than 
Configuration settings. 
   
   In my opinion, your solution is best suited to the case where users create a 
table and want some columns to be encrypted; they only need to set that in 
Configuration. But it is difficult to use if you already have a lot of tables that 
are unencrypted today and, due to some initiative, need to be encrypted. Somebody 
(not necessarily the table owner; it could be the security team, the legal team, 
etc.) changes the settings in the metastore for a table/column, and the framework 
just starts encrypting that table/column. That is why we call it schema-controlled: 
when the schema defined in the metastore changes, it controls the encryption of the 
downstream.  
     
   I slightly disagree with your statement that it is not schema-activated. I think 
it depends on the definition of "schema". In our definition, the settings for a 
table or a column, including name, type, nullability, and crypto settings, are all 
part of the schema. Of course, you can define it another way, but I think the name 
or definition is not important; we should focus on whether or not it makes sense 
for users to use.       
   
   Essentially, this change mainly introduces a map field in the Parquet schema, 
while all other changes are in tests (or the plugin). As mentioned earlier, if we 
look at the schema definitions in other systems like Avro and Spark, they all have 
associated metadata. I think adding that metadata field to the Parquet schema, to 
align with other systems' schemas, is the right thing to do.  
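   For comparison, here is a minimal sketch of the shape such a field-level metadata 
map takes. This is a hypothetical standalone class, not the actual parquet-mr API 
being proposed; it only illustrates how crypto settings can travel with the column 
(as Avro field props or Spark StructField metadata do) instead of in a global 
Configuration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldMetadataSketch {
    // Hypothetical schema field carrying a metadata map alongside name/type.
    static final class Field {
        final String name;
        final String type;
        final Map<String, String> metadata = new LinkedHashMap<>();
        Field(String name, String type) { this.name = name; this.type = type; }
    }

    public static void main(String[] args) {
        Field ssn = new Field("ssn", "binary");
        // A plugin (e.g. an EncryptionPropertiesFactory implementation) can
        // read these entries directly from the field it is processing,
        // with no key-string protocol needed to locate the column.
        ssn.metadata.put("encrypt", "true");
        ssn.metadata.put("keyid", "pii_key_1");
        System.out.println(ssn.name + " encrypted=" + ssn.metadata.get("encrypt"));
    }
}
```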
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
