shangxinli commented on pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#issuecomment-670094258
Hi @ggershinsky, Thanks for your comments!
I think the current difference is how the crypto settings are transported from
the source to the destination (the EncryptionPropertiesFactory implementation).
What you suggest is via 'Configuration', while our approach is hybrid:
'Configuration' for global settings and 'Schema' for db/table/column settings.
The concern with using 'Configuration' for all the settings is that it is
global, so we would need a well-designed namespace and protocol (string format)
to locate the db/table/column (possibly nested), along with other table-wide
settings, as the map key. That requires a lot of string operations and is
error-prone. The hybrid solution is more compact because the metadata is
attached to the table/column as a structured field. Consider encrypting the
footer for table1 but not for table2, and using AES-GCM for table3 but AES-CTR
for table4. With 'Configuration' this is doable, but look at how complex the
string format conveying these settings would have to be. And sometimes a job
needs to deal with different metastores, which would require yet another prefix
before 'db'.
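To make the contrast concrete, here is a minimal sketch of what a pure-Configuration approach might look like. A plain `Map` stands in for Hadoop's `Configuration`, and every key string below ('parquet.crypto.hms1.db1...') is invented for illustration, not a real or proposed protocol:

```java
import java.util.HashMap;
import java.util.Map;

// A plain Map stands in for Hadoop's Configuration; every key below is an
// invented string format, shown only to illustrate the namespace problem.
public class ConfigKeySketch {

    public static Map<String, String> buildConf() {
        Map<String, String> conf = new HashMap<>();
        // Per-table footer settings: metastore, db, and table all encoded in the key.
        conf.put("parquet.crypto.hms1.db1.table1.footer.encrypt", "true");
        conf.put("parquet.crypto.hms1.db1.table2.footer.encrypt", "false");
        // Per-table algorithm choice, again by string convention.
        conf.put("parquet.crypto.hms1.db1.table3.algorithm", "AES_GCM_V1");
        conf.put("parquet.crypto.hms1.db1.table4.algorithm", "AES_GCM_CTR_V1");
        // A nested column pushes even more structure into the key string.
        conf.put("parquet.crypto.hms1.db1.table1.column.address.street.key_id", "k1");
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = buildConf();
        // Every reader must rebuild exactly the same string; a typo fails silently.
        System.out.println(conf.get("parquet.crypto.hms1.db1.table3.algorithm"));
    }
}
```

The fragility is that both writers and readers must agree on this ad-hoc string grammar, and a malformed key simply returns null instead of failing loudly.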
Regarding changes to existing code: I understand that if we set the
Configuration, no code after that point needs to change. But something still
has to put the settings into the Configuration, and in some cases that is
itself a code change. Suppose somebody (who might not be an engineer) marks a
table/column as encrypted in HMS or another centralized metastore. That
setting then needs to be converted into Configuration settings, so a code
change is unavoidable, correct? We prefer those changes to live in a plugin
owned by a dedicated team instead of in the many feature teams' codebases.
That is why we extend ParquetWriteSupport and build it as a plugin.
There is an option where, in the extended ParquetWriteSupport, we put the
settings into the Configuration instead of the Parquet schema. As mentioned
above, we think schema settings are more compact and less error-prone in this
case than Configuration settings.
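By contrast, the hybrid approach keeps the crypto settings next to the column they describe. The sketch below is illustrative only: the `Column` class and the 'encrypted'/'key_id' metadata keys are invented stand-ins, in the spirit of the metadata map this PR proposes, not actual parquet-mr API:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for a schema element carrying a metadata map; the
// class name and metadata keys are invented for this sketch.
public class SchemaMetadataSketch {

    static final class Column {
        final String name;
        final Map<String, String> metadata;
        Column(String name, Map<String, String> metadata) {
            this.name = name;
            this.metadata = Collections.unmodifiableMap(metadata);
        }
    }

    // Crypto settings travel with the column itself, so no global key
    // namespace or string-parsing protocol is needed to find them.
    public static Column encryptedColumn(String name, String keyId) {
        Map<String, String> meta = new HashMap<>();
        meta.put("encrypted", "true");
        meta.put("key_id", keyId);
        return new Column(name, meta);
    }

    public static void main(String[] args) {
        Column ssn = encryptedColumn("ssn", "k1");
        System.out.println(ssn.name + " encrypted=" + ssn.metadata.get("encrypted"));
    }
}
```

An EncryptionPropertiesFactory implementation can then read the settings structurally from the schema it is handed, instead of reassembling key strings.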
In my opinion, your solution works best when users create a table and want
some columns encrypted from the start; they only need to set it in the
Configuration. But it is difficult to use if you already have many tables that
are unencrypted today and, due to some initiative, now need to be encrypted.
Somebody (who might not be the table owner; it could be the security team, the
legal team, etc.) changes the settings in the metastore for a table/column,
and the framework simply starts encrypting that table/column. That is why we
call it schema-controlled: when the schema defined in the metastore changes,
it controls the encryption downstream.
I slightly disagree with your statement that it is not schema-activated. I
think it depends on the definition of 'schema'. In our definition, the
settings for a table or column, including name, type, nullability, and crypto
settings, are all part of the schema. Of course, you can define it another
way, but I think the name is not the important part; we should focus on
whether it makes sense for users.
Essentially, this change mainly introduces a map field in the Parquet schema,
while all other changes are in tests (or in the plugin). As mentioned earlier,
if we look at the schema definitions in other systems such as Avro and Spark,
they all have associated metadata. Adding that metadata field to the Parquet
schema aligns it with those systems' schemas, which I think is the right thing
to do.
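For comparison, an Avro schema can already carry extra attributes on a field alongside its name and type; Avro preserves unrecognized properties like these. The 'encrypted' and 'key_id' property names below are our own convention, not Avro built-ins:

```json
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "ssn",  "type": "string",
     "encrypted": "true", "key_id": "k1"}
  ]
}
```

Spark is similar: every StructField carries a Metadata object that downstream code can inspect. The map field proposed here plays the same role for Parquet.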
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]