[ https://issues.apache.org/jira/browse/HADOOP-17855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403227#comment-17403227 ]

Steve Loughran commented on HADOOP-17855:
-----------------------------------------

bq. Do you know if there is any precedent in the project for loading extra 
configuration files? I'm asking because I think that having the path-to-key 
mappings in a file would be the most adequate strategy: you can easily have 
1000 different keys/partitions x 100 tables, and that would be hard to fit 
in a core-site.xml.

oh, there's a lot of things which load config files. Be aware that adding new 
default resources (as YarnConfiguration and HdfsConfiguration construction 
triggers) is troublesome: you (a) shouldn't do this and (b) should load the 
new configs with loadDefaults = false. Oh, and there's bonus config fun: the 
encryption secrets can all be provided by hadoop credential providers, from 
JCEKS files to Hadoop KMS services. We'd want that too, so that all encryption 
secrets could optionally be hidden. More of an issue if SSE-C were to be 
enabled.
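To illustrate (a minimal sketch, nothing from an actual patch; the file path 
and property name are made up): load the mapping file as a standalone 
Configuration with loadDefaults = false, and resolve the key through 
getPassword() so it can come from a credential provider:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class EncryptionMappingLoader {
  // Minimal sketch: file path and property name are illustrative only.
  static char[] loadKey() throws IOException {
    Configuration mappings = new Configuration(false); // loadDefaults = false
    mappings.addResource(new Path("/etc/hadoop/conf/s3a-encryption-mappings.xml"));
    // getPassword() consults hadoop.security.credential.provider.path
    // (JCEKS files, Hadoop KMS, ...) before falling back to the XML entry,
    // so the secret need not appear in plain text in the mapping file.
    return mappings.getPassword("fs.s3a.encryption.key");
  }
}
{code}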

bq. > when we do things with directories, we often create markers in parent 
dirs. This complicates life as we'd have to choose which to use there too

bq. My understanding is that markers contain only metadata. In my biased 
opinion, users won't care much about the encryption settings on it.

well, we need to make sure that they aren't encrypted then.
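Concretely, something like this (a sketch against the v1 SDK, with made-up 
bucket and path names): the marker PUT would skip any per-path override, so 
the marker stays readable without the partition key.

{code:java}
import java.io.ByteArrayInputStream;

import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class MarkerPutSketch {
  // Sketch: create the empty directory marker with the bucket default
  // encryption only, so no per-path key is ever needed to read it.
  static PutObjectRequest markerPut(String bucket, String dirKey) {
    ObjectMetadata md = new ObjectMetadata();
    md.setContentLength(0);
    // deliberately no withSSECustomerKey()/withSSEAwsKeyManagementParams()
    return new PutObjectRequest(bucket, dirKey + "/",
        new ByteArrayInputStream(new byte[0]), md);
  }
}
{code}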

bq. > S3A Delegation tokens pass down all encryption settings so that you can 
submit work into a shared cluster where all encryption options including your 
secrets come with the job. This will need to be extended.

bq. > this'd be left completely out of the delegation token info passed into 
the cluster. Up to the cluster deployer to deal with this. The default 
encryption settings would be passed in this way.

bq. Does S3A Delegation token control which encryption settings to use? It 
seems to me it should be concerned only about the authentication to S3.

yes. And it is there so you can have a cluster in EC2 with general access to 
some shared lib dirs *and no knowledge of job-specific buckets*; when you 
submit a job, your client-side settings (encryption, secrets) come with the 
job.

I think "clusters with encryption plugins defined locally/for buckets"" would 
be a special case here; the DT would somehow indicate this was required (e.g. 
new enum) and the S3A fs client at the far end would fail if the binding class 
wasn't declared/found in its core-site config (i.e. there'd be no attempt to 
propagate the classname &c of the extension)
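Roughly like this (hypothetical names, just to show the shape; the property 
name is invented): only a requirement marker travels in the token identifier, 
never the extension classname, and the far end fails fast:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

public class EncryptionBindingCheck {
  enum BindingRequirement { NONE, CLUSTER_DEFINED }

  // Sketch: called when the S3A client unmarshalls the DT.
  static void verify(BindingRequirement required, Configuration conf)
      throws IOException {
    // "fs.s3a.encryption.binding.class" is an illustrative property name.
    if (required == BindingRequirement.CLUSTER_DEFINED
        && conf.getClass("fs.s3a.encryption.binding.class", null) == null) {
      throw new IOException(
          "DT requires a locally declared encryption binding; none found");
    }
  }
}
{code}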


bq. > would you support different SSE options (SSE-C vs SSE-KMS)? SSE-KMS is 
the only sensible option, really.

bq. I'm more interested in SSE-KMS but I do see value in supporting SSE-C as 
well, for the same reasons. Users might want to use a tenant-generated key to 
encrypt paths in a table partitioned by tenant, for example.

SSE-C is really antisocial in a bucket (and painful for s3a) because with the 
other SSE options you can always decrypt data (including file markers) without 
supplying the key in the request; with SSE-C, every read needs to know it.


That said, we may realise in future that it is really important/useful to 
support it. It depends on the cost and complexity: SSE-C means that the 
binding would need to be active by the time the first GET request is 
constructed.
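To illustrate against the v1 AWS SDK shipped with 3.3.x (bucket and object 
names made up):

{code:java}
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.SSECustomerKey;

public class SseCReadSketch {
  // With SSE-KMS a plain GET works: S3 decrypts server-side using the
  // caller's KMS permissions. With SSE-C the client must attach the key
  // to every read, so the binding must be live before the first GET.
  static GetObjectRequest sseCGet(String base64Key) {
    // without the key, reading an SSE-C object fails with 400 Bad Request
    return new GetObjectRequest("bucket", "my_table/part-00000")
        .withSSECustomerKey(new SSECustomerKey(base64Key));
  }
}
{code}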


bq. > we could have some plugin point which returned the encryption settings 
for each path being written to, would be used when creating a request (i.e in 
RequestFactoryImpl) to choose settings in PUT/initiate MPU, copy. There's some 
complexity there related to TransferManager though... copy is going to be 
trouble.

bq. Exactly, this plugin point can be the single point that resolves all the 
logic regarding encryption configurations (per path, per bucket, global, etc.) 
and returns the correct settings to be used.
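One possible shape for that plugin point, purely illustrative rather than a 
committed API:

{code:java}
// Entirely illustrative: one call per destination object, invoked from
// the request factory when building PUT / multipart-initiate / copy.
public interface EncryptionSettingsResolver {

  /** Algorithm plus key reference for one object. */
  final class EncryptionSettings {
    final String algorithm;   // e.g. "SSE-KMS"
    final String keyArn;      // null => fall back to bucket/global default
    EncryptionSettings(String algorithm, String keyArn) {
      this.algorithm = algorithm;
      this.keyArn = keyArn;
    }
  }

  /**
   * Resolve the settings for a destination, applying per-path,
   * then per-bucket, then global configuration.
   */
  EncryptionSettings resolve(String bucket, String objectKey);
}
{code}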

bq. > It'd be (another) hadoop AbstractService created during initialize(), but 
we'd make its serviceStart() operation async, so anything it does (load a 
config file, bind to some service) wouldn't block normal initialization...the 
config is only needed on the first write call

bq. Not super familiar with initialization details, but I think it will be 
easier to just load all you need and fail quickly if there is a config 
problem, instead of waiting for the first call to do so.


Issues there:

# reading it in, especially from another bucket, will take time and slow down 
all IO, *even when just reading data*
# the FileSystem.initialize() process can be a serialization bottleneck in 
FileSystem.get(); it causes performance problems when many threads (Hive, 
Spark) all try to create the same FS instance
# it's not needed for work which just reads a bucket (assuming SSE-C isn't 
supported)

We've seen major speedups in Hive just by eliminating the single HEAD request 
to probe for bucket existence; it's now done implicitly when requests are 
made. Yes, it's not so immediate, *but it's faster*.



> S3A: Allow SSE configurations per object path
> ---------------------------------------------
>
>                 Key: HADOOP-17855
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17855
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.3.1
>            Reporter: Mike Dias
>            Priority: Major
>
> Currently, we can map the SSE configurations at bucket level only:
> {code:java}
> <property>
>   <name>fs.s3a.bucket.ireland-dev.server-side-encryption-algorithm</name>
>   <value>SSE-KMS</value>
> </property>
> <property>
>   <name>fs.s3a.bucket.ireland-dev.server-side-encryption.key</name>
>   <value>arn:aws:kms:eu-west-1:98067faff834c:key/071a86ff-8881-4ba0-9230-95af6d01ca01</value>
> </property>
> {code}
> But sometimes we want to encrypt data in different paths with different keys 
> within the same bucket. For example, a partitioned table might benefit from 
> encrypting each partition with a different key when the partition represents 
> a customer or a country.
> [S3 already can encrypt using different keys/configurations at the object 
> level|https://aws.amazon.com/premiumsupport/knowledge-center/s3-encrypt-specific-folder/],
>  so what we need to do on Hadoop is to provide a way to map which key to use. 
> One idea could be mapping them in the XML config:
>  
> {code:java}
> <property>
>   <name>fs.s3a.server-side-encryption.paths</name>
>   <value>s3://bucket/my_table/country=ireland,s3://bucket/my_table/country=uk,s3://bucket/my_table/country=germany</value>
> </property>
> <property>
>   <name>fs.s3a.server-side-encryption.path-keys</name>
>   <value>arn:aws:kms:eu-west-1:90ireland09:key/ireland-key,arn:aws:kms:eu-west-1:980uk0993c:key/uk-key,arn:aws:kms:eu-west-1:98germany089:key/germany-key</value>
> </property>
> {code}
> Or potentially fetch the mappings from the filesystem:
>  
> {code:java}
> <property>
>   <name>fs.s3a.server-side-encryption.mappings</name>
>   <value>s3://bucket/configs/encryption_mappings.json</value>
> </property>
> {code}
> where encryption_mappings.json could be something like this:
>  
> {code:java}
> { 
>    "path": "s3://bucket/customer_table/customerId=abc123", 
>    "algorithm": "SSE-KMS",
>    "key": "arn:aws:kms:eu-west-1:933993746:key/abc123-key"
> }
> ...
> { 
>    "path": "s3://bucket/customer_table/customerId=xyx987", 
>    "algorithm": "SSE-KMS",
>    "key": "arn:aws:kms:eu-west-1:933993746:key/xyx987-key"
> }
> {code}
>  
>  


