[jira] [Comment Edited] (HADOOP-13336) S3A to support per-bucket configuration

Steve Loughran (JIRA) Wed, 14 Dec 2016 02:12:02 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-13336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742062#comment-15742062
 ]


Steve Loughran edited comment on HADOOP-13336 at 12/14/16 10:10 AM:
--------------------------------------------------------------------

This also matters for HADOOP-13345, where different buckets will have different 
metadata caching policies, including "none", so I am increasing its priority.

Possibilities, all of which assume falling back to the standard s3a options as 
the default. This means there is no way to undefine an option.

h3. per-bucket config. 

Lets you define everything for a bucket. 

Examples

* {{s3a://olap2/data/2017}} : s3a URL {{s3a://olap2/data/2017}}, with config 
set {{fs.s3a.bucket.olap2}} in configuration
* {{s3a://landsat}} : s3a URL {{s3a://landsat}}, with config set 
{{fs.s3a.bucket.landsat}} for anonymous credentials and no dynamo
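
A minimal sketch of the fallback lookup this option implies. It is not Hadoop 
code: it assumes per-bucket keys take the form 
{{fs.s3a.bucket.<bucket>.<option>}} and override {{fs.s3a.<option>}}, and the 
function name and sample values are hypothetical.

```python
from urllib.parse import urlsplit

BASE_PREFIX = "fs.s3a."
BUCKET_PREFIX = "fs.s3a.bucket."


def resolve_bucket_options(url, conf):
    """Return the effective fs.s3a.* options for the bucket named in url.

    Assumption: a key fs.s3a.bucket.<bucket>.<option> overrides
    fs.s3a.<option>; anything not overridden falls back to the base value.
    """
    bucket = urlsplit(url).hostname
    # Start from the standard s3a options, skipping all per-bucket entries.
    resolved = {
        key: value
        for key, value in conf.items()
        if key.startswith(BASE_PREFIX) and not key.startswith(BUCKET_PREFIX)
    }
    # Layer this bucket's overrides on top. An override can replace a base
    # value but cannot remove one, hence "no way to undefine an option".
    prefix = f"{BUCKET_PREFIX}{bucket}."
    for key, value in conf.items():
        if key.startswith(prefix):
            resolved[BASE_PREFIX + key[len(prefix):]] = value
    return resolved


conf = {
    "fs.s3a.endpoint": "s3.amazonaws.com",
    "fs.s3a.bucket.olap2.endpoint": "s3.eu-central-1.amazonaws.com",
}
print(resolve_bucket_options("s3a://olap2/data/2017", conf))
print(resolve_bucket_options("s3a://landsat", conf))
```

Note that the URL itself is unchanged; everything bucket-specific lives in the 
configuration, which is both the attraction and the maintenance cost.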



Pro
* Conceptually simple
* easy to get started
* trivial to move between other s3 clients, just change the prefix/redeclare 
the prefix binding

Con
* Expensive/complicated to maintain configurations.
* Need to delve into the configuration file to see what the mappings are. I can 
see this mattering a lot in support calls related to authentication.

h3. config via domain name in URL

This is what swift does: you define a domain, with the domain defining 
everything.


* {{s3a://olap2.dynamo/data/2017}} with config set {{fs.s3a.binding.dynamo}}
* {{s3a://landsat.anon}} with config set {{fs.s3a.binding.anon}} for anonymous 
credentials and no dynamo
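
A rough sketch of how the hostname could be split under this scheme, assuming 
the last dot-separated label names the binding. The function name is 
hypothetical. The final call shows the problem with bucket names that already 
contain dots.

```python
from urllib.parse import urlsplit


def split_bucket_and_binding(url):
    """Split the URL host into (bucket, binding).

    Assumption: the text after the last dot is the binding name. A host
    with no dot has no binding and would use the default options.
    """
    host = urlsplit(url).hostname or ""
    bucket, dot, binding = host.rpartition(".")
    if not dot:
        return host, None
    return bucket, binding


print(split_bucket_and_binding("s3a://olap2.dynamo/data/2017"))
print(split_bucket_and_binding("s3a://landsat.anon"))
print(split_bucket_and_binding("s3a://landsat"))
# A bucket whose real name is a FQDN is misread as bucket + binding.
print(split_bucket_and_binding("s3a://logs.example.com/2017"))
```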

Pro:
* shared config across multiple buckets
* easy to see when buckets have different config options without having to 
delve into the configuration file to see what the mappings are.
* Matches {{swift://}}
* Similar-ish to {{wasb}}

Con:
* the need to explicitly declare a domain stops you transparently moving a 
bucket to a different set of options, unless you add a way to also bind a 
bucket to a "configuration domain", behind the scenes.
* S3 supports FQDNs already
* not going to be compatible with previous versions or external s3 clients 
(e.g. EMR)

h3. Config via user:pass property in URL

This is a bit like Azure, where the FQDN defines the binding, and the username 
defines the bucket. Here I'm proposing the ability to define a new user which 
declares the binding info.

Examples

* {{s3a://dynamo@olap2/data/2017}} : s3a URL {{s3a://olap2/data/2017}}, with 
config set {{fs.s3a.binding.dynamo}}
* {{s3a://anon@landsat}} : s3a URL {{s3a://landsat}}, with config set 
{{fs.s3a.binding.anon}} for anonymous credentials.
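
A small sketch of the parsing this would need, assuming the user part of the 
URL is taken as the binding name and stripped before the bucket is addressed. 
The function name is hypothetical and this is not the existing s3a login 
handling.

```python
from urllib.parse import urlsplit


def split_binding_from_user(url):
    """Return (plain_url, config_set) for an s3a URL.

    Assumption: the user part of the authority names the binding, which maps
    to the config set fs.s3a.binding.<name>. With no user part the config set
    is None, meaning the default options apply.
    """
    parts = urlsplit(url)
    binding = parts.username  # None when the URL has no user part
    plain_url = f"{parts.scheme}://{parts.hostname}{parts.path}"
    config_set = f"fs.s3a.binding.{binding}" if binding else None
    return plain_url, config_set


print(split_binding_from_user("s3a://dynamo@olap2/data/2017"))
print(split_binding_from_user("s3a://anon@landsat"))
print(split_binding_from_user("s3a://landsat"))
```

The bucket name is untouched, so dots in bucket names cause no ambiguity here.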


Pro:
* Better for sharing configuration options across buckets
* consistent model with the AWSID:secret mechanism today
* see at a glance what the configuration set used is, easy to change.
* no complications related to domain naming
* Easy to switch between configuration sets on the command line, without adding 
new properties.

Con:
* needs different URLs if you don't want the default.

h3. Fundamentally rework Hadoop configuration to support a hierarchical configuration mechanism.

I'm not really proposing this, just wanted to mention it as the nominal 
ultimate option, instead of what we have today with different things (HA, 
Swift, Azure, etc), all defining different mechanisms for tuning customisation.

(2016-12-10: updated by fixing landsat config option name in the 
per-bucket-config example)





> S3A to support per-bucket configuration
> ---------------------------------------
>
>                 Key: HADOOP-13336
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13336
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>
> S3a now supports different regions, by way of declaring the endpoint —but you 
> can't do things like read in one region, write back in another (e.g. a distcp 
> backup), because only one region can be specified in a configuration.
> If s3a supported region declaration in the URL, e.g. s3a://b1.frankfurt 
> s3a://b2.seol , then this would be possible. 
> Swift does this with a full filesystem binding/config: endpoints, username, 
> etc, in the XML file. Would we need to do that much? It'd be simpler 
> initially to use a domain suffix of a URL to set the region of a bucket from 
> the domain and have the aws library sort the details out itself, maybe with 
> some config options for working with non-AWS infra



