[
https://issues.apache.org/jira/browse/HADOOP-13336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742062#comment-15742062
]
Steve Loughran edited comment on HADOOP-13336 at 12/14/16 10:10 AM:
--------------------------------------------------------------------
This also matters for HADOOP-13345, where different buckets will have different
metadata caching policies, including "none", so I'm increasing its priority.
Possibilities follow, all of which assume falling back to the s3a standard
options as the default. This means there is no way to undefine an option.
h3. per-bucket config.
Lets you define everything for a bucket.
Examples
* {{s3a://olap2/data/2017}} : s3a URL {{s3a://olap2/data/2017}}, with config
set {{fs.s3a.bucket.olap2}} in configuration
* {{s3a://landsat}} : s3a URL {{s3a://landsat}}, with config set
{{fs.s3a.bucket.landsat}} for anonymous credentials and no dynamo
Pro
* Conceptually simple
* easy to get started
* trivial to move between other s3 clients: just change the prefix or redeclare
the prefix binding
Con
* Expensive/complicated to maintain configurations.
* Need to delve into the configuration file to see what the mappings are. I can
see this mattering a lot in support calls related to authentication.
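To make the fallback rule concrete, here is a minimal sketch of how a per-bucket overlay could be resolved. It is not a proposed implementation: it uses a plain {{Map}} rather than Hadoop's {{Configuration}}, and it assumes that the config set {{fs.s3a.bucket.olap2}} is a key prefix, so that {{fs.s3a.bucket.<bucket>.<option>}} overrides {{fs.s3a.<option>}}. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: overlay per-bucket options on top of the base fs.s3a.* options. */
final class PerBucketConfig {
    static final String BASE_PREFIX = "fs.s3a.";
    // Assumed layout: fs.s3a.bucket.<bucket>.<option>
    static final String BUCKET_PREFIX = "fs.s3a.bucket.";

    /**
     * Returns a copy of conf in which every fs.s3a.bucket.<bucket>.<option>
     * value has been copied over fs.s3a.<option>. Options without a
     * per-bucket value keep the base value.
     */
    static Map<String, String> resolve(Map<String, String> conf, String bucket) {
        Map<String, String> resolved = new HashMap<>(conf);
        String prefix = BUCKET_PREFIX + bucket + ".";
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                String option = e.getKey().substring(prefix.length());
                resolved.put(BASE_PREFIX + option, e.getValue());
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("fs.s3a.endpoint", "s3.amazonaws.com");
        // Per-bucket override for landsat: anonymous credentials.
        conf.put("fs.s3a.bucket.landsat.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider");

        // landsat picks up the override; olap2 has none and sees only the base options.
        System.out.println(resolve(conf, "landsat").get("fs.s3a.aws.credentials.provider"));
        System.out.println(resolve(conf, "olap2").get("fs.s3a.aws.credentials.provider"));
    }
}
```

Note that the overlay can only replace a base value with another value; it cannot remove one, which is the "no way to undefine an option" limitation mentioned above.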
h3. config via domain name in URL
This is what swift does: you define a domain, with the domain defining
everything.
* {{s3a://olap2.dynamo/data/2017}} with config set {{fs.s3a.binding.dynamo}}
* {{s3a://landsat.anon}} with config set {{fs.s3a.binding.anon}} for anonymous
credentials and no dynamo
Pro:
* shared config across multiple buckets
* easy to see when buckets have different config options without having to
delve into the configuration file to see what the mappings are.
* Matches {{swift://}}
* Similar-ish to {{wasb}}
Con:
* the need to explicitly declare a domain stops you transparently moving a
bucket to a different set of options, unless you add a way to also bind a
bucket to a "configuration domain", behind the scenes.
* S3 supports FQDNs already
* not going to be compatible with previous versions or with external s3
clients (e.g. EMR)
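For illustration, a rough sketch of how the domain form could be parsed, assuming the last dot-separated label of the host names the binding and everything before it is the bucket. The class name and the split rule are my assumptions, not an existing s3a API.

```java
import java.net.URI;

/** Sketch only: split an s3a host of the form <bucket>.<binding>. */
final class DomainBinding {
    final String bucket;
    final String binding; // null when the host carries no binding suffix

    private DomainBinding(String bucket, String binding) {
        this.bucket = bucket;
        this.binding = binding;
    }

    /** Assumption: the text after the last '.' in the host is the binding name. */
    static DomainBinding parse(URI uri) {
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in " + uri);
        }
        int dot = host.lastIndexOf('.');
        if (dot < 0) {
            return new DomainBinding(host, null);
        }
        return new DomainBinding(host.substring(0, dot), host.substring(dot + 1));
    }

    /** Name of the config set to load, e.g. fs.s3a.binding.dynamo; null if none. */
    String configKey() {
        return binding == null ? null : "fs.s3a.binding." + binding;
    }

    public static void main(String[] args) {
        DomainBinding olap = parse(URI.create("s3a://olap2.dynamo/data/2017"));
        System.out.println(olap.bucket + " -> " + olap.configKey());
        DomainBinding landsat = parse(URI.create("s3a://landsat.anon"));
        System.out.println(landsat.bucket + " -> " + landsat.configKey());
    }
}
```

The "S3 supports FQDNs already" problem shows up directly here: a bucket whose own name contains a dot cannot be told apart from a bucket plus binding suffix, so the last label of the bucket name would be misread as a binding.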
h3. Config via user:pass property in URL
This is a bit like Azure, where the FQDN defines the binding, and the username
defines the bucket. Here I'm proposing the ability to define a new user which
declares the binding info.
Examples
* {{s3a://dynamo@olap2/data/2017}} : s3a URL {{s3a://olap2/data/2017}}, with
config set {{fs.s3a.binding.dynamo}}
* {{s3a://anon@landsat}} : s3a URL {{s3a://landsat}}, with config set
{{fs.s3a.binding.anon}} for anonymous credentials.
Pro:
* Better for sharing configuration options across buckets
* consistent model with the AWSID:secret mechanism today
* see at a glance which configuration set is in use; easy to change.
* no complications related to domain naming
* Easy to switch between configuration sets on the command line, without adding
new properties.
Con:
* needs different URLs if you don't want the default.
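A sketch of how the user form could be handled with {{java.net.URI}}: take the user-info part as the binding name and rebuild the plain bucket URL without it. Treating a user-info that contains a colon as AWSID:secret credentials rather than a binding is my assumption, as are the class and method names.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Sketch only: read a binding name from the user-info part of an s3a URL. */
final class UserInfoBinding {

    /**
     * Returns the binding name, or null if the URL has no user-info.
     * Assumption: user-info containing ':' is an AWSID:secret pair, not a binding.
     */
    static String bindingName(URI uri) {
        String userInfo = uri.getUserInfo();
        if (userInfo == null || userInfo.indexOf(':') >= 0) {
            return null;
        }
        return userInfo;
    }

    /** Rebuilds the URL without the user-info, e.g. s3a://olap2/data/2017. */
    static URI stripBinding(URI uri) {
        try {
            return new URI(uri.getScheme(), null, uri.getHost(), uri.getPort(),
                    uri.getPath(), uri.getQuery(), uri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot strip user-info from " + uri, e);
        }
    }

    public static void main(String[] args) {
        URI uri = URI.create("s3a://dynamo@olap2/data/2017");
        String binding = bindingName(uri);
        System.out.println("binding: fs.s3a.binding." + binding);
        System.out.println("bucket URL: " + stripBinding(uri));
    }
}
```

Because the bucket stays in the host position, none of the domain-naming ambiguity of the previous option arises; the cost is the one listed above, that a non-default binding has to appear in the URL.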
h3. Fundamentally rework Hadoop configuration to support a hierarchical configuration mechanism.
I'm not really proposing this, just wanted to mention it as the nominal
ultimate option, instead of what we have today with different things (HA,
Swift, Azure, etc), all defining different mechanisms for tuning customisation.
(2016-12-10: updated by fixing landsat config option name in the
per-bucket-config example)
> S3A to support per-bucket configuration
> ---------------------------------------
>
> Key: HADOOP-13336
> URL: https://issues.apache.org/jira/browse/HADOOP-13336
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
>
> S3a now supports different regions, by way of declaring the endpoint —but you
> can't do things like read in one region, write back in another (e.g. a distcp
> backup), because only one region can be specified in a configuration.
> If s3a supported region declaration in the URL, e.g. s3a://b1.frankfurt
> s3a://b2.seoul , then this would be possible.
> Swift does this with a full filesystem binding/config: endpoints, username,
> etc, in the XML file. Would we need to do that much? It'd be simpler
> initially to use a domain suffix of a URL to set the region of a bucket from
> the domain and have the aws library sort the details out itself, maybe with
> some config options for working with non-AWS infra