[ 
https://issues.apache.org/jira/browse/HADOOP-14138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958744#comment-15958744
 ] 

Steve Loughran commented on HADOOP-14138:
-----------------------------------------

bq. why should s3a entries exist in core-default.xml?

Because that's where you set default values which are then overridden in 
core-site.xml. We don't have any notion of per-FS resources other than 
{{hdfs-default.xml}} and {{hdfs-site.xml}}. By putting defaults 

bq. core-default is supposed to contain defaults for most config values, and 
serves as documentation.

Exactly. And because it is loaded before core-site.xml, there is a 
straightforward, easy to understand override mechanism. 
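
A minimal sketch of that override order, as a toy model rather than Hadoop's actual {{Configuration}} class: resources are applied in sequence and later values replace earlier ones. The class name {{LayeredConfig}} and both endpoint values are made up for illustration; only the {{fs.s3a.impl}} and {{fs.s3a.endpoint}} keys come from the discussion.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy model of layered configuration resources: later resources override
 * earlier ones. Illustrative only; this is NOT Hadoop's Configuration class.
 */
final class LayeredConfig {
    private final Map<String, String> effective = new LinkedHashMap<>();

    /** Apply one resource; its entries replace any values loaded earlier. */
    LayeredConfig load(Map<String, String> resource) {
        effective.putAll(resource);
        return this;
    }

    String get(String key) {
        return effective.get(key);
    }

    /** Build a config from a stand-in core-default, then a stand-in core-site. */
    static LayeredConfig demo() {
        // Stand-in for core-default.xml: the shipped defaults.
        Map<String, String> coreDefault = new LinkedHashMap<>();
        coreDefault.put("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        coreDefault.put("fs.s3a.endpoint", "default-endpoint.example"); // placeholder value

        // Stand-in for core-site.xml: site-specific overrides.
        Map<String, String> coreSite = new LinkedHashMap<>();
        coreSite.put("fs.s3a.endpoint", "site-endpoint.example"); // placeholder value

        // Load order is what makes the override work: defaults first, site second.
        return new LayeredConfig().load(coreDefault).load(coreSite);
    }

    public static void main(String[] args) {
        LayeredConfig conf = demo();
        System.out.println(conf.get("fs.s3a.impl"));
        System.out.println(conf.get("fs.s3a.endpoint"));
    }
}
```

In this model the {{fs.s3a.impl}} default survives because the site resource never mentions it, while the endpoint takes the site value.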

bq. If someone wants to use s3a, I'd expect them to explicitly set it up in 
their Configuration,

Well, no. Because that removes the ability for you to set options in core-site or 
elsewhere, including but not limited to {{fs.s3a.endpoint}} and all the 
[fs.s3a. security settings|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#S3A_Authentication_methods], 
along with many others. 

bq. or rely on the ServiceLoader approach - which this jira reverses.

The service loader mechanism for S3a was pulled because:


# it was a performance hit, especially once we shifted to the fully shaded AWS 
JAR, that is: the one which stops breaking downstream apps due to forced 
jackson upgrades, and 
# the service loader itself was a bit of trouble. HADOOP-12636 is the key one: 
in Hadoop 2.7.2, if you had hadoop-aws.jar on the CP but not amazon-s3-sdk, 
clients would fail on startup with a class not found exception during FS static 
init, *even if s3a wasn't used*; HADOOP-13323 moved the caught-but-logged 
entry from logging at warn to debug, because even that stack was causing 
confusion.
# Finally, the service loader doesn't register {{FileContext}} bindings, so 
if you used that API to talk to filesystems, those core-default entries were 
mandatory.

Because the {{fs.s3a.impl}} declaration was already in core-default, the 
consequence of this introspection was, at best, startup delays and, at worst, 
[startup failures|http://stackoverflow.com/questions/30426245/apache-spark-classloader-cannot-find-classdef-in-the-jar]. 
So we pulled it. Now any classloader delays are postponed until the first s3a, 
wasb, adl or swift FS instance is created, which happens if and only if the 
caller uses the class.
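
To illustrate the difference, here is a rough sketch of lazy binding, written as a standalone toy and not as Hadoop's {{FileSystem}} code: a scheme maps to a class name (as an {{fs.SCHEME.impl}} entry does), and the class is only loaded when that scheme is first requested. {{LazySchemeRegistry}} and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy registry contrasting with eager service discovery: bindings are stored
 * as class names and only resolved when a scheme is first used.
 * Illustrative only; not Hadoop's FileSystem implementation.
 */
final class LazySchemeRegistry {
    private final Map<String, String> classNames = new HashMap<>();
    private final Map<String, Class<?>> loaded = new HashMap<>();

    /** Record a binding by name; no class loading happens here. */
    void declare(String scheme, String className) {
        classNames.put(scheme, className);
    }

    /** Load the class on first use; a missing class can only fail at this point. */
    Class<?> resolve(String scheme) {
        Class<?> cached = loaded.get(scheme);
        if (cached != null) {
            return cached;
        }
        String name = classNames.get(scheme);
        if (name == null) {
            throw new IllegalArgumentException("No binding for scheme: " + scheme);
        }
        try {
            Class<?> clazz = Class.forName(name);
            loaded.put(scheme, clazz);
            return clazz;
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Cannot load class for scheme " + scheme, e);
        }
    }

    /** Number of bindings whose classes have actually been loaded so far. */
    int loadedCount() {
        return loaded.size();
    }
}
```

Declaring a binding whose class is absent from the classpath costs nothing and breaks nothing until {{resolve()}} is called for that scheme, which mirrors the "if and only if the caller uses the class" behaviour described above.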


You have to consider the current service loader a first pass; HADOOP-14132 
discusses how to do it better: scan a zero-dependency class file which declares 
schemas. It could list a per-fs XML resource, but the problem which arises 
there is the ordering of resources: the FS scan always takes place after the 
core-default/core-site load, and as {{Configuration.addDefaultResource()}} 
doesn't let you declare an ordering of defaults, any per-fs resource load would 
stamp over core-default. We'd need to change {{addDefaultResource()}} to 
permit a list of before-resources and after-resources to be defined.
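
As a purely hypothetical sketch of what such ordering could look like (no such API exists in Hadoop today, and {{OrderedDefaults}}, {{addDefaultResourceBefore()}} and {{s3a-default.xml}} are invented names), a resource list might accept an insertion point instead of always appending:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical ordered list of default resources. Not an existing Hadoop API;
 * it only models the idea of declaring where a resource sits in the load order.
 */
final class OrderedDefaults {
    private final List<String> resources = new ArrayList<>();

    /** Append-only behaviour in this model: the resource loads last, so it wins. */
    void addDefaultResource(String name) {
        if (!resources.contains(name)) {
            resources.add(name);
        }
    }

    /** Hypothetical variant: place 'name' so that it loads before 'before'. */
    void addDefaultResourceBefore(String name, String before) {
        if (resources.contains(name)) {
            return;
        }
        int idx = resources.indexOf(before);
        if (idx < 0) {
            resources.add(name);      // anchor unknown: fall back to appending
        } else {
            resources.add(idx, name); // loads earlier, so 'before' can override it
        }
    }

    /** Copy of the load order; later entries override earlier ones. */
    List<String> loadOrder() {
        return new ArrayList<>(resources);
    }
}
```

With this, a late-discovered per-FS resource could be slotted in ahead of core-site.xml, so that site settings still take precedence over the per-FS defaults.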

Yes, the consequence of this change is that the {{fs.s3a.impl}} class isn't 
automatically registered, but if core-default isn't loading, then your code is 
inevitably going to break in some other way; I'd suspect security being a key point.

> Remove S3A ref from META-INF service discovery, rely on existing core-default 
> entry
> -----------------------------------------------------------------------------------
>
>                 Key: HADOOP-14138
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14138
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Critical
>             Fix For: 2.8.0, 2.7.4, 3.0.0-alpha3
>
>         Attachments: HADOOP-14138.001.patch, HADOOP-14138-branch-2-001.patch
>
>
> As discussed in HADOOP-14132, the shaded AWS library is killing performance 
> starting all hadoop operations, due to classloading on FS service discovery.
> This is despite the fact that there is an entry for fs.s3a.impl in 
> core-default.xml, *we don't need service discovery here*
> Proposed:
> # cut the entry from 
> {{/hadoop-aws/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem}}
> # when HADOOP-14132 is in, move to that, including declaring an XML file 
> exclusively for s3a entries
> I want this one in first as it's a major performance regression, and one we 
> could actually backport to 2.7.x, just to improve load time slightly there too



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
