[
https://issues.apache.org/jira/browse/HADOOP-15797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631774#comment-16631774
]
Allen Wittenauer commented on HADOOP-15797:
-------------------------------------------
First, let's put aside the issue of s3guard. It breaks things, as we'll see
in a bit.
Second, let's also remember that the whole point of this code is to pull things
OUT of the default classpath.
Now, what do 'builtin' and 'optional' mean?
builtin = required by a command. For example, hadoop distcp requires its jar
at runtime. It is not needed any other time, so it doesn't make sense to put
it AND ANY DEPENDENCIES on the classpath all the time.
optional = optional features the USER wants to enable. All of these features
need to always be available at runtime. Prior to s3guard, this was ALL of the
non-core file systems: S3, Azure, etc, etc. Users enable these features using
the HADOOP_OPTIONAL_TOOLS environment variable. Again, if I don't access S3
from my cluster, I don't want the AWS jars AND ANY DEPENDENCIES on the
classpath.
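As a minimal sketch of that opt-in, a site that reads from Azure but never
from S3 might set the variable as shown here. This assumes the setting lives in
the site-wide hadoop-env.sh and that the value is a comma-separated list of
module names; the module names are the ones mentioned in this issue.

```shell
# Assumed location: etc/hadoop/hadoop-env.sh (site-wide settings).
# Enable only the Azure connectors; hadoop-aws is deliberately left out,
# so the AWS jars and their dependencies stay off the classpath.
export HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake"
```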
It's also worthwhile pointing out that removing all of these jars from the
default classpath, in addition to allowing more user freedom, greatly speeds
the system up when measured across all java launches.
That said, it is now easy to see the problem that s3guard presents and how it
is an outlier. s3guard is a built-in command that depends upon components that are
also optional. IMO: using s3guard to determine any sort of functionality for
the rest of the system is completely and totally wrong.
That said, what makes anyone think that "hadoop_add_to_classpath_tools
hadoop-azure" should work? Optional bits come as shellprofiles, not as hooks
for built-ins. I mean the documentation here literally says:
{code}
## @description Run libexec/tools/module.sh to add to the classpath
## @description environment
{code}
If you want per-user settings for this (which is also weird, but whatever),
then modifying .hadoop-env is the way to go.
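A per-user version might look like the following sketch. It assumes
~/.hadoop-env is sourced as ordinary shell before the optional tools are
evaluated, and that appending to HADOOP_OPTIONAL_TOOLS there is honored; both
are assumptions worth checking against the Hadoop version in use.

```shell
# Hypothetical contents of ~/.hadoop-env (per-user settings).
# Append hadoop-azure to whatever the site-wide config already enabled.
# The ${VAR:+...} expansion avoids a leading comma when the variable is
# unset or empty.
HADOOP_OPTIONAL_TOOLS="${HADOOP_OPTIONAL_TOOLS:+${HADOOP_OPTIONAL_TOOLS},}hadoop-azure"
export HADOOP_OPTIONAL_TOOLS
```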
> optional / builtin modules confused for cloud storage
> -----------------------------------------------------
>
> Key: HADOOP-15797
> URL: https://issues.apache.org/jira/browse/HADOOP-15797
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/adl, fs/azure, fs/s3
> Affects Versions: 3.2.0, 3.1.1
> Reporter: Sean Mackrory
> Priority: Major
>
> Throwing this in your .hadooprc results in hadoop-aws being in the classpath
> but not hadoop-azure*:
> {quote}
> hadoop_add_to_classpath_tools hadoop-aws
> hadoop_add_to_classpath_tools hadoop-azure
> hadoop_add_to_classpath_tools hadoop-azure-datalake
> {quote}
> It would seem that the core issue is that that requires the module to have
> listed it's dependencies in MODULE_NAME.tools-builtin.txt, whereas the Azure
> connectors only have them listed in MODULE_NAME.tools-optional.txt. S3 does
> both, and there's a comment in it's POM about how it needs to do this because
> of the "hadoop s3guard" CLI.
> Maybe there's some history that I'm missing here, but I think what's wrong
> here is that hadoop_add_to_classpath should get what it needs from optional
> modules. builtin modules shouldn't even need hadoop_add_to_classpath to be
> added anyway.