[GitHub] spark pull request #20923: [SPARK-23807][BUILD][WIP] Add Hadoop 3 profile wi...

steveloughran Thu, 29 Mar 2018 06:37:00 -0700

Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20923#discussion_r178057319
  
    --- Diff: hadoop-cloud/pom.xml ---
    @@ -177,6 +214,188 @@
                 </exclusion>
               </exclusions>
             </dependency>
    +        <!--
    +          the AWS module pulls in jackson; its transitive dependencies can create
    +          intra-jackson-module version problems.
    +          -->
    +        <dependency>
    +          <groupId>org.apache.hadoop</groupId>
    +          <artifactId>hadoop-aws</artifactId>
    +          <version>${hadoop.version}</version>
    +          <scope>${hadoop.deps.scope}</scope>
    +          <exclusions>
    +            <exclusion>
    +              <groupId>org.apache.hadoop</groupId>
    +              <artifactId>hadoop-common</artifactId>
    +            </exclusion>
    +            <exclusion>
    +              <groupId>commons-logging</groupId>
    +              <artifactId>commons-logging</artifactId>
    +            </exclusion>
    +            <exclusion>
    +              <groupId>org.codehaus.jackson</groupId>
    +              <artifactId>jackson-mapper-asl</artifactId>
    +            </exclusion>
    +            <exclusion>
    +              <groupId>org.codehaus.jackson</groupId>
    +              <artifactId>jackson-core-asl</artifactId>
    +            </exclusion>
    +            <exclusion>
    +              <groupId>com.fasterxml.jackson.core</groupId>
    +              <artifactId>jackson-core</artifactId>
    +            </exclusion>
    +            <exclusion>
    +              <groupId>com.fasterxml.jackson.core</groupId>
    +              <artifactId>jackson-databind</artifactId>
    +            </exclusion>
    +            <exclusion>
    +              <groupId>com.fasterxml.jackson.core</groupId>
    +              <artifactId>jackson-annotations</artifactId>
    +            </exclusion>
    +          </exclusions>
    +        </dependency>
    +        <dependency>
    +          <groupId>org.apache.hadoop</groupId>
    +          <artifactId>hadoop-openstack</artifactId>
    +          <version>${hadoop.version}</version>
    +          <scope>${hadoop.deps.scope}</scope>
    +          <exclusions>
    +            <exclusion>
    +              <groupId>org.apache.hadoop</groupId>
    +              <artifactId>hadoop-common</artifactId>
    +            </exclusion>
    +            <exclusion>
    +              <groupId>commons-logging</groupId>
    +              <artifactId>commons-logging</artifactId>
    +            </exclusion>
    +            <exclusion>
    +              <groupId>junit</groupId>
    +              <artifactId>junit</artifactId>
    +            </exclusion>
    +            <exclusion>
    +              <groupId>org.mockito</groupId>
    +              <artifactId>mockito-all</artifactId>
    +            </exclusion>
    +          </exclusions>
    +        </dependency>
    +
    +        <!--
    +        Add joda time to ensure that anything downstream which doesn't pull in spark-hive
    +        gets the correct joda time artifact, so doesn't have auth failures on later Java 8 JVMs
    +        -->
    +        <dependency>
    +          <groupId>joda-time</groupId>
    +          <artifactId>joda-time</artifactId>
    +          <scope>${hadoop.deps.scope}</scope>
    +        </dependency>
    +        <!-- explicitly declare the jackson artifacts desired -->
    +        <dependency>
    +          <groupId>com.fasterxml.jackson.core</groupId>
    +          <artifactId>jackson-databind</artifactId>
    +          <scope>${hadoop.deps.scope}</scope>
    +        </dependency>
    +        <dependency>
    +          <groupId>com.fasterxml.jackson.core</groupId>
    +          <artifactId>jackson-annotations</artifactId>
    +          <scope>${hadoop.deps.scope}</scope>
    +        </dependency>
    +        <dependency>
    +          <groupId>com.fasterxml.jackson.dataformat</groupId>
    +          <artifactId>jackson-dataformat-cbor</artifactId>
    +          <version>${fasterxml.jackson.version}</version>
    +        </dependency>
    +        <!--Explicit declaration to force in Spark version into transitive dependencies -->
    +        <dependency>
    +          <groupId>org.apache.httpcomponents</groupId>
    +          <artifactId>httpclient</artifactId>
    +          <scope>${hadoop.deps.scope}</scope>
    +        </dependency>
    +        <!--Explicit declaration to force in Spark version into transitive dependencies -->
    +        <dependency>
    +          <groupId>org.apache.httpcomponents</groupId>
    +          <artifactId>httpcore</artifactId>
    +          <scope>${hadoop.deps.scope}</scope>
    +        </dependency>
    +      </dependencies>
    +    </profile>
    +
    +    <!--
    +     Hadoop 3 simplifies the classpath, and adds a new committer base class which
    +     enables store-specific committers.
    +    -->
    +    <profile>
    +      <id>hadoop-3</id>
    +      <properties>
    +        <extra.source.dir>src/hadoop-3/main/scala</extra.source.dir>
    +        <extra.testsource.dir>src/hadoop-3/test/scala</extra.testsource.dir>
    +      </properties>
    +
    +      <build>
    +        <plugins>
    +          <!-- Include a source dir depending on the Scala version -->
    +          <plugin>
    +            <groupId>org.codehaus.mojo</groupId>
    +            <artifactId>build-helper-maven-plugin</artifactId>
    +            <executions>
    +              <execution>
    +                <id>add-scala-sources</id>
    +                <phase>generate-sources</phase>
    +                <goals>
    +                  <goal>add-source</goal>
    +                </goals>
    +                <configuration>
    +                  <sources>
    +                    <source>${extra.source.dir}</source>
    +                  </sources>
    +                </configuration>
    +              </execution>
    +              <execution>
    +                <id>add-scala-test-sources</id>
    +                <phase>generate-test-sources</phase>
    +                <goals>
    +                  <goal>add-test-source</goal>
    +                </goals>
    +                <configuration>
    +                  <sources>
    +                    <source>${extra.testsource.dir}</source>
    +                  </sources>
    +                </configuration>
    +              </execution>
    +            </executions>
    +          </plugin>
    +        </plugins>
    +
    +      </build>
    +      <dependencies>
    +
    +        <!--
    +        There's now a hadoop-cloud-storage which transitively pulls in the store JARs,
    +        but it still needs some selective exclusion across versions, especially 3.0.x.
    --- End diff --
    
    Excluding hadoop-client means there's no need to worry about any of the stuff explicitly excluded from hadoop-client in the Spark root pom (asm/asm, jackson, etc).
    
    Hadoop 3.0.1 declares hadoop-client as a compile-time dependency of [hadoop-cloud-storage](https://github.com/apache/hadoop/blob/branch-3.0.1/hadoop-cloud-storage-project/hadoop-cloud-storage/pom.xml).
    
    From 3.0.2 onwards, hadoop-client has been cut down to `provided` scope and `azure-datalake` has been added as a dependency ([commit 3c03672e](https://github.com/apache/hadoop/commit/3c03672e876ddbd6a6425ea1a056ad13adc309ea)), so the module is complete w.r.t. the ASF connectors.
    There's also a fix for the AWS shaded SDK to exclude netty ([HADOOP-15264](https://github.com/apache/hadoop/commit/e015e009897e481edc79f4ba72e2c53610b178a3)), because of [aws-sdk-java/issues/1488](https://github.com/aws/aws-sdk-java/issues/1488).
    
    The individual hadoop cloud modules (hadoop-aws, hadoop-azure, ...) have also downgraded hadoop-client to `provided`. If you pull in any of those, you only get the extra artifacts needed to connect to the relevant cloud endpoint, and you are expected to pull in the same hadoop-client version elsewhere for things to work.
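    
    A minimal sketch of what that means for a downstream pom, assuming the project defines a `hadoop.version` property (3.0.2 or later) and wants only the S3A connector. The pairing of the two artifacts is the point; the default compile scope used here is illustrative and not taken from this patch:
    
    ```xml
    <dependencies>
      <!-- Connector only: brings in hadoop-aws and the SDK it needs. -->
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>${hadoop.version}</version>
      </dependency>
      <!-- hadoop-client is 'provided' inside hadoop-aws, so it is not inherited
           transitively; declare it here at the same version. -->
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
      </dependency>
    </dependencies>
    ```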
    
    Here's the dependency list for spark-hadoop-cloud and 3.0.2-SNAPSHOT; 3.1 will be the same unless there's a last-minute update to one of the external SDKs or jetty.
    
    ```
    [INFO] +- org.apache.hadoop:hadoop-cloud-storage:jar:3.0.2-SNAPSHOT:compile
    [INFO] |  +- org.apache.hadoop:hadoop-aliyun:jar:3.0.2-SNAPSHOT:compile
    [INFO] |  |  \- com.aliyun.oss:aliyun-sdk-oss:jar:2.8.3:compile
    [INFO] |  |     \- org.jdom:jdom:jar:1.1:compile
    [INFO] |  +- org.apache.hadoop:hadoop-aws:jar:3.0.2-SNAPSHOT:compile
    [INFO] |  |  \- com.amazonaws:aws-java-sdk-bundle:jar:1.11.271:compile
    [INFO] |  +- org.apache.hadoop:hadoop-azure:jar:3.0.2-SNAPSHOT:compile
    [INFO] |  |  +- com.microsoft.azure:azure-storage:jar:5.4.0:compile
    [INFO] |  |  |  \- com.microsoft.azure:azure-keyvault-core:jar:0.8.0:compile
    [INFO] |  |  \- org.eclipse.jetty:jetty-util-ajax:jar:9.3.19.v20170502:compile
    [INFO] |  +- org.apache.hadoop:hadoop-azure-datalake:jar:3.0.2-SNAPSHOT:compile
    [INFO] |  |  \- com.microsoft.azure:azure-data-lake-store-sdk:jar:2.2.5:compile
    [INFO] |  \- org.apache.hadoop:hadoop-openstack:jar:3.0.2-SNAPSHOT:compile
    ```
    
    Given that Hadoop 3.0.2+ is downgrading hadoop-client to `provided`, and that's the minimum version this patch will build against, the exclusion is mostly superfluous: it's there to block regressions rather than to actually keep hadoop-client out.
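    
    As a sketch only, the exclusion under discussion would look roughly like this. The `${hadoop.version}` and `${hadoop.deps.scope}` properties are the ones already used in this pom; the patch's real exclusion list may well contain more entries than the single one shown:
    
    ```xml
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-cloud-storage</artifactId>
      <version>${hadoop.version}</version>
      <scope>${hadoop.deps.scope}</scope>
      <exclusions>
        <!-- Guards against hadoop-client coming back as a compile-scope
             transitive dependency, as it was in Hadoop 3.0.1. -->
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    ```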



