[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-33212:
-----------------------------
    Description: 
Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
hadoop-client-runtime, which shade third-party dependencies such as Guava, 
protobuf, jetty, etc. This JIRA switches Spark to use these jars instead of 
hadoop-common, hadoop-client, etc. Benefits include:
 * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions 
of Hadoop have migrated to Guava 27.0+, and resolving the Guava conflict 
requires that Hadoop not leak its Guava dependency onto Spark's class path.
 * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark 
uses both client-side and server-side Hadoop APIs from modules such as 
hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api 
allows Spark to use only the public client API on the Hadoop side.
 * It provides better isolation from Hadoop dependencies. In the future, Spark 
can evolve without worrying about dependencies pulled in from the Hadoop side 
(which used to be many).

*This JIRA introduces some behavior changes for users who run Spark compiled 
with Hadoop 3.x:*
- Users now need to make sure the class path contains the `hadoop-client-api` 
and `hadoop-client-runtime` jars when they deploy Spark with the 
`hadoop-provided` option. In addition, it is highly recommended that they put 
these two jars before other Hadoop jars on the class path. Otherwise, 
conflicts (for example over Guava) can occur when classes are loaded from the 
non-shaded Hadoop jars.
- Since the new shaded Hadoop clients no longer include third-party 
dependencies, users who used to rely on these transitively now need to put 
the corresponding jars on their class path explicitly.
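The two points above can be sketched as a class path layout for a Spark build 
that uses the `hadoop-provided` option against an external Hadoop 3.x install. 
The install location, jar versions, and the Guava jar path below are 
illustrative assumptions, not part of this JIRA:

```shell
# Sketch: class path for Spark (hadoop-provided) on an external Hadoop 3.x.
# HADOOP_HOME, versions, and the Guava path are assumed for illustration.
HADOOP_HOME=/opt/hadoop-3.3.0

# Put the shaded client jars FIRST, so their relocated Guava/protobuf/jetty
# classes are found before anything in non-shaded Hadoop jars on the path.
SPARK_DIST_CLASSPATH="$HADOOP_HOME/share/hadoop/client/hadoop-client-api-3.3.0.jar"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$HADOOP_HOME/share/hadoop/client/hadoop-client-runtime-3.3.0.jar"

# The shaded clients no longer expose third-party libraries, so a job that
# used to pick up e.g. Guava transitively must now add it explicitly.
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/path/to/guava-27.0-jre.jar"

export SPARK_DIST_CLASSPATH
echo "$SPARK_DIST_CLASSPATH"
```

Spark's `hadoop-provided` packaging picks up Hadoop jars from 
`SPARK_DIST_CLASSPATH`; the same ordering concern applies if the jars are 
listed via `spark.driver.extraClassPath`/`spark.executor.extraClassPath` 
instead.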


  was:
Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
hadoop-common, hadoop-client etc. Benefits include:
 * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions 
of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava conflicts, 
Spark depends on Hadoop to not leaking dependencies.
 * It makes Spark/Hadoop dependency cleaner. Currently Spark uses both 
client-side and server-side Hadoop APIs from modules such as hadoop-common, 
hadoop-yarn-server-common etc. Moving to hadoop-client-api allows use to only 
use public/client API from Hadoop side.
 * Provides a better isolation from Hadoop dependencies. In future Spark can 
better evolve without worrying about dependencies pulled from Hadoop side 
(which used to be a lot).

There are some behavior changes introduced with this JIRA, when people use 
Spark compiled with Hadoop 3.x:
- Users now need to make sure class path contains `hadoop-client-api` and 
`hadoop-client-runtime` jars when they deploy Spark with the `hadoop-provided` 
option. In addition, it is high recommended that they put these two jars before 
other Hadoop jars in the class path. Otherwise, conflicts such as from Guava 
could happen if classes are loaded from the other non-shaded Hadoop jars.
- Since the new shaded Hadoop clients no longer include 3rd party dependencies. 
Users who used to depend on these now need to explicitly put the jars in their 
class path.



> Move to shaded clients for Hadoop 3.x profile
> ---------------------------------------------
>
>                 Key: SPARK-33212
>                 URL: https://issues.apache.org/jira/browse/SPARK-33212
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Spark Submit, SQL, YARN
>    Affects Versions: 3.0.1
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>              Labels: releasenotes
>             Fix For: 3.1.0
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
