RE: Hadoop as Compile time dependency in Sqoop2

Zhou, Richard Thu, 11 Dec 2014 00:19:13 -0800

Hi Jarcec:
Thank you very much for your clarification about the history.

The reason we want to change "provided" to "compile" is to implement 
"Delegation Token Support" [1] (review board [2]). The status in Hadoop is 
shown below.
Hadoop 2.5.1 or before: all classes used to implement Kerberos support are in 
the Hadoop-auth component, which depends on only a few libraries, none of them 
Hadoop related. It is added on the Sqoop client side (shell component [3]) as 
"compile", as we agreed before.
Hadoop 2.6.0: There was a refactoring to support delegation tokens in Hadoop 
[4]. Most components in Hadoop, such as RM, HttpFS and KMS, have rewritten 
their authentication mechanism to use delegation tokens. However, all 
delegation token related classes are in Hadoop-common instead of Hadoop-auth, 
because they use the UserGroupInformation class.

So if Sqoop needs to support delegation tokens, it has to include the 
Hadoop-common lib, because I believe that copying code is an unacceptable 
solution. Even with Hadoop shims, which are a good way to support different 
versions of Hadoop (I am +1 on writing Hadoop shims in Sqoop like Pig, Hive 
etc.), Hadoop-common is still a dependency. For example, the client side 
(beeline) in Hive depends on the Hadoop-common lib [5]. So I don't think it is 
a big problem to add Hadoop-common.

Additionally, I agree with Abe that wire compatibility is another reason to 
change "provided" to "compile", since the Java interfaces involved are marked 
"Unstable". That is a potential problem in the future.

So I prefer to add Hadoop-common lib as "compile" to make "Delegation Token 
Support" happen.

Add [email protected].

Links:
1: https://issues.apache.org/jira/browse/SQOOP-1776 
2: https://reviews.apache.org/r/28795/ 
3: https://github.com/apache/sqoop/blob/sqoop2/shell/pom.xml#L75 
4: https://issues.apache.org/jira/browse/HADOOP-10771 
5: https://github.com/apache/hive/blob/trunk/beeline/pom.xml#L133 

Richard

-----Original Message-----
From: Jarek Jarcec Cecho [mailto:[email protected]] On Behalf Of Jarek Jarcec 
Cecho
Sent: Thursday, December 11, 2014 1:43 PM
To: [email protected]
Subject: Re: Hadoop as Compile time dependency in Sqoop2

Hi Abe,
thank you very much for surfacing the question. I think that there are several 
twists to it, so my apologies as this will be a long answer :)

When we started working on Sqoop 2 a few years back, we intentionally pushed 
the Hadoop dependency as far from shared libraries as possible. The intention 
was that no component in common or core should depend on or use any Hadoop 
APIs, and that those should be isolated to separate modules 
(execution/submission engine). The reason is that Hadoop doesn’t have a 
particularly good track record of keeping backward compatibility, and it has 
bitten a lot of projects in the past. For example, every single project that I 
know of that is using MR needs to have a shim layer that deals with the API 
differences (Pig [1], Hive [2], …). The only exception that I’m aware of is 
Sqoop 1, where we did not have to introduce shims, and that is only because we 
(shamelessly) copied code from Hadoop to our own code base. Even so, we have 
places where we had to do that detection anyway [3]. I’m sure that Hadoop is 
getting better as the project matures, but I would still advise being careful 
about using various Hadoop APIs and limiting that usage to the extent needed. 
There will obviously be situations where we want to use a Hadoop API to make 
our life simpler, such as reusing their security implementation, and that will 
hopefully be fine.
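
A minimal sketch of the kind of detection a shim layer relies on, added here 
for illustration only. It assumes a shim is chosen by probing for marker 
classes on the class-path. ShimSelector and the shim labels are hypothetical 
names, not Sqoop, Pig or Hive code, and the first marker class name is an 
assumption about where the Hadoop 2.6.0 delegation token classes live.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: pick a shim by probing which Hadoop classes are loadable.
// ShimSelector and the shim labels are illustrative names, not Sqoop APIs.
final class ShimSelector {

    // Marker class -> shim label, checked in insertion order (newest first).
    private static final Map<String, String> MARKERS = new LinkedHashMap<>();
    static {
        // Assumption: this class ships in hadoop-common from 2.6.0 (HADOOP-10771).
        MARKERS.put(
            "org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL",
            "hadoop-2.6-delegation-token");
        // hadoop-common is present, but without the delegation token refactoring.
        MARKERS.put(
            "org.apache.hadoop.security.UserGroupInformation",
            "hadoop-common-kerberos");
    }

    // True if the named class can be located without initializing it.
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, ShimSelector.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the label of the first marker the probe accepts, or "no-hadoop".
    static String select(Predicate<String> probe) {
        for (Map.Entry<String, String> entry : MARKERS.entrySet()) {
            if (probe.test(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "no-hadoop";
    }

    public static void main(String[] args) {
        // Probe the real class-path of the running JVM.
        System.out.println(select(ShimSelector::isClassPresent));
    }
}
```

Passing the probe in as a Predicate keeps the selection logic testable without 
any Hadoop jar on the class-path. The probe only tells the caller which shim to 
load; the shim itself still has to be compiled against the matching Hadoop 
version.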

We can be pretty sure that the Sqoop Server will have Hadoop libraries on the 
class-path, and the concern there was more about introducing backward 
incompatible changes, which is hopefully less important nowadays. Not 
introducing a Hadoop dependency on the client side had a different reason. 
Hadoop common is quite an important jar that has a huge number of dependencies 
- check out the list in its pom file [4]. This is a problem because the Sqoop 
client is meant to be small and easily reusable, whereas depending on Hadoop 
will force the application developer onto certain library versions that are 
dictated by Hadoop (like guava, commons-*). And that forces people to do 
various weird things, such as using custom class loaders to isolate those 
libraries from the main application, which in most cases makes the situation 
even worse, because Hadoop libraries assume “ownership” of the underlying JVM 
and run a lot of eternal threads per class-loader. Hence I would advise being 
doubly careful when introducing a dependency on Hadoop (common) for our client.
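
For readers unfamiliar with the class loader workaround mentioned above, here 
is a minimal sketch of it, assuming the library jars are handed to a 
URLClassLoader whose parent is the bootstrap loader. IsolatedLoader is a 
hypothetical name, and the sketch does nothing about the thread problem 
described in the paragraph.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch: load a library in a class loader that cannot see the
// application's own class-path, so its guava/commons-* versions do not clash.
final class IsolatedLoader {

    // A null parent means only bootstrap (core JDK) classes are delegated
    // upward; everything else must come from the given jar URLs.
    static URLClassLoader create(URL... libraryJars) {
        return new URLClassLoader(libraryJars, null);
    }

    // True if the loader can resolve the named class.
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            loader.loadClass(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // No jars are passed here, so only core JDK classes are visible.
        try (URLClassLoader isolated = create()) {
            System.out.println(canLoad(isolated, "java.lang.String"));
            System.out.println(canLoad(isolated, IsolatedLoader.class.getName()));
        }
    }
}
```

In real use the jar URLs of the isolated library would be passed to create(). 
Each such loader gets its own copy of the library's static state, which is why 
per-class-loader threads multiply.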

I’m wondering what we’re trying to achieve by moving the dependency from 
“provided” to “compile”? Do we want to just ensure that it’s always on the 
Server side or is the intent to get it to the client?

Jarcec

Links:
1: https://github.com/apache/pig/tree/trunk/shims/src
2: https://github.com/apache/hive/tree/trunk/shims
3: https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/hcat/SqoopHCatUtilities.java#L962
4: http://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-common%7C2.6.0%7Cjar

> On Dec 10, 2014, at 7:56 AM, Abraham Elmahrek <[email protected]> wrote:
> 
> Hey guys,
> 
> With the work being done in Sqoop2 involving authentication, there are 
> a few classes that are being used from hadoop auth and eventually 
> hadoop common.
> 
> I'd like to gauge how folks feel about including the hadoop libraries 
> as a "compile" time dependency rather than "provided". The reasons being:
> 
>   1. Hadoop maintains wire compatibility within a major version:
>      http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility
>   2. UserGroupInformation and other useful interfaces are marked as
>      "Evolving" or "Unstable":
>      http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
> 
> I've been looking around and it seems most projects include Hadoop as 
> a compile time dependency:
> 
>   1. Kite - https://github.com/kite-sdk/kite/blob/master/kite-hadoop-dependencies/cdh5/pom.xml
>   2. Flume - https://github.com/apache/flume/blob/trunk/pom.xml
>   3. Oozie - https://github.com/apache/oozie/tree/master/hadooplibs
>   4. hive - https://github.com/apache/hive/blob/trunk/pom.xml#L1067
> 
> IMO wire compatibility is easier to maintain than Java API compatibility.
> There may be features in future Hadoop releases that we'll want to use 
> on the security side as well.
> 
> -Abe
