Re: Hadoop as Compile time dependency in Sqoop2

Abraham Elmahrek Thu, 11 Dec 2014 08:43:02 -0800

Jarcec,

Sorry to butt in... you make a good point on the client side. Would you mind
if we discussed the server side a bit? Re-using the same mechanism on the
server side does require "compile" scope dependencies on Hadoop. Would that
be ok? Are the concerns mainly around the client?


-Abe

On Thu, Dec 11, 2014 at 10:30 AM, Jarek Jarcec Cecho <[email protected]>
wrote:

> Got it Richard, thank you very much for the nice summary! I’m wondering
> what the use case is for delegation tokens on the client side? Is it to
> support integration with Oozie?
>
> I do know that Beeline depends on Hadoop common and that is actually
> a very good example. I’ve seen a sufficient number of users struggling with
> this dependency: using various workarounds for the classpath issue, needing
> to copy Hadoop configuration files over from a real cluster (because
> otherwise a portion of the security didn’t work at all, something to do with
> auth_to_local rules) and a lot more. That is why I’m advising being
> careful here.
>
> Jarcec
>
> > On Dec 11, 2014, at 12:17 AM, Zhou, Richard <[email protected]>
> wrote:
> >
> > Hi Jarcec:
> > Thank you very much for your clarification about the history.
> >
> > The root cause for why we want to change "provided" to "compile" is to
> > implement "Delegation Token Support" [1], review board [2]. The status in
> > Hadoop is shown below.
> > Hadoop 2.5.1 or before: all classes used to implement Kerberos support
> > are in the Hadoop-auth component, which depends on only a few
> > non-Hadoop libraries. It was added to the Sqoop client side (shell
> > component [3]) as "compile", as we agreed before.
> > Hadoop 2.6.0: There was a refactoring to support delegation tokens in
> > Hadoop [4]. Most components in Hadoop, such as RM, HttpFS and KMS, have
> > rewritten their authentication mechanism to use delegation tokens. However,
> > all delegation token related classes are in Hadoop-common instead of
> > Hadoop-auth, because they use the UserGroupInformation class.
> >
> > So if Sqoop needs to support delegation tokens, it has to include the
> > Hadoop-common lib, because I believe that copying code is an unacceptable
> > solution. Even with Hadoop shims, which are a good solution for supporting
> > different versions of Hadoop (I am +1 on writing Hadoop shims in Sqoop
> > like Pig, Hive etc.), Hadoop-common is still a dependency. For example,
> > the client side (Beeline) in Hive depends on the Hadoop-common lib [5]. So I
> > don't think it is a big problem to add Hadoop-common.
> >
> > Additionally, I agree with Abe that wire compatibility is another reason
> > to change "provided" to "compile", since it is in the "Unstable" state.
> > That could become a problem in the future.
> >
> > So I prefer to add Hadoop-common lib as "compile" to make "Delegation
> Token Support" happen.
> >
> > Add [email protected].
> >
> > Links:
> > 1: https://issues.apache.org/jira/browse/SQOOP-1776
> > 2: https://reviews.apache.org/r/28795/
> > 3: https://github.com/apache/sqoop/blob/sqoop2/shell/pom.xml#L75
> > 4: https://issues.apache.org/jira/browse/HADOOP-10771
> > 5: https://github.com/apache/hive/blob/trunk/beeline/pom.xml#L133
> >
> > Richard
> >
> > -----Original Message-----
> > From: Jarek Jarcec Cecho [mailto:[email protected]] On Behalf Of Jarek
> Jarcec Cecho
> > Sent: Thursday, December 11, 2014 1:43 PM
> > To: [email protected]
> > Subject: Re: Hadoop as Compile time dependency in Sqoop2
> >
> > Hi Abe,
> > thank you very much for surfacing the question. I think that there are
> > several twists to it, so my apologies as this will be a long answer :)
> >
> > When we started working on Sqoop 2 a few years back, we
> > intentionally pushed the Hadoop dependency as far from shared libraries as
> > possible. The intention was that no component in common or core should
> > depend on or use any Hadoop APIs, and that those should be isolated to
> > separate modules (execution/submission engine). The reason for that is that
> > Hadoop doesn’t have a particularly good track record of keeping backward
> > compatibility and it has bitten a lot of projects in the past. For example,
> > every single project that I know of that is using MR needs to have a shim
> > layer that deals with the API differences (Pig [1], Hive [2], …). The only
> > exception to this that I’m aware of is Sqoop 1, where we did not have to
> > introduce shims only because we (shamelessly) copied code from Hadoop to
> > our own code base. Even so, we have places where we had to do that
> > detection anyway [3]. I’m sure that Hadoop is getting better as the
> > project matures, but I would still advise being careful about using various
> > Hadoop APIs and limiting that usage to the extent needed. There will
> > obviously be situations where we want to use a Hadoop API to make our life
> > simpler, such as reusing their security implementation, and that will
> > hopefully be fine.
> >
> > We can be pretty sure that the Sqoop Server will have Hadoop
> > libraries on the class-path, and the concern there was more about
> > introducing backward incompatible changes, which is hopefully less important
> > nowadays. Not introducing a Hadoop dependency on the client side had a
> > different reason. Hadoop common is quite an important jar that has a huge
> > number of dependencies - check out the list in its pom file [4]. This is a
> > problem because the Sqoop client is meant to be small and easily reusable,
> > whereas depending on Hadoop will force the application developer onto certain
> > library versions that are dictated by Hadoop (like guava, commons-*). And
> > that forces people to do various weird things such as using custom class
> > loaders to isolate those libraries from the main application, which in most
> > cases makes the situation even worse, because Hadoop libraries assume
> > “ownership” of the underlying JVM and run a lot of eternal threads per
> > class-loader. Hence I would advise being doubly careful when introducing a
> > dependency on Hadoop (common) for our client.
> >
> > I’m wondering what we’re trying to achieve by moving the dependency from
> “provided” to “compile”? Do we want to just ensure that it’s always on the
> Server side or is the intent to get it to the client?
> >
> > Jarcec
> >
> > Links:
> > 1: https://github.com/apache/pig/tree/trunk/shims/src
> > 2: https://github.com/apache/hive/tree/trunk/shims
> > 3: https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/hcat/SqoopHCatUtilities.java#L962
> > 4: http://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-common%7C2.6.0%7Cjar
> >
> >> On Dec 10, 2014, at 7:56 AM, Abraham Elmahrek <[email protected]> wrote:
> >>
> >> Hey guys,
> >>
> >> With the work being done in Sqoop2 involving authentication, there are
> >> a few classes that are being used from hadoop auth and eventually
> >> hadoop common.
> >>
> >> I'd like to gauge how folks feel about including the hadoop libraries
> >> as a "compile" time dependency rather than "provided". The reasons
> being:
> >>
> >>  1. Hadoop maintains wire compatibility within a major version:
> >>  http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility
> >>  2. UserGroupInformation and other useful interfaces are marked as
> >>  "Evolving" or "Unstable":
> >>  http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
> >>
> >> I've been looking around and it seems most projects include Hadoop as
> >> a compile time dependency:
> >>
> >>  1. Kite - https://github.com/kite-sdk/kite/blob/master/kite-hadoop-dependencies/cdh5/pom.xml
> >>  2. Flume - https://github.com/apache/flume/blob/trunk/pom.xml
> >>  3. Oozie - https://github.com/apache/oozie/tree/master/hadooplibs
> >>  4. hive - https://github.com/apache/hive/blob/trunk/pom.xml#L1067
> >>
> >> IMO wire compatibility is easier to maintain than Java API
> compatibility.
> >> There may be features in future Hadoop releases that we'll want to use
> >> on the security side as well.
> >>
> >> -Abe
> >
>
>
