I'll have to do a bit of experimentation to better understand the packaging and dependencies. If we conditionally make hadoop-common a compile-time requirement in sqoop-core, would this affect the classpath of the other components in the server? And in the dependencyManagement section of the root pom, would it still be marked as provided?
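Something like the following is what I have in mind; the hadoop.version property and the exact module layout are just placeholders, not our actual poms:

<!-- Root pom: dependencyManagement keeps hadoop-common as "provided" by default. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- sqoop-core/pom.xml: redeclaring the dependency with an explicit scope
     overrides the managed "provided" scope for this module only. Because
     "compile" scope is transitive, other server modules that depend on
     sqoop-core would then see hadoop-common on their classpath as well. -->
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <scope>compile</scope>
  </dependency>
</dependencies>

If the conditional part matters, I assume the override could live inside a Maven profile so it only kicks in for Hadoop 2.6.0+ builds.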
-Abe

On Thu, Dec 11, 2014 at 5:50 PM, Jarek Jarcec Cecho <[email protected]> wrote:

> Got it, so the proposal is really to ship Hadoop libraries as part of our distribution (tarball) and not let users configure Sqoop using existing ones. I personally don't feel entirely comfortable doing so as I'm afraid that a lot of trouble will pop up on the way (given my experience), but I'm open to giving it a try. Just to be on the same page, we want to package Hadoop-common with the server only, right? So I'm assuming that the "compile" dependency will be on sqoop-core rather than sqoop-common (which is shared between client and server).
>
> Jarcec
>
>> On Dec 11, 2014, at 3:34 PM, Abraham Elmahrek <[email protected]> wrote:
>>
>> Jarcec,
>>
>> I believe that providing delegation support requires using a class on the server side that is only available in hadoop-common as of Hadoop 2.6.0 [1]. This seems like reason enough to change from "provided" to "compile", given the feature may not exist in previous versions of Hadoop2.
>>
>> Also, requiring that Sqoop2 must be used with Hadoop 2.6.0 or newer doesn't seem like a great idea. It delegates Hadoop version management to the users of Sqoop2, where it might be better handled by devs?
>>
>> 1. https://issues.apache.org/jira/browse/HADOOP-11083
>>
>> On Thu, Dec 11, 2014 at 4:50 PM, Jarek Jarcec Cecho <[email protected]> wrote:
>>>
>>> Nope, not at all Abe. I also feel that client and server changes should be discussed separately, as there are different reasons/concerns for why or why not to introduce Hadoop dependencies there.
>>>
>>> For the server side and for the security portion, I feel that we had a good discussion with Richard a while back and I no longer have concerns about using those APIs. I'll advise caution nevertheless. What are we trying to achieve by changing the scope from "provided" to "compile" here? To the best of my knowledge [1] the difference is only that "provided" means that the dependency is not retrieved and stored in the resulting package and that users have to add it manually after installation. I'm not immediately seeing any impact on the code though.
>>>
>>> Jarcec
>>>
>>> Links:
>>> 1: http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>
>>>> On Dec 11, 2014, at 8:41 AM, Abraham Elmahrek <[email protected]> wrote:
>>>>
>>>> Jarcec,
>>>>
>>>> Sorry to butt in... you make a good point on the client side. Would you mind if we discussed the server side a bit? Re-using the same mechanism on the server side does require "compile" scope dependencies on Hadoop. Would that be ok? Are the concerns mainly around the client?
>>>>
>>>> -Abe
>>>>
>>>> On Thu, Dec 11, 2014 at 10:30 AM, Jarek Jarcec Cecho <[email protected]> wrote:
>>>>>
>>>>> Got it Richard, thank you very much for the nice summary! I'm wondering what the use case is for delegation tokens on the client side? Is it to support integration with Oozie?
>>>>>
>>>>> I do know that Beeline depends on Hadoop common and that is actually a very good example. I've seen a sufficient number of users struggling with this dependency - using various workarounds for the classpath issue, needing to copy over Hadoop configuration files from a real cluster (because otherwise a portion of the security didn't work at all, something with auth_to_local rules) and a lot more. That is why I'm advising being careful here.
>>>>>
>>>>> Jarcec
>>>>>
>>>>>> On Dec 11, 2014, at 12:17 AM, Zhou, Richard <[email protected]> wrote:
>>>>>>
>>>>>> Hi Jarcec:
>>>>>> Thank you very much for your clarification about the history.
>>>>>>
>>>>>> The root cause for why we want to change "provided" to "compile" is to implement "Delegation Token Support" [1], review board [2]. The status in Hadoop is shown below.
>>>>>> Hadoop 2.5.1 or before: all classes used to implement Kerberos support are in the Hadoop-auth component, which depends on only a few non-Hadoop-related libs. And it is added on the Sqoop client side (shell component [3]) as "compile", as we agreed before.
>>>>>> Hadoop 2.6.0: there was a refactoring to support delegation tokens in Hadoop [4]. Most components in Hadoop, such as RM, Httpfs and Kms, have rewritten their authentication mechanism to use delegation tokens. However, all delegation-token-related classes are in Hadoop-common instead of Hadoop-auth, because they use the UserGroupInformation class.
>>>>>>
>>>>>> So if Sqoop needs to support delegation tokens, it has to include the Hadoop-common lib, because I believe that copying code is an unacceptable solution. Even using Hadoop shims, which is a good solution to support different versions of Hadoop (I am +1 on writing a Hadoop shim in Sqoop like Pig, Hive, etc.), Hadoop-common is still a dependency. For example, the client side (beeline) in Hive depends on the Hadoop-common lib [5]. So I don't think it is a big problem to add Hadoop-common in.
>>>>>>
>>>>>> Additionally, I agree with Abe that wire compatibility is another reason to change "provided" to "compile", since it is in the "Unstable" state. There will be a potential problem in the future.
>>>>>>
>>>>>> So I prefer to add the Hadoop-common lib as "compile" to make "Delegation Token Support" happen.
>>>>>>
>>>>>> Add [email protected].
>>>>>>
>>>>>> Links:
>>>>>> 1: https://issues.apache.org/jira/browse/SQOOP-1776
>>>>>> 2: https://reviews.apache.org/r/28795/
>>>>>> 3: https://github.com/apache/sqoop/blob/sqoop2/shell/pom.xml#L75
>>>>>> 4: https://issues.apache.org/jira/browse/HADOOP-10771
>>>>>> 5: https://github.com/apache/hive/blob/trunk/beeline/pom.xml#L133
>>>>>>
>>>>>> Richard
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Jarek Jarcec Cecho [mailto:[email protected]] On Behalf Of Jarek Jarcec Cecho
>>>>>> Sent: Thursday, December 11, 2014 1:43 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: Hadoop as Compile time dependency in Sqoop2
>>>>>>
>>>>>> Hi Abe,
>>>>>> thank you very much for surfacing the question. I think that there are several twists to it, so my apologies as this will be a long answer :)
>>>>>>
>>>>>> When we started working on Sqoop 2 a few years back, we intentionally pushed the Hadoop dependency as far from the shared libraries as possible. The intention was that no component in common or core should depend on or use any Hadoop APIs, and that those uses should be isolated to separate modules (execution/submission engine). The reason for that is that Hadoop doesn't have a particularly good track record of keeping backward compatibility and it has bitten a lot of projects in the past. For example, every single project that I know of that is using MR needs to have a shim layer that deals with the API differences (Pig [1], Hive [2], ...). The only exception to this that I'm aware of is Sqoop 1, and the only reason we did not have to introduce shims there is that we (shamelessly) copied code from Hadoop into our own code base. We still have places where we had to do that detection nevertheless [3]. I'm sure that Hadoop is getting better as the project matures, but I would still advise being careful about using various Hadoop APIs and limiting that usage to the extent needed. There will obviously be situations where we want to use Hadoop APIs to make our lives simpler, such as reusing their security implementation, and that will hopefully be fine.
>>>>>>
>>>>>> Whereas we can be pretty sure that the Sqoop Server will have Hadoop libraries on the classpath (and the concern there was more about introducing backward-incompatible changes, which is hopefully less important nowadays), not introducing a Hadoop dependency on the client side had a different reason. Hadoop common is quite an important jar that has a huge number of dependencies - check out the list in its pom file [4]. This is a problem because the Sqoop client is meant to be small and easily reusable, whereas depending on Hadoop will force the application developer onto certain library versions that are dictated by Hadoop (like guava, commons-*). And that forces people to do various weird things such as using custom class loaders to isolate those libraries from the main application, making the situation in most cases even worse, because the Hadoop libraries assume "ownership" of the underlying JVM and run a lot of eternal threads per class loader. Hence I would advise being doubly careful when introducing a dependency on Hadoop (common) for our client.
>>>>>>
>>>>>> I'm wondering what we're trying to achieve by moving the dependency from "provided" to "compile"? Do we want to just ensure that it's always on the Server side or is the intent to get it to the client?
>>>>>>
>>>>>> Jarcec
>>>>>>
>>>>>> Links:
>>>>>> 1: https://github.com/apache/pig/tree/trunk/shims/src
>>>>>> 2: https://github.com/apache/hive/tree/trunk/shims
>>>>>> 3: https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/hcat/SqoopHCatUtilities.java#L962
>>>>>> 4: http://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-common%7C2.6.0%7Cjar
>>>>>>
>>>>>>> On Dec 10, 2014, at 7:56 AM, Abraham Elmahrek <[email protected]> wrote:
>>>>>>>
>>>>>>> Hey guys,
>>>>>>>
>>>>>>> With the work being done in Sqoop2 involving authentication, there are a few classes that are being used from hadoop auth and eventually hadoop common.
>>>>>>>
>>>>>>> I'd like to gauge how folks feel about including the hadoop libraries as a "compile" time dependency rather than "provided". The reasons being:
>>>>>>>
>>>>>>> 1. Hadoop maintains wire compatibility within a major version: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility
>>>>>>> 2. UserGroupInformation and other useful interfaces are marked as "Evolving" or "Unstable": http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
>>>>>>>
>>>>>>> I've been looking around and it seems most projects include Hadoop as a compile time dependency:
>>>>>>>
>>>>>>> 1. Kite - https://github.com/kite-sdk/kite/blob/master/kite-hadoop-dependencies/cdh5/pom.xml
>>>>>>> 2. Flume - https://github.com/apache/flume/blob/trunk/pom.xml
>>>>>>> 3. Oozie - https://github.com/apache/oozie/tree/master/hadooplibs
>>>>>>> 4. Hive - https://github.com/apache/hive/blob/trunk/pom.xml#L1067
>>>>>>>
>>>>>>> IMO wire compatibility is easier to maintain than Java API compatibility. There may be features in future Hadoop releases that we'll want to use on the security side as well.
>>>>>>>
>>>>>>> -Abe
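P.S. To make the server-side option discussed above a bit more concrete, here is a rough sketch of the kind of pom change it would imply; the exclusions, the module it lands in, and the hadoop.version property are placeholders rather than a vetted proposal:

<!-- Sketch only: if hadoop-common does become a "compile" dependency on the
     server side, its large transitive dependency tree (guava, commons-*, etc.)
     could be trimmed with exclusions so we don't force those versions onto
     everything that depends on the module. The exclusions below are
     illustrative, not a vetted list. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
  <scope>compile</scope>
  <exclusions>
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-httpclient</groupId>
      <artifactId>commons-httpclient</artifactId>
    </exclusion>
  </exclusions>
</dependency>

Whether trimming transitive dependencies like this is worth the maintenance cost is exactly the kind of thing I'd want to experiment with first.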
