My apologies if my previous email was vague. I do agree with the current approach of including the Hadoop 2.6.0 jars in our distribution.
I just want us to be careful, that’s all :)

Jarcec

> On Dec 12, 2014, at 5:55 PM, Abraham Elmahrek <[email protected]> wrote:
>
> Hey Richard,
>
> I think Jarcec is agreeing here and that it's worth trying out. Let's move
> forward with the current design?
>
> -Abe
>
> On Thu, Dec 11, 2014 at 9:13 PM, Zhou, Richard <[email protected]>
> wrote:
>>
>> Hi Jarcec & Abe:
>>
>> Thank you for your nice clarification. I have several opinions about it.
>>
>> 1. As the Hadoop dependency is “provided” in the Sqoop server, and the
>> real classpath is set in catalina.properties, how do we avoid
>> compatibility mistakes? Let’s say the Sqoop server is built successfully
>> with Hadoop 2.5.1, as shown in the root pom.xml, whilst the real cluster
>> is Hadoop 2.5.0. Even though there are only minor changes between these
>> two minor releases, there may still be some unexpected exceptions.
>>
>> 2. If “compile” is used on both the client and server side, wire
>> compatibility is confirmed for the authentication communication between
>> client and server. But another compatibility issue surfaces: the Sqoop
>> server depends on two different versions of Hadoop-common, 2.6.0 from
>> “compile” and 2.5.0 (the real cluster version) from the classpath in
>> catalina.properties.
>>
>> 3. As Abe said, if we use “compile” partially (on the client side) and
>> “provided” partially (on the server side), I agree that there must be
>> some wire compatibility issues.
>>
>> 4. The best solution to resolve all compatibility issues is to use
>> “provided” on both the client and server side, with the Hadoop-common lib
>> from the classpath of the real cluster, which must be 2.6.0 or later.
>> However, it is impossible to make all users use Hadoop 2.6.0 or later.
>>
>> So I am wondering whether it is a little rushed to support delegation
>> tokens currently, since it is the latest feature in Hadoop 2.6.0, whilst
>> Sqoop only supports 2.5.1.
>> Although Hadoop 2.6.0 will be supported sooner or later, Sqoop must also
>> support Hadoop 2.5.1 or earlier for a long time. Maybe we should re-open
>> delegation token support in the Hadoop 3.* period, as delegation tokens
>> should be supported by that time. As for the Kerberos support task
>> (SQOOP-1525), it could be finished once the doAs function is completed.
>> Actually this code is ready, and the reason I have not uploaded it for
>> review is that I think delegation tokens are a better solution to handle
>> this. There is no need to commit the doAs code and then rewrite it with
>> delegation tokens. As for delegation token support, it could be filed as
>> an improvement to Kerberos support.
>>
>> Richard
>>
>> *From:* Abraham Elmahrek [mailto:[email protected]]
>> *Sent:* Friday, December 12, 2014 8:14 AM
>> *To:* [email protected]
>> *Cc:* Zhou, Richard
>> *Subject:* Re: Hadoop as Compile time dependency in Sqoop2
>>
>> I'll have to do a bit of experimentation to better understand packaging
>> and dependencies. If we make hadoop-common a compile time requirement
>> conditionally in sqoop-core, this should affect the classpath of the
>> other components in the server? In the DependencyManagement section of
>> the root pom, it would still be marked as provided?
>>
>> -Abe
>>
>> On Thu, Dec 11, 2014 at 5:50 PM, Jarek Jarcec Cecho <[email protected]>
>> wrote:
>>
>> Got it, so the proposal is really to ship Hadoop libraries as part of our
>> distribution (tarball) and not let users configure Sqoop using existing
>> ones. I personally don’t feel entirely comfortable doing so as I’m afraid
>> that a lot of trouble will pop up along the way (given my experience),
>> but I’m open to giving it a try. Just to be on the same page, we want to
>> package Hadoop-common with the server only, right? So I’m assuming that
>> the “compile” dependency will be on sqoop-core rather than sqoop-common
>> (that is shared between client and server).
>>
>> Jarcec
>>
>>> On Dec 11, 2014, at 3:34 PM, Abraham Elmahrek <[email protected]> wrote:
>>>
>>> Jarcec,
>>>
>>> I believe that providing delegation support requires using a class on
>>> the server side that is only available in hadoop-common as of Hadoop
>>> 2.6.0 [1]. This seems like reason enough to change from "provided" to
>>> "compile" given the feature may not exist in previous versions of
>>> Hadoop2.
>>>
>>> Also, requiring that Sqoop2 must be used with Hadoop 2.6.0 or newer
>>> doesn't seem like a great idea. It delegates hadoop version management
>>> to the users of Sqoop2, where it might be better to be handled by devs?
>>>
>>> 1. https://issues.apache.org/jira/browse/HADOOP-11083
>>>
>>> On Thu, Dec 11, 2014 at 4:50 PM, Jarek Jarcec Cecho <[email protected]>
>>> wrote:
>>>>
>>>> Nope not at all Abe, I also feel that client and server changes should
>>>> be discussed separately as there are different reasons/concerns for why
>>>> or why not to introduce Hadoop dependencies there.
>>>>
>>>> For the server side and for the security portion, I feel that we had a
>>>> good discussion with Richard a while back and I no longer have concerns
>>>> about using those APIs. I’ll advise caution nevertheless. What are we
>>>> trying to achieve by changing the scope from “provided” to “compile”
>>>> here? To the best of my knowledge [1] the difference is only that
>>>> “provided” means that the dependency is not retrieved and stored in the
>>>> resulting package and that users have to add it manually after
>>>> installation. I’m not immediately seeing any impact on the code though.
>>>>
>>>> Jarcec
>>>>
>>>> Links:
>>>> 1: http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>
>>>>> On Dec 11, 2014, at 8:41 AM, Abraham Elmahrek <[email protected]> wrote:
>>>>>
>>>>> Jarcec,
>>>>>
>>>>> Sorry to butt in... you make a good point on the client side.
>>>>> Would you mind if we discussed the server side a bit? Re-using the
>>>>> same mechanism on the server side does require "compile" scope
>>>>> dependencies on Hadoop. Would that be ok? Are the concerns mainly
>>>>> around the client?
>>>>>
>>>>> -Abe
>>>>>
>>>>> On Thu, Dec 11, 2014 at 10:30 AM, Jarek Jarcec Cecho
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Got it Richard, thank you very much for the nice summary! I’m
>>>>>> wondering what the use case is for delegation tokens on the client
>>>>>> side? Is it to support integration with Oozie?
>>>>>>
>>>>>> I do know that Beeline depends on Hadoop common and that is actually
>>>>>> a very good example. I’ve seen a sufficient number of users
>>>>>> struggling with this dependency: using various workarounds for the
>>>>>> classpath issue, needing to copy over Hadoop configuration files from
>>>>>> the real cluster (because otherwise a portion of the security didn’t
>>>>>> work at all, something with auth_to_local rules) and a lot more. That
>>>>>> is why I’m advising being careful here.
>>>>>>
>>>>>> Jarcec
>>>>>>
>>>>>>> On Dec 11, 2014, at 12:17 AM, Zhou, Richard <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Jarcec:
>>>>>>> Thank you very much for your clarification about the history.
>>>>>>>
>>>>>>> The root reason why we want to change "provided" to "compile" is to
>>>>>>> implement "Delegation Token Support" [1] (review board [2]). The
>>>>>>> status in Hadoop is shown below.
>>>>>>>
>>>>>>> Hadoop 2.5.1 or before: all classes used to implement Kerberos
>>>>>>> support are in the Hadoop-auth component, which depends on only a
>>>>>>> few libs, none of them Hadoop-related. It is added on the Sqoop
>>>>>>> client side (shell component [3]) as "compile", as we agreed before.
>>>>>>>
>>>>>>> Hadoop 2.6.0: There is a refactoring to support delegation tokens in
>>>>>>> Hadoop [4].
>>>>>>> Most components in Hadoop, such as RM, Httpfs and Kms, have
>>>>>>> rewritten their authentication mechanism to use delegation tokens.
>>>>>>> However, all delegation token related classes are in Hadoop-common
>>>>>>> instead of Hadoop-auth, because they use the UserGroupInformation
>>>>>>> class.
>>>>>>>
>>>>>>> So if Sqoop needs to support delegation tokens, it has to include
>>>>>>> the Hadoop-common lib, because I believe that copying code is an
>>>>>>> unacceptable solution. Even when using Hadoop shims, which is a good
>>>>>>> solution to support different versions of Hadoop (I am +1 on writing
>>>>>>> Hadoop shims in Sqoop, like Pig, Hive etc.), Hadoop-common is also a
>>>>>>> dependency. For example, the client side (beeline) in Hive depends
>>>>>>> on the Hadoop-common lib [5]. So I don't think it is a big problem
>>>>>>> to add Hadoop-common in.
>>>>>>>
>>>>>>> Additionally, I agree with Abe that wire compatibility is another
>>>>>>> reason to change "provided" to "compile", since it is in the
>>>>>>> "Unstable" state. There will be a potential problem in the future.
>>>>>>>
>>>>>>> So I prefer to add the Hadoop-common lib as "compile" to make
>>>>>>> "Delegation Token Support" happen.
>>>>>>>
>>>>>>> Add [email protected].
>>>>>>>
>>>>>>> Links:
>>>>>>> 1: https://issues.apache.org/jira/browse/SQOOP-1776
>>>>>>> 2: https://reviews.apache.org/r/28795/
>>>>>>> 3: https://github.com/apache/sqoop/blob/sqoop2/shell/pom.xml#L75
>>>>>>> 4: https://issues.apache.org/jira/browse/HADOOP-10771
>>>>>>> 5: https://github.com/apache/hive/blob/trunk/beeline/pom.xml#L133
>>>>>>>
>>>>>>> Richard
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Jarek Jarcec Cecho [mailto:[email protected]] On Behalf Of
>>>>>>> Jarek Jarcec Cecho
>>>>>>> Sent: Thursday, December 11, 2014 1:43 PM
>>>>>>> To: [email protected]
>>>>>>> Subject: Re: Hadoop as Compile time dependency in Sqoop2
>>>>>>>
>>>>>>> Hi Abe,
>>>>>>> thank you very much for surfacing the question.
>>>>>>> I think that there are several twists to it, so my apologies as this
>>>>>>> will be a long answer :)
>>>>>>>
>>>>>>> When we started working on Sqoop 2 a few years back, we
>>>>>>> intentionally pushed the Hadoop dependency as far from shared
>>>>>>> libraries as possible. The intention was that no component in common
>>>>>>> or core should depend on or use any Hadoop APIs and that those
>>>>>>> should be isolated to separate modules (execution/submission
>>>>>>> engine). The reason for that is that Hadoop doesn’t have a
>>>>>>> particularly good track record of keeping backward compatibility and
>>>>>>> it has bitten a lot of projects in the past. For example every
>>>>>>> single project that I know of that is using MR needs to have a shim
>>>>>>> layer that deals with the API differences (Pig [1], Hive [2], …).
>>>>>>> The only exception to this that I’m aware of is Sqoop 1, where we
>>>>>>> did not have to introduce shims only because we (shamelessly) copied
>>>>>>> code from Hadoop to our own code base. Even so, we have places where
>>>>>>> we had to do that detection [3]. I’m sure that Hadoop is getting
>>>>>>> better as the project matures, but I would still advise being
>>>>>>> careful about using various Hadoop APIs and limiting that usage to
>>>>>>> the extent needed. There will obviously be situations where we want
>>>>>>> to use a Hadoop API to make our life simpler, such as reusing their
>>>>>>> security implementation, and that will hopefully be fine.
>>>>>>>
>>>>>>> Whereas we can be pretty sure that the Sqoop Server will have Hadoop
>>>>>>> libraries on the class-path, and the concern there was more about
>>>>>>> introducing backward incompatible changes, which is hopefully less
>>>>>>> important nowadays, not introducing a Hadoop dependency on the
>>>>>>> client side had a different reason. Hadoop common is quite an
>>>>>>> important jar that has a huge number of dependencies - check out the
>>>>>>> list in its pom file [4].
>>>>>>> This is a problem because the Sqoop client is meant to be small and
>>>>>>> easily reusable, whereas depending on Hadoop will force the
>>>>>>> application developer onto certain library versions that are
>>>>>>> dictated by Hadoop (like guava, commons-*). And that forces people
>>>>>>> to do various weird things such as using custom class loaders to
>>>>>>> isolate those libraries from the main application, in most cases
>>>>>>> making the situation even worse, because Hadoop libraries assume
>>>>>>> “ownership” of the underlying JVM and run a lot of eternal threads
>>>>>>> per class-loader. Hence I would advise being doubly careful when
>>>>>>> introducing a dependency on Hadoop (common) for our client.
>>>>>>>
>>>>>>> I’m wondering what we’re trying to achieve by moving the dependency
>>>>>>> from “provided” to “compile”? Do we want to just ensure that it’s
>>>>>>> always on the Server side or is the intent to get it to the client?
>>>>>>>
>>>>>>> Jarcec
>>>>>>>
>>>>>>> Links:
>>>>>>> 1: https://github.com/apache/pig/tree/trunk/shims/src
>>>>>>> 2: https://github.com/apache/hive/tree/trunk/shims
>>>>>>> 3: https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/hcat/SqoopHCatUtilities.java#L962
>>>>>>> 4: http://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-common%7C2.6.0%7Cjar
>>>>>>>
>>>>>>>> On Dec 10, 2014, at 7:56 AM, Abraham Elmahrek <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hey guys,
>>>>>>>>
>>>>>>>> With the work being done in Sqoop2 involving authentication, there
>>>>>>>> are a few classes that are being used from hadoop auth and
>>>>>>>> eventually hadoop common.
>>>>>>>>
>>>>>>>> I'd like to gauge how folks feel about including the hadoop
>>>>>>>> libraries as a "compile" time dependency rather than "provided".
>>>>>>>> The reasons being:
>>>>>>>>
>>>>>>>> 1. Hadoop maintains wire compatibility within a major version:
>>>>>>>>    http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility
>>>>>>>> 2. UserGroupInformation and other useful interfaces are marked as
>>>>>>>>    "Evolving" or "Unstable":
>>>>>>>>    http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
>>>>>>>>
>>>>>>>> I've been looking around and it seems most projects include Hadoop
>>>>>>>> as a compile time dependency:
>>>>>>>>
>>>>>>>> 1. Kite - https://github.com/kite-sdk/kite/blob/master/kite-hadoop-dependencies/cdh5/pom.xml
>>>>>>>> 2. Flume - https://github.com/apache/flume/blob/trunk/pom.xml
>>>>>>>> 3. Oozie - https://github.com/apache/oozie/tree/master/hadooplibs
>>>>>>>> 4. hive - https://github.com/apache/hive/blob/trunk/pom.xml#L1067
>>>>>>>>
>>>>>>>> IMO wire compatibility is easier to maintain than Java API
>>>>>>>> compatibility. There may be features in future Hadoop releases that
>>>>>>>> we'll want to use on the security side as well.
>>>>>>>>
>>>>>>>> -Abe
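
A recurring concern in this thread is that the Hadoop version Sqoop is built against may differ from the one that ends up on the server's runtime classpath. A minimal sketch of a runtime probe for that situation follows. It checks by reflection whether a class is present before relying on it, which is the kind of detection mentioned for Sqoop 1. HadoopFeatureProbe is a hypothetical helper and not part of Sqoop, and the Hadoop class name used here is an assumption about where the 2.6.0 delegation token code lives; verify it against the actual hadoop-common jar.

```java
// Hypothetical helper, not part of Sqoop: probes the runtime classpath for a
// class instead of assuming the Hadoop version the build was compiled against.
final class HadoopFeatureProbe {

    // Assumption: a hadoop-common class introduced by the 2.6.0 delegation
    // token refactoring. Check the name against your Hadoop jars.
    static final String DELEGATION_TOKEN_CLASS =
        "org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL";

    private HadoopFeatureProbe() {}

    /** Returns true if the named class can be located; it is not initialized. */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, HadoopFeatureProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class that cannot be linked on this classpath.
            return false;
        }
    }

    public static void main(String[] args) {
        boolean supported = isClassAvailable(DELEGATION_TOKEN_CLASS);
        System.out.println("Delegation token classes on classpath: " + supported);
    }
}
```

With a check like this, a server could fall back to plain Kerberos handling when the class is absent rather than failing at class-load time. Whether such a fallback is desirable is a design question the thread leaves open.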
