My apologies if my previous email was vague; I do indeed agree with the
current approach of including the Hadoop 2.6.0 jars in our distribution.

I just want us to be careful, that’s all :)

Jarcec

> On Dec 12, 2014, at 5:55 PM, Abraham Elmahrek <[email protected]> wrote:
> 
> Hey Richard,
> 
> I think Jarcec is agreeing here and that it's worth trying out. Let's move
> forward with the current design?
> 
> -Abe
> 
> On Thu, Dec 11, 2014 at 9:13 PM, Zhou, Richard <[email protected]>
> wrote:
>> 
>> Hi Jarcec & Abe:
>> 
>> Thank you for the nice clarification. I have a few thoughts about it.
>> 
>> 
>> 
>> 1.       Since the Hadoop dependency is “provided” in the Sqoop server and
>> the real classpath is set in catalina.properties, how do we avoid
>> compatibility mistakes? Say the Sqoop server builds successfully against
>> Hadoop 2.5.1 (as declared in the root pom.xml) while the real cluster runs
>> Hadoop 2.5.0. Even though there are only minor changes between these two
>> minor releases, unexpected exceptions could still occur.
>> 
>> 2.       If “compile” is used on both the client and server side, wire
>> compatibility is ensured for the authentication communication between
>> client and server. But another compatibility issue surfaces: the Sqoop
>> server then depends on two different versions of hadoop-common, 2.6.0 from
>> “compile” and 2.5.0 (the real cluster version) from the classpath in
>> catalina.properties.
>> 
>> 3.       As Abe said, if we use “compile” partially (on the client side)
>> and “provided” partially (on the server side), I agree there are bound to
>> be wire compatibility issues.
>> 
>> 4.       The best way to resolve all of these compatibility issues is to
>> use “provided” on both the client and server side, with the hadoop-common
>> lib coming from the real cluster’s classpath, which would then have to be
>> 2.6.0 or later. However, we cannot force all users onto Hadoop 2.6.0 or
>> later.
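To illustrate the scope trade-off discussed in points 1-4 above, here is a
sketch of how the two options would look in a root pom.xml. This is only an
illustration; the module layout and version values are assumptions, not the
actual Sqoop build files:

```xml
<!-- Hypothetical dependencyManagement fragment contrasting the two scopes. -->
<dependencyManagement>
  <dependencies>
    <!-- Option A: "provided" - the jar is only on the compile classpath;
         at runtime the server picks up whatever hadoop-common version the
         real cluster puts on the classpath (e.g. via catalina.properties). -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.0</version>
      <scope>provided</scope>
    </dependency>
    <!-- Option B: "compile" - the 2.6.0 jar would be bundled with the
         distribution, and may conflict with the cluster's own version:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.0</version>
      <scope>compile</scope>
    </dependency>
    -->
  </dependencies>
</dependencyManagement>
```

The conflict in point 2 is exactly the case where Option B's bundled 2.6.0
jar shadows (or is shadowed by) the cluster's 2.5.0 jar on the server
classpath.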
>> 
>> 
>> 
>> So I am wondering: is it a little rushed to support delegation tokens
>> right now? The feature only appeared in Hadoop 2.6.0, while Sqoop only
>> supports 2.5.1. Even if Hadoop 2.6.0 is supported in the near future, Sqoop
>> must also support Hadoop 2.5.1 and earlier for a long time. Maybe we should
>> revisit delegation token support in the Hadoop 3.* timeframe, when
>> delegation tokens can be assumed to be available. As for the Kerberos
>> support task (SQOOP-1525), it can be finished once the doAs function is
>> completed. That code is actually ready; the reason I have not uploaded it
>> for review is that I thought delegation tokens were a better way to handle
>> this, and there was no need to commit the doAs code only to rewrite it with
>> delegation tokens later. Delegation token support could then be filed as an
>> improvement to the Kerberos support.
>> 
>> 
>> 
>> Richard
>> 
>> 
>> 
>> *From:* Abraham Elmahrek [mailto:[email protected]]
>> *Sent:* Friday, December 12, 2014 8:14 AM
>> *To:* [email protected]
>> *Cc:* Zhou, Richard
>> 
>> *Subject:* Re: Hadoop as Compile time dependency in Sqoop2
>> 
>> 
>> 
>> I'll have to do a bit of experimentation to better understand packaging
>> and dependencies. If we make hadoop-common a compile-time requirement
>> conditionally in sqoop-core, would this affect the classpath of the other
>> components in the server? And in the dependencyManagement section of the
>> root pom, would it still be marked as provided?
>> 
>> 
>> 
>> -Abe
>> 
>> 
>> 
>> On Thu, Dec 11, 2014 at 5:50 PM, Jarek Jarcec Cecho <[email protected]>
>> wrote:
>> 
>> Got it, so the proposal is really to ship Hadoop libraries as part of our
>> distribution (tarball) rather than letting users configure Sqoop against
>> existing ones. I personally don’t feel entirely comfortable doing so, as
>> I’m afraid that a lot of trouble will pop up along the way (given my
>> experience), but I’m open to giving it a try. Just to be on the same page,
>> we want to package hadoop-common with the server only, right? So I’m
>> assuming that the “compile” dependency will be on sqoop-core rather than
>> sqoop-common (which is shared between client and server).
>> 
>> Jarcec
>> 
>> 
>>> On Dec 11, 2014, at 3:34 PM, Abraham Elmahrek <[email protected]> wrote:
>>> 
>>> Jarcec,
>>> 
>>> I believe that providing delegation token support requires using a class
>>> on the server side that is only available in hadoop-common as of Hadoop
>>> 2.6.0 [1]. This seems like reason enough to change from "provided" to
>>> "compile", given the feature may not exist in previous versions of
>>> Hadoop 2.
>>> 
>>> Also, requiring that Sqoop2 must be used with Hadoop 2.6.0 or newer
>>> doesn't seem like a great idea. It delegates Hadoop version management to
>>> the users of Sqoop2, where it might be better handled by the devs.
>>> 
>>> 1. https://issues.apache.org/jira/browse/HADOOP-11083
>>> 
>>> On Thu, Dec 11, 2014 at 4:50 PM, Jarek Jarcec Cecho <[email protected]>
>>> wrote:
>>>> 
>>>> Nope, not at all Abe. I also feel that client and server changes should
>>>> be discussed separately, as there are different reasons/concerns for why
>>>> or why not to introduce Hadoop dependencies there.
>>>> 
>>>> For the server side and for the security portion, I feel that we had a
>>>> good discussion with Richard a while back and I no longer have concerns
>>>> about using those APIs. I’ll advise caution nevertheless. What are we
>>>> trying to achieve by changing the scope from “provided” to “compile”
>>>> here? To my best knowledge [1] the only difference is that “provided”
>>>> means the dependency is not retrieved and stored in the resulting
>>>> package, so users have to add it manually after installation. I’m not
>>>> immediately seeing any impact on the code, though.
>>>> 
>>>> Jarcec
>>>> 
>>>> Links:
>>>> 1:
>>>> 
>> http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>> 
>>>>> On Dec 11, 2014, at 8:41 AM, Abraham Elmahrek <[email protected]>
>> wrote:
>>>>> 
>>>>> Jarcec,
>>>>> 
>>>>> Sorry to butt in... you make a good point on the client side. Would you
>>>>> mind if we discussed the server side a bit? Re-using the same mechanism
>>>>> on the server side does require "compile"-scope dependencies on Hadoop.
>>>>> Would that be OK? Are the concerns mainly around the client?
>>>>> 
>>>>> -Abe
>>>>> 
>>>>> On Thu, Dec 11, 2014 at 10:30 AM, Jarek Jarcec Cecho <
>> [email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Got it Richard, thank you very much for the nice summary! I’m
>>>>>> wondering what the use case for delegation tokens on the client side
>>>>>> is. Is it to support integration with Oozie?
>>>>>> 
>>>>>> I do know that Beeline depends on hadoop-common, and that is actually
>>>>>> a very good example. I’ve seen a sufficient number of users struggling
>>>>>> with this dependency: using various workarounds for the classpath
>>>>>> issue, needing to copy over Hadoop configuration files from the real
>>>>>> cluster (because otherwise portions of the security didn’t work at
>>>>>> all, something with auth_to_local rules), and a lot more. That is why
>>>>>> I’m advising caution here.
>>>>>> 
>>>>>> Jarcec
>>>>>> 
>>>>>>> On Dec 11, 2014, at 12:17 AM, Zhou, Richard <[email protected]>
>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi Jarcec:
>>>>>>> Thank you very much for your clarification about the history.
>>>>>>> 
>>>>>>> The root cause of why we want to change "provided" to "compile" is to
>>>>>>> implement "Delegation Token Support" [1] (review board [2]). The
>>>>>>> status in Hadoop is as follows:
>>>>>>> Hadoop 2.5.1 or before: all classes used to implement Kerberos
>>>>>>> support are in the hadoop-auth component, which depends only on a few
>>>>>>> non-Hadoop libs. It is already added on the Sqoop client side (shell
>>>>>>> component [3]) as "compile", as we agreed before.
>>>>>>> Hadoop 2.6.0: there was a refactoring to support delegation tokens in
>>>>>>> Hadoop [4]. Most components in Hadoop, such as the RM, Httpfs and
>>>>>>> KMS, have rewritten their authentication mechanism to use delegation
>>>>>>> tokens. However, all delegation token related classes are in
>>>>>>> hadoop-common instead of hadoop-auth, because they use the
>>>>>>> UserGroupInformation class.
>>>>>>> 
>>>>>>> So if Sqoop needs to support delegation tokens, it has to include the
>>>>>>> hadoop-common lib, because I believe copying code is an unacceptable
>>>>>>> solution. Even with Hadoop shims, which are a good way to support
>>>>>>> different versions of Hadoop (I am +1 on writing a Hadoop shim layer
>>>>>>> in Sqoop like Pig, Hive etc.), hadoop-common is still a dependency.
>>>>>>> For example, the client side of Hive (Beeline) depends on the
>>>>>>> hadoop-common lib [5]. So I don't think it is a big problem to add
>>>>>>> hadoop-common in.
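As a sketch of the shim idea mentioned above: instead of hard-wiring a single
Hadoop version, the server could probe at runtime for the delegation-token
classes introduced in Hadoop 2.6.0 and fall back to the plain Kerberos path
otherwise. The class and method names below are illustrative assumptions, not
actual Sqoop code:

```java
// Hypothetical capability probe for a Hadoop shim layer: checks whether the
// delegation-token web classes added in Hadoop 2.6.0 (HADOOP-10771) are on
// the classpath, without linking against them at compile time.
public class HadoopCapabilities {

    // This class is only present in hadoop-common >= 2.6.0.
    private static final String DT_AUTH_URL =
        "org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL";

    /** Returns true if delegation token support can be used at runtime. */
    public static boolean hasDelegationTokenSupport() {
        try {
            // initialize=false: we only check presence, we don't run static
            // initializers of the Hadoop class.
            Class.forName(DT_AUTH_URL, false,
                          HadoopCapabilities.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("delegation tokens available: "
                           + hasDelegationTokenSupport());
    }
}
```

On a cluster with Hadoop 2.5.x jars (or none at all) the probe returns false
and the server can keep using the existing Kerberos/doAs path; with 2.6.0+
jars it can switch to the delegation-token mechanism.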
>>>>>>> 
>>>>>>> Additionally, I agree with Abe that wire compatibility is another
>>>>>>> reason to change "provided" to "compile", since the interfaces are in
>>>>>>> the "Unstable" state; there would be a potential problem in the
>>>>>>> future.
>>>>>>> 
>>>>>>> So I prefer to add the hadoop-common lib as "compile" to make
>>>>>>> "Delegation Token Support" happen.
>>>>>>> 
>>>>>>> Add [email protected].
>>>>>>> 
>>>>>>> Links:
>>>>>>> 1: https://issues.apache.org/jira/browse/SQOOP-1776
>>>>>>> 2: https://reviews.apache.org/r/28795/
>>>>>>> 3: https://github.com/apache/sqoop/blob/sqoop2/shell/pom.xml#L75
>>>>>>> 4: https://issues.apache.org/jira/browse/HADOOP-10771
>>>>>>> 5: https://github.com/apache/hive/blob/trunk/beeline/pom.xml#L133
>>>>>>> 
>>>>>>> Richard
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Jarek Jarcec Cecho [mailto:[email protected]] On Behalf Of
>> Jarek
>>>>>> Jarcec Cecho
>>>>>>> Sent: Thursday, December 11, 2014 1:43 PM
>>>>>>> To: [email protected]
>>>>>>> Subject: Re: Hadoop as Compile time dependency in Sqoop2
>>>>>>> 
>>>>>>> Hi Abe,
>>>>>>> thank you very much for surfacing the question. I think there are
>>>>>>> several twists to it, so my apologies as this will be a long answer :)
>>>>>>> 
>>>>>>> When we started working on Sqoop 2 a few years back, we intentionally
>>>>>>> pushed the Hadoop dependency as far from the shared libraries as
>>>>>>> possible. The intention was that no component in common or core
>>>>>>> should depend on or use any Hadoop APIs; those should be isolated in
>>>>>>> separate modules (execution/submission engine). The reason is that
>>>>>>> Hadoop doesn’t have a particularly good track record of keeping
>>>>>>> backward compatibility, and that has bitten a lot of projects in the
>>>>>>> past. For example, every single project that I know of that uses MR
>>>>>>> needs a shim layer to deal with the API differences (Pig [1], Hive
>>>>>>> [2], …). The only exception I’m aware of is Sqoop 1, where we did not
>>>>>>> have to introduce shims only because we (shamelessly) copied code
>>>>>>> from Hadoop into our own code base; nevertheless, we still have
>>>>>>> places where we had to do that version detection [3]. I’m sure Hadoop
>>>>>>> is getting better as the project matures, but I would still advise
>>>>>>> being careful about using various Hadoop APIs and limiting that usage
>>>>>>> to the extent needed. There will obviously be situations where we
>>>>>>> want to use Hadoop APIs to make our life simpler, such as reusing
>>>>>>> their security implementation, and that will hopefully be fine.
>>>>>>> 
>>>>>>> Whereas we can be pretty sure that the Sqoop server will have Hadoop
>>>>>>> libraries on the classpath (the concern there was more about backward
>>>>>>> incompatible changes, which is hopefully less important nowadays),
>>>>>>> not introducing a Hadoop dependency on the client side had a
>>>>>>> different reason. hadoop-common is quite a big jar with a huge number
>>>>>>> of dependencies; check out the list in its pom file [4]. This is a
>>>>>>> problem because the Sqoop client is meant to be small and easily
>>>>>>> reusable, whereas depending on Hadoop forces the application
>>>>>>> developer onto certain library versions dictated by Hadoop (like
>>>>>>> guava, commons-*). That in turn forces people to do various weird
>>>>>>> things, such as using custom class loaders to isolate those libraries
>>>>>>> from the main application, which in most cases makes the situation
>>>>>>> even worse, because the Hadoop libraries assume “ownership” of the
>>>>>>> underlying JVM and run a lot of eternal threads per class loader.
>>>>>>> Hence I would advise being doubly careful when introducing a
>>>>>>> dependency on Hadoop (common) for our client.
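If hadoop-common ever did have to be pulled into the client, one common
mitigation for the transitive-dependency bloat described above is to exclude
the worst offenders explicitly. This is only an illustrative sketch (the
exclusion list is an assumption, and it does not solve the class-loader or
thread-ownership issues):

```xml
<!-- Hypothetical client-side dependency with transitive exclusions. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.6.0</version>
  <exclusions>
    <!-- Keep Hadoop from dictating the application's guava/commons-*
         versions; the client must then not touch Hadoop code paths that
         need them. -->
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-httpclient</groupId>
      <artifactId>commons-httpclient</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```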
>>>>>>> 
>>>>>>> I’m wondering what we’re trying to achieve by moving the dependency
>>>>>>> from “provided” to “compile”. Do we want to just ensure that it’s
>>>>>>> always there on the server side, or is the intent to get it to the
>>>>>>> client?
>>>>>>> 
>>>>>>> Jarcec
>>>>>>> 
>>>>>>> Links:
>>>>>>> 1: https://github.com/apache/pig/tree/trunk/shims/src
>>>>>>> 2: https://github.com/apache/hive/tree/trunk/shims
>>>>>>> 3:
>>>>>> 
>>>> 
>> https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/mapreduce/hcat/SqoopHCatUtilities.java#L962
>>>>>>> 4:
>>>>>> 
>>>> 
>> http://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-common%7C2.6.0%7Cjar
>>>>>>> 
>>>>>>>> On Dec 10, 2014, at 7:56 AM, Abraham Elmahrek <[email protected]>
>>>> wrote:
>>>>>>>> 
>>>>>>>> Hey guys,
>>>>>>>> 
>>>>>>>> With the work being done in Sqoop2 involving authentication, there
>> are
>>>>>>>> a few classes that are being used from hadoop auth and eventually
>>>>>>>> hadoop common.
>>>>>>>> 
>>>>>>>> I'd like to gauge how folks feel about including the hadoop
>> libraries
>>>>>>>> as a "compile" time dependency rather than "provided". The reasons
>>>>>> being:
>>>>>>>> 
>>>>>>>> 1. Hadoop maintains wire compatibility within a major version:
>>>>>>>> 
>>>>>> 
>>>> 
>> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Compatibility.html#Wire_compatibility
>>>>>>>> 2. UserGroupInformation and other useful interfaces are marked as
>>>>>>>> "Evolving" or "Unstable":
>>>>>>>> 
>>>>>> 
>>>> 
>> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
>>>>>>>> .
>>>>>>>> 
>>>>>>>> I've been looking around and it seems most projects include Hadoop
>> as
>>>>>>>> a compile time dependency:
>>>>>>>> 
>>>>>>>> 1. Kite -
>>>>>>>> 
>>>>>> 
>>>> 
>> https://github.com/kite-sdk/kite/blob/master/kite-hadoop-dependencies/cdh5/pom.xml
>>>>>>>> 2. Flume - https://github.com/apache/flume/blob/trunk/pom.xml
>>>>>>>> 3. Oozie - https://github.com/apache/oozie/tree/master/hadooplibs
>>>>>>>> 4. hive - https://github.com/apache/hive/blob/trunk/pom.xml#L1067
>>>>>>>> 
>>>>>>>> IMO wire compatibility is easier to maintain than Java API
>>>>>> compatibility.
>>>>>>>> There may be features in future Hadoop releases that we'll want to
>> use
>>>>>>>> on the security side as well.
>>>>>>>> 
>>>>>>>> -Abe
>>>>>>> 
