> [Chris] Yes, but even if the artifact is widely consumed, as a TLP it would
> need to sustain a community. If the scope is too narrow, then it will quickly
> fall into maintenance mode, its contributors will move on, and it will retire
> to the attic. Alone, I doubt its viability as a TLP. So as a first option,
> donating only this code to Apache Commons would accomplish some immediate
> goals in a sustainable forum.

Totally agree. As a TLP it needs a clear scope and roadmap to sustain a development community.
Thanks,
Haifeng

-----Original Message-----
From: Chris Douglas [mailto:cdoug...@apache.org]
Sent: Friday, February 5, 2016 6:28 AM
To: common-...@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

On Thu, Feb 4, 2016 at 12:06 PM, Gangumalla, Uma <uma.ganguma...@intel.com> wrote:
> [UMA] Ok. Great. You are right. I have cc'ed hadoop common. (You
> mean to cc Apache Commons as well?)

I meant, if you start a discussion with Apache Commons, please CC common-dev@hadoop to coordinate.

> [UMA] Right now the encryption libraries are the only ones we plan
> to pull out, and we see a lot of interest from other projects like
> Spark in using them. I see some challenges if we bring a lot of other
> common code into this project: it would all have different
> requirements, and maybe different expected timelines for release,
> etc. Some projects may want to use just the encryption interfaces,
> but not all of it. As they are completely independent pieces of code,
> it may be better to scope this out clearly.

Yes, but even if the artifact is widely consumed, as a TLP it would need to sustain a community. If the scope is too narrow, then it will quickly fall into maintenance mode, its contributors will move on, and it will retire to the attic. Alone, I doubt its viability as a TLP. So as a first option, donating only this code to Apache Commons would accomplish some immediate goals in a sustainable forum.

APR [1] has a similar scope. As a second option, that may also be a reasonable home, particularly if some of the native bits could integrate with APR.

If the scope is broader, the effort could sustain prolonged development. The current code is developing a strategy for packing native libraries on multiple platforms, a capability that, say, the native compression codecs (AFAIK) still lack. While java.nio is improving, many projects would benefit from a better, native interface to the filesystem (e.g., NativeIO). We could avoid duplicating effort and collaborate on a common library. As a third option, Hadoop already implements some useful native libraries, which is why a subproject might be a sound course. That would enable the subproject to coordinate with Hadoop on migrating its native functionality to a separable, reusable component, then move to a TLP when we can rely on it exclusively (if it has a well-defined, independent community). It could control its release cadence and limit its dependencies.

Finally, this is beside the point if nobody is interested in doing the work on such a project. It's rude to pull code out of Hadoop and donate it to another project so Spark can avoid a dependency, but this instance seems reasonable to me. -C

[1] https://apr.apache.org/

> On 2/3/16, 6:46 PM, "Chen, Haifeng" <haifeng.c...@intel.com> wrote:
>
>> Thanks Chris.
>>
>>>> I went through the repository, and now understand the reasoning
>>>> that would locate this code in Apache Commons. This isn't proposing
>>>> to extract much of the implementation and it takes none of the
>>>> integration. It's limited to interfaces to crypto libraries and
>>>> streams/configuration.
>> Exactly.
>>
>>>> Chimera would be a boutique TLP, unless we wanted to draw out more
>>>> of the integration and tooling. Is that a goal you're interested in
>>>> pursuing? There's a tension between keeping this focused and
>>>> including enough functionality to make it viable as an independent
>>>> component.
>> The Chimera goal was to provide useful, common, and optimized
>> cryptographic functionality. I would prefer that it stay focused
>> on this clear scope. Requirements from multiple domains would add
>> more challenges and uncertainty about where and how it should go,
>> and thus more risk of stalling.
>>
>>>> If the encryption libraries are the only ones you're interested in
>>>> pulling out, then Apache Commons does seem like a better target than
>>>> a separate project.
>> Yes. As mentioned above, the library will be positioned as a
>> cryptographic library.
>>
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: Chris Douglas [mailto:cdoug...@apache.org]
>> Sent: Thursday, February 4, 2016 7:26 AM
>> To: hdfs-dev@hadoop.apache.org
>> Subject: Re: Hadoop encryption module as Apache Chimera incubator
>> project
>>
>> I went through the repository, and now understand the reasoning that
>> would locate this code in Apache Commons. This isn't proposing to
>> extract much of the implementation and it takes none of the
>> integration. It's limited to interfaces to crypto libraries and
>> streams/configuration. It might be a reasonable fit for commons-codec,
>> but that's a pretty sparse library and driving the release cadence
>> might be more complicated. It'd be worth discussing on their lists
>> (please also CC common-dev@).
>>
>> Chimera would be a boutique TLP, unless we wanted to draw out more of
>> the integration and tooling. Is that a goal you're interested in pursuing?
>> There's a tension between keeping this focused and including enough
>> functionality to make it viable as an independent component. By way of
>> example, Hadoop's common project requires too many dependencies and
>> carries too much historical baggage for other projects to rely on.
>> I agree with Colin/Steve: we don't want this to grow into another
>> guava-like dependency that creates more work in conflicts than it
>> saves in implementation...
>>
>> Would it make sense to also package some of the compression libraries,
>> and maybe some of the text processing from MapReduce? Evolving some of
>> this code to a common library with few/no dependencies would be
>> generally useful. As a subproject, it could have a broader scope that
>> could evolve into a viable TLP. If the encryption libraries are the
>> only ones you're interested in pulling out, then Apache Commons does
>> seem like a better target than a separate project. -C
>>
>>
>> On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cdoug...@apache.org> wrote:
>>> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma
>>> <uma.ganguma...@intel.com> wrote:
>>>>> Standing at the point of a shared, fundamental piece of code like
>>>>> this, I do think Apache Commons might be the best direction we
>>>>> can try as a first effort. In this direction, we still need to
>>>>> work with the Apache Commons community on buying in and accepting
>>>>> the proposal.
>>>> Makes sense.
>>>
>>> Makes sense how?
>>>
>>>> For this we should define independent release cycles for the
>>>> project, and it would just be placed under the Hadoop tree if we
>>>> all conclude with this option at the end.
>>>
>>> Yes.
>>>
>>>> [Chris]
>>>>> If Chimera is not successful as an independent project or stalls,
>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>> maintainers.
>>>>>
>>>> I am not so strong on this point. If we assume the project would be
>>>> unsuccessful, it can be unsuccessful (less maintained) even under
>>>> Hadoop. But then other projects depending on this piece would get
>>>> less support.
>>>> Of course, right now we feel this piece of code is very
>>>> important, and we feel (expect) it can be successful as an
>>>> independent project, irrespective of whether it lives as a
>>>> separate project outside Hadoop or inside.
>>>> So, I feel this point should not really sway the discussion.
>>>
>>> Sure; code can idle anywhere, but that wasn't the point I was after.
>>> You propose to extract code from Hadoop, but if Chimera fails then
>>> what recourse do we have among the other projects taking a
>>> dependency on it? Splitting off another project is feasible, but
>>> Chimera should be sustainable before this PMC can divest itself of
>>> responsibility for security libraries. That's a pretty low bar.
>>>
>>> Bundling the library with the jar is helpful; I've used that before.
>>> It should prefer (updated) libraries from the environment, if
>>> configured. Otherwise it's a pain (or impossible) for ops to patch
>>> security bugs. -C
>>>
>>>>> -----Original Message-----
>>>>> From: Colin P. McCabe [mailto:cmcc...@apache.org]
>>>>> Sent: Wednesday, February 3, 2016 4:56 AM
>>>>> To: hdfs-dev@hadoop.apache.org
>>>>> Subject: Re: Hadoop encryption module as Apache Chimera incubator
>>>>> project
>>>>>
>>>>> It's great to see interest in improving this functionality. I
>>>>> think Chimera could be successful as an Apache project. I don't
>>>>> have a strong opinion one way or the other as to whether it
>>>>> belongs as part of Hadoop or separate.
>>>>>
>>>>> I do think there will be some challenges splitting this
>>>>> functionality out into a separate jar, because of the way our
>>>>> CLASSPATH works right now.
>>>>> For example, let's say that Hadoop depends on Chimera 1.2 and
>>>>> Spark depends on Chimera 1.1. Now Spark jobs have two different
>>>>> versions fighting it out on the classpath, similar to the
>>>>> situation with Guava and other libraries. Perhaps if Chimera
>>>>> adopts a policy of strong backwards compatibility, we can just
>>>>> always use the latest jar, but it still seems likely that there
>>>>> will be problems. There are various classpath isolation ideas
>>>>> that could help here, but they are big projects in their own
>>>>> right and we don't have a clear timeline for them. If this does
>>>>> end up being a separate jar, we may need to shade it to avoid all
>>>>> these issues.
>>>>>
>>>>> Bundling the JNI glue code in the jar itself is an interesting
>>>>> idea, which we have talked about before for libhadoop.so. It
>>>>> doesn't really have anything to do with the question of TLP vs.
>>>>> non-TLP, of course.
>>>>> We could do that refactoring in Hadoop itself. The really
>>>>> complicated part of bundling JNI code in a jar is that you need
>>>>> to create jars for every combination in the cross product of
>>>>> (JVM version, openssl version, operating system).
>>>>> For example, you have the RHEL6 build for openJDK7 using openssl
>>>>> 1.0.1e. If you change any one thing (say, openJDK7 to Oracle
>>>>> JDK8), then you might need to rebuild. And certainly using Ubuntu
>>>>> would be a rebuild. And so forth. This kind of clashes with
>>>>> Maven's philosophy of pulling prebuilt jars from the internet.
>>>>>
>>>>> Kai Zheng's question about whether we would bundle openSSL's
>>>>> libraries is a good one. Given the high rate of new
>>>>> vulnerabilities discovered in that library, it seems like
>>>>> bundling would require Hadoop users and vendors to update very
>>>>> frequently, much more frequently than Hadoop is traditionally
>>>>> updated. So probably we would not choose to bundle openssl.
>>>>>
>>>>> best,
>>>>> Colin
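To make the packing discussion above concrete, here is a minimal sketch of the load-from-jar pattern Colin and Chris describe. The class, library, and resource names are hypothetical, not Chimera's or libhadoop's actual loader; per Chris's note, it prefers an updated library from the environment before extracting the bundled copy, so ops can still patch security bugs:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    final class NativeLoader {
      static void load() {
        try {
          // Prefer a library installed on the system, if present.
          System.loadLibrary("chimera");
          return;
        } catch (UnsatisfiedLinkError ignored) {
          // Fall through to the copy bundled in the jar.
        }
        // The resource path encodes part of the cross product Colin
        // mentions; JVM and openssl versions multiply the builds further.
        String resource = "/native/"
            + System.getProperty("os.name").toLowerCase().replace(" ", "")
            + "-" + System.getProperty("os.arch") + "/libchimera.so";
        try (InputStream in = NativeLoader.class.getResourceAsStream(resource)) {
          if (in == null) {
            throw new UnsatisfiedLinkError("no bundled library at " + resource);
          }
          // Extract to a temp file; System.load() needs a real path.
          Path tmp = Files.createTempFile("libchimera", ".so");
          tmp.toFile().deleteOnExit();
          Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
          System.load(tmp.toAbsolutePath().toString());
        } catch (IOException e) {
          throw new UnsatisfiedLinkError("failed to extract " + resource + ": " + e);
        }
      }
    }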
>>>>>
>>>>> On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas
>>>>> <cdoug...@apache.org> wrote:
>>>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>>>> There's also no reason why it should maintain dependencies on
>>>>>> other parts of Hadoop, if those are separable. How is this
>>>>>> solution inadequate?
>>>>>>
>>>>>> If Chimera is not successful as an independent project or stalls,
>>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>>> maintainers. Projects have high mortality in early life, and a
>>>>>> fight over inheritance/maintenance is something we'd like to avoid.
>>>>>> If, on the other hand, it develops enough of a community where it
>>>>>> is obviously viable, then we can (and should) break it out as a
>>>>>> TLP (as we have before). If other Apache projects take a
>>>>>> dependency on Chimera, we're open to adding them to security@hadoop.
>>>>>>
>>>>>> Unlike Yetus, which was largely rewritten right before it was
>>>>>> made into a TLP, security in Hadoop has a complicated pedigree.
>>>>>> If Chimera eventually becomes a TLP, it seems fair to include
>>>>>> those who work on it while it is a subproject. Declared upfront,
>>>>>> that criterion is fairer than any post hoc justification, and
>>>>>> will lead to a more accurate account of its community than a
>>>>>> subset of the Hadoop PMC/committers that volunteer. -C
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng
>>>>>> <haifeng.c...@intel.com> wrote:
>>>>>>> Thanks to all the folks providing feedback and participating in
>>>>>>> the discussions.
>>>>>>>
>>>>>>> @Owen, do you still have any concerns about going forward in the
>>>>>>> direction of Apache Commons (or other options, TLP)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Chen, Haifeng [mailto:haifeng.c...@intel.com]
>>>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: RE: Hadoop encryption module as Apache Chimera
>>>>>>> incubator project
>>>>>>>
>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I
>>>>>>>>> think that moving core components out of Hadoop is bad from a
>>>>>>>>> project management perspective.
>>>>>>>
>>>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>>> HDFS, YARN, etc.) are becoming core to Hadoop, I don't think
>>>>>>>> that should really influence whether or not the
>>>>>>>> non-Hadoop-specific encryption routines should be part of the
>>>>>>>> Hadoop code base, or part of the code base of another project
>>>>>>>> that Hadoop depends on. If Chimera had existed as a library
>>>>>>>> hosted at ASF when HDFS encryption was first developed, HDFS
>>>>>>>> probably would have just added that as a dependency and been
>>>>>>>> done with it. I don't think we would've copy/pasted the code
>>>>>>>> for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>> Agree with ATM. I want to also make an additional clarification.
>>>>>>> I agree that the encryption capabilities are becoming core to
>>>>>>> Hadoop. This effort, though, is to put common and shared
>>>>>>> encryption routines, such as the crypto stream implementations,
>>>>>>> into a scope which can be widely shared across the Apache
>>>>>>> ecosystem. This doesn't move Hadoop encryption out of Hadoop
>>>>>>> (that is not possible).
>>>>>>>
>>>>>>> Agreed that making it a separate, independently released
>>>>>>> project within Hadoop takes a step further than the existing
>>>>>>> approach and solves some issues (such as the libhadoop.so
>>>>>>> problem). Frankly speaking, though, I think it is not the best
>>>>>>> option we can try. I also expect that an independent release
>>>>>>> project within Hadoop core would complicate the existing
>>>>>>> release ideology of Hadoop.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Aaron T. Myers [mailto:a...@cloudera.com]
>>>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: Re: Hadoop encryption module as Apache Chimera
>>>>>>> incubator project
>>>>>>>
>>>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley
>>>>>>> <omal...@apache.org> wrote:
>>>>>>>
>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think
>>>>>>>> that moving core components out of Hadoop is bad from a project
>>>>>>>> management perspective.
>>>>>>>>
>>>>>>>
>>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>> HDFS, YARN, etc.) are becoming core to Hadoop, I don't think
>>>>>>> that should really influence whether or not the
>>>>>>> non-Hadoop-specific encryption routines should be part of the
>>>>>>> Hadoop code base, or part of the code base of another project
>>>>>>> that Hadoop depends on. If Chimera had existed as a library
>>>>>>> hosted at ASF when HDFS encryption was first developed, HDFS
>>>>>>> probably would have just added that as a dependency and been
>>>>>>> done with it. I don't think we would've copy/pasted the code
>>>>>>> for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>>
>>>>>>>> To put it another way, a bug in the encryption routines will
>>>>>>>> likely become a security problem that security@hadoop needs to
>>>>>>>> hear about. I don't think adding a separate project in the
>>>>>>>> middle of that communication chain is a good idea. The same
>>>>>>>> applies to data corruption problems, and so on...
>>>>>>>>
>>>>>>>
>>>>>>> Isn't the same true of all the libraries that Hadoop currently
>>>>>>> depends upon? If the commons-httpclient library (or
>>>>>>> commons-codec, or commons-io, or guava, or...) has a security
>>>>>>> vulnerability, we need to know about it so that we can update
>>>>>>> our dependency to a fixed version. This case doesn't seem
>>>>>>> materially different from that.
>>>>>>>
>>>>>>>
>>>>>>>> > It may be good to keep it at a generalized place (as in the
>>>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>>>
>>>>>>>> Apache Commons is a collection of *Java* projects, so Chimera
>>>>>>>> as a JNI-based library isn't a natural fit.
>>>>>>>>
>>>>>>>
>>>>>>> Could very well be that Apache Commons's charter would preclude
>>>>>>> Chimera. You probably know better than I do about that.
>>>>>>>
>>>>>>>
>>>>>>>> Furthermore, Apache Commons doesn't have its own security list,
>>>>>>>> so problems will go to the generic secur...@apache.org.
>>>>>>>>
>>>>>>>
>>>>>>> That seems easy enough to remedy, if they wanted to, and besides,
>>>>>>> I'm not sure why that would influence this discussion. In my
>>>>>>> experience, projects that don't have a separate
>>>>>>> security@project.a.o mailing list tend to just handle security
>>>>>>> issues on their private@project.a.o mailing list, which seems
>>>>>>> fine to me.
>>>>>>>
>>>>>>>
>>>>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>>>>>
>>>>>>>
>>>>>>> I'm certainly not at all wedded to Apache Commons; that just
>>>>>>> seemed like a natural place to put it to me. Could be that a
>>>>>>> brand new TLP might make more sense.
>>>>>>>
>>>>>>> I *do* think that if other non-Hadoop projects want to make use
>>>>>>> of Chimera, which as I understand it is the goal which started
>>>>>>> this thread, then Chimera should exist outside of Hadoop so that:
>>>>>>>
>>>>>>> a) Projects that have nothing to do with Hadoop can just depend
>>>>>>> directly on Chimera, which has nothing Hadoop-specific in it.
>>>>>>>
>>>>>>> b) The Hadoop project doesn't have to export/maintain/concern
>>>>>>> itself with yet another publicly-consumed interface.
>>>>>>>
>>>>>>> c) Chimera can have its own (presumably much faster) release
>>>>>>> cadence, completely separate from Hadoop.
>>>>>>>
>>>>>>> --
>>>>>>> Aaron T. Myers
>>>>>>> Software Engineer, Cloudera
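For readers of the archive: the "crypto stream implementations" the thread keeps referring to are roughly of the following shape. This is a JDK-only sketch with hypothetical names, not Chimera's actual API; Chimera's value is in providing optimized, openssl-backed (JNI) versions of such routines:

    import java.io.OutputStream;
    import java.security.GeneralSecurityException;
    import javax.crypto.Cipher;
    import javax.crypto.CipherOutputStream;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    final class CryptoStreams {
      // AES/CTR keeps the ciphertext the same length as the plaintext
      // and permits random-access reads by recomputing the counter from
      // the byte offset, which is why HDFS encryption uses it for
      // filesystem-style, seekable workloads.
      static OutputStream wrap(OutputStream out, byte[] key, byte[] iv)
          throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE,
            new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return new CipherOutputStream(out, cipher);
      }
    }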