> [Chris] Yes, but even if the artifact is widely consumed, as a TLP it would
> need to sustain a community. If the scope is too narrow, then it will quickly
> fall into maintenance mode, its contributors will move on, and it will retire
> to the attic. Alone, I doubt its viability as a TLP. So as a first option,
> donating only this code to Apache Commons would accomplish some immediate
> goals in a sustainable forum.

Totally agree. As a TLP it needs a clear scope and roadmap to sustain a development community.
Thanks,
Haifeng

-----Original Message-----
From: Chris Douglas [mailto:cdoug...@apache.org]
Sent: Friday, February 5, 2016 6:28 AM
To: common-...@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

On Thu, Feb 4, 2016 at 12:06 PM, Gangumalla, Uma <uma.ganguma...@intel.com> wrote:
> [UMA] Ok. Great. You are right. I have cc'ed hadoop common. (You
> mean to cc Apache Commons as well?)

I meant, if you start a discussion with Apache Commons, please CC common-dev@hadoop to coordinate.

> [UMA] Right now the encryption libraries are the only ones we plan
> to pull out, and we see a lot of interest from other projects like
> Spark in using them. I see some challenges if we bring a lot of other
> common code into this project: it would all have different
> requirements, and maybe different expected timelines for release,
> etc. Some projects may want to use just the encryption interfaces,
> but not all of it. As they are completely independent pieces of code,
> it may be better to scope this out clearly.

Yes, but even if the artifact is widely consumed, as a TLP it would need to sustain a community. If the scope is too narrow, then it will quickly fall into maintenance mode, its contributors will move on, and it will retire to the attic. Alone, I doubt its viability as a TLP. So as a first option, donating only this code to Apache Commons would accomplish some immediate goals in a sustainable forum.

APR [1] has a similar scope. As a second option, that may also be a reasonable home, particularly if some of the native bits could integrate with APR.

If the scope is broader, the effort could sustain prolonged development. The current code is developing a strategy for packing native libraries on multiple platforms, a capability that, say, the native compression codecs (AFAIK) still lack. While java.nio is improving, many projects would benefit from a better, native interface to the filesystem (e.g., NativeIO). We could avoid duplicating effort and collaborate on a common library. As a third option, Hadoop already implements some useful native libraries, which is why a subproject might be a sound course. That would enable the subproject to coordinate with Hadoop on migrating its native functionality to a separable, reusable component, then move to a TLP when we can rely on it exclusively (if it has a well-defined, independent community). It could control its release cadence and limit its dependencies.

Finally, this is beside the point if nobody is interested in doing the work on such a project. It's rude to pull code out of Hadoop and donate it to another project so Spark can avoid a dependency, but this instance seems reasonable to me. -C

[1] https://apr.apache.org/

> On 2/3/16, 6:46 PM, "Chen, Haifeng" <haifeng.c...@intel.com> wrote:
>
>> Thanks Chris.
>>
>>>> I went through the repository, and now understand the reasoning
>>>> that would locate this code in Apache Commons. This isn't proposing
>>>> to extract much of the implementation and it takes none of the
>>>> integration. It's limited to interfaces to crypto libraries and
>>>> streams/configuration.
>> Exactly.
>>
>>>> Chimera would be a boutique TLP, unless we wanted to draw out more
>>>> of the integration and tooling. Is that a goal you're interested in
>>>> pursuing? There's a tension between keeping this focused and
>>>> including enough functionality to make it viable as an independent
>>>> component.
>> The Chimera goal was to provide useful, common, and optimized
>> cryptographic functionality. I would prefer that it stay focused
>> on this clear scope. Requirements from multiple domains would add
>> more challenges and uncertainty about where and how it should go,
>> and thus more risk of stalling.
>>
>>>> If the encryption libraries are the only ones you're interested in
>>>> pulling out, then Apache Commons does seem like a better target than
>>>> a separate project.
>> Yes. As mentioned above, the library will be positioned as a
>> cryptographic library.
>>
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: Chris Douglas [mailto:cdoug...@apache.org]
>> Sent: Thursday, February 4, 2016 7:26 AM
>> To: hdfs-dev@hadoop.apache.org
>> Subject: Re: Hadoop encryption module as Apache Chimera incubator
>> project
>>
>> I went through the repository, and now understand the reasoning that
>> would locate this code in Apache Commons. This isn't proposing to
>> extract much of the implementation and it takes none of the
>> integration. It's limited to interfaces to crypto libraries and
>> streams/configuration. It might be a reasonable fit for commons-codec,
>> but that's a pretty sparse library and driving the release cadence
>> might be more complicated. It'd be worth discussing on their lists
>> (please also CC common-dev@).
>>
>> Chimera would be a boutique TLP, unless we wanted to draw out more of
>> the integration and tooling. Is that a goal you're interested in pursuing?
>> There's a tension between keeping this focused and including enough
>> functionality to make it viable as an independent component. By way of
>> example, Hadoop's common project requires too many dependencies and
>> carries too much historical baggage for other projects to rely on.
>> I agree with Colin/Steve: we don't want this to grow into another
>> guava-like dependency that creates more work in conflicts than it
>> saves in implementation...
>>
>> Would it make sense to also package some of the compression libraries,
>> and maybe some of the text processing from MapReduce? Evolving some of
>> this code to a common library with few/no dependencies would be
>> generally useful. As a subproject, it could have a broader scope that
>> could evolve into a viable TLP. If the encryption libraries are the
>> only ones you're interested in pulling out, then Apache Commons does
>> seem like a better target than a separate project. -C
>>
>>
>> On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cdoug...@apache.org> wrote:
>>> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma
>>> <uma.ganguma...@intel.com> wrote:
>>>>> Standing at the point of a shared, fundamental piece of code like
>>>>> this, I do think Apache Commons might be the best direction we
>>>>> can try as a first effort. In this direction, we still need to
>>>>> work with the Apache Commons community on buying in and accepting
>>>>> the proposal.
>>>> Makes sense.
>>>
>>> Makes sense how?
>>>
>>>> For this we should define independent release cycles for the
>>>> project, and it would just be placed under the Hadoop tree if we
>>>> all conclude with this option at the end.
>>>
>>> Yes.
>>>
>>>> [Chris]
>>>>> If Chimera is not successful as an independent project or stalls,
>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>> maintainers.
>>>>>
>>>> I am not so strong on this point. If we assume the project would be
>>>> unsuccessful, it can be unsuccessful (less maintained) even under
>>>> Hadoop. But then other projects depending on this piece would get
>>>> less support.
>>>> Of course, right now we feel this piece of code is very
>>>> important, and we feel (expect) it can be successful as an
>>>> independent project, irrespective of whether it lives as a
>>>> separate project outside Hadoop or inside.
>>>> So, I feel this point should not really sway the discussion.
>>>
>>> Sure; code can idle anywhere, but that wasn't the point I was after.
>>> You propose to extract code from Hadoop, but if Chimera fails then
>>> what recourse do we have among the other projects taking a
>>> dependency on it? Splitting off another project is feasible, but
>>> Chimera should be sustainable before this PMC can divest itself of
>>> responsibility for security libraries. That's a pretty low bar.
>>>
>>> Bundling the library with the jar is helpful; I've used that before.
>>> It should prefer (updated) libraries from the environment, if
>>> configured. Otherwise it's a pain (or impossible) for ops to patch
>>> security bugs. -C
>>>
>>>>> -----Original Message-----
>>>>> From: Colin P. McCabe [mailto:cmcc...@apache.org]
>>>>> Sent: Wednesday, February 3, 2016 4:56 AM
>>>>> To: hdfs-dev@hadoop.apache.org
>>>>> Subject: Re: Hadoop encryption module as Apache Chimera incubator
>>>>> project
>>>>>
>>>>> It's great to see interest in improving this functionality. I
>>>>> think Chimera could be successful as an Apache project. I don't
>>>>> have a strong opinion one way or the other as to whether it
>>>>> belongs as part of Hadoop or separate.
>>>>>
>>>>> I do think there will be some challenges splitting this
>>>>> functionality out into a separate jar, because of the way our
>>>>> CLASSPATH works right now.
>>>>> For example, let's say that Hadoop depends on Chimera 1.2 and
>>>>> Spark depends on Chimera 1.1. Now Spark jobs have two different
>>>>> versions fighting it out on the classpath, similar to the
>>>>> situation with Guava and other libraries. Perhaps if Chimera
>>>>> adopts a policy of strong backwards compatibility, we can just
>>>>> always use the latest jar, but it still seems likely that there
>>>>> will be problems. There are various classpath isolation ideas
>>>>> that could help here, but they are big projects in their own
>>>>> right and we don't have a clear timeline for them. If this does
>>>>> end up being a separate jar, we may need to shade it to avoid all
>>>>> these issues.
>>>>>
>>>>> Bundling the JNI glue code in the jar itself is an interesting
>>>>> idea, which we have talked about before for libhadoop.so. It
>>>>> doesn't really have anything to do with the question of TLP vs.
>>>>> non-TLP, of course.
>>>>> We could do that refactoring in Hadoop itself. The really
>>>>> complicated part of bundling JNI code in a jar is that you need
>>>>> to create jars for every combination in the cross product of
>>>>> (JVM version, openssl version, operating system).
>>>>> For example, you have the RHEL6 build for openJDK7 using openssl
>>>>> 1.0.1e. If you change any one thing (say, openJDK7 to Oracle
>>>>> JDK8), then you might need to rebuild. And certainly using Ubuntu
>>>>> would be a rebuild. And so forth. This kind of clashes with
>>>>> Maven's philosophy of pulling prebuilt jars from the internet.
>>>>>
>>>>> Kai Zheng's question about whether we would bundle openSSL's
>>>>> libraries is a good one. Given the high rate of new
>>>>> vulnerabilities discovered in that library, it seems like
>>>>> bundling would require Hadoop users and vendors to update very
>>>>> frequently, much more frequently than Hadoop is traditionally
>>>>> updated. So probably we would not choose to bundle openssl.
>>>>>
>>>>> best,
>>>>> Colin
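To make the packing discussion above concrete, here is a minimal sketch of the load-from-jar pattern Colin and Chris describe. The class, library, and resource names are hypothetical, not Chimera's or libhadoop's actual loader; per Chris's note, it prefers an updated library from the environment before extracting the bundled copy, so ops can still patch security bugs:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    final class NativeLoader {
      static void load() {
        try {
          // Prefer a library installed on the system, if present.
          System.loadLibrary("chimera");
          return;
        } catch (UnsatisfiedLinkError ignored) {
          // Fall through to the copy bundled in the jar.
        }
        // The resource path encodes part of the cross product Colin
        // mentions; JVM and openssl versions multiply the builds further.
        String resource = "/native/"
            + System.getProperty("os.name").toLowerCase().replace(" ", "")
            + "-" + System.getProperty("os.arch") + "/libchimera.so";
        try (InputStream in = NativeLoader.class.getResourceAsStream(resource)) {
          if (in == null) {
            throw new UnsatisfiedLinkError("no bundled library at " + resource);
          }
          // Extract to a temp file; System.load() needs a real path.
          Path tmp = Files.createTempFile("libchimera", ".so");
          tmp.toFile().deleteOnExit();
          Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
          System.load(tmp.toAbsolutePath().toString());
        } catch (IOException e) {
          throw new UnsatisfiedLinkError("failed to extract " + resource + ": " + e);
        }
      }
    }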
>>>>>
>>>>> On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas
>>>>> <cdoug...@apache.org> wrote:
>>>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>>>> There's also no reason why it should maintain dependencies on
>>>>>> other parts of Hadoop, if those are separable. How is this
>>>>>> solution inadequate?
>>>>>>
>>>>>> If Chimera is not successful as an independent project or stalls,
>>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>>> maintainers. Projects have high mortality in early life, and a
>>>>>> fight over inheritance/maintenance is something we'd like to avoid.
>>>>>> If, on the other hand, it develops enough of a community where it
>>>>>> is obviously viable, then we can (and should) break it out as a
>>>>>> TLP (as we have before). If other Apache projects take a
>>>>>> dependency on Chimera, we're open to adding them to security@hadoop.
>>>>>>
>>>>>> Unlike Yetus, which was largely rewritten right before it was
>>>>>> made into a TLP, security in Hadoop has a complicated pedigree.
>>>>>> If Chimera eventually becomes a TLP, it seems fair to include
>>>>>> those who work on it while it is a subproject. Declared upfront,
>>>>>> that criterion is fairer than any post hoc justification, and
>>>>>> will lead to a more accurate account of its community than a
>>>>>> subset of the Hadoop PMC/committers that volunteer. -C
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng
>>>>>> <haifeng.c...@intel.com> wrote:
>>>>>>> Thanks to all the folks providing feedback and participating in
>>>>>>> the discussions.
>>>>>>>
>>>>>>> @Owen, do you still have any concerns about going forward in the
>>>>>>> direction of Apache Commons (or other options, TLP)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Chen, Haifeng [mailto:haifeng.c...@intel.com]
>>>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: RE: Hadoop encryption module as Apache Chimera
>>>>>>> incubator project
>>>>>>>
>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I
>>>>>>>>> think that moving core components out of Hadoop is bad from a
>>>>>>>>> project management perspective.
>>>>>>>
>>>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>>> HDFS, YARN, etc.) are becoming core to Hadoop, I don't think
>>>>>>>> that should really influence whether or not the
>>>>>>>> non-Hadoop-specific encryption routines should be part of the
>>>>>>>> Hadoop code base, or part of the code base of another project
>>>>>>>> that Hadoop depends on. If Chimera had existed as a library
>>>>>>>> hosted at ASF when HDFS encryption was first developed, HDFS
>>>>>>>> probably would have just added that as a dependency and been
>>>>>>>> done with it. I don't think we would've copy/pasted the code
>>>>>>>> for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>> Agree with ATM. I want to also make an additional clarification.
>>>>>>> I agree that the encryption capabilities are becoming core to
>>>>>>> Hadoop. This effort, though, is to put common and shared
>>>>>>> encryption routines, such as the crypto stream implementations,
>>>>>>> into a scope which can be widely shared across the Apache
>>>>>>> ecosystem. This doesn't move Hadoop encryption out of Hadoop
>>>>>>> (that is not possible).
>>>>>>>
>>>>>>> Agreed that making it a separate, independently released
>>>>>>> project within Hadoop takes a step further than the existing
>>>>>>> approach and solves some issues (such as the libhadoop.so
>>>>>>> problem). Frankly speaking, though, I think it is not the best
>>>>>>> option we can try. I also expect that an independent release
>>>>>>> project within Hadoop core would complicate the existing
>>>>>>> release ideology of Hadoop.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Aaron T. Myers [mailto:a...@cloudera.com]
>>>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: Re: Hadoop encryption module as Apache Chimera
>>>>>>> incubator project
>>>>>>>
>>>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley
>>>>>>> <omal...@apache.org> wrote:
>>>>>>>
>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think
>>>>>>>> that moving core components out of Hadoop is bad from a project
>>>>>>>> management perspective.
>>>>>>>>
>>>>>>>
>>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>> HDFS, YARN, etc.) are becoming core to Hadoop, I don't think
>>>>>>> that should really influence whether or not the
>>>>>>> non-Hadoop-specific encryption routines should be part of the
>>>>>>> Hadoop code base, or part of the code base of another project
>>>>>>> that Hadoop depends on. If Chimera had existed as a library
>>>>>>> hosted at ASF when HDFS encryption was first developed, HDFS
>>>>>>> probably would have just added that as a dependency and been
>>>>>>> done with it. I don't think we would've copy/pasted the code
>>>>>>> for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>>
>>>>>>>> To put it another way, a bug in the encryption routines will
>>>>>>>> likely become a security problem that security@hadoop needs to
>>>>>>>> hear about. I don't think adding a separate project in the
>>>>>>>> middle of that communication chain is a good idea. The same
>>>>>>>> applies to data corruption problems, and so on...
>>>>>>>>
>>>>>>>
>>>>>>> Isn't the same true of all the libraries that Hadoop currently
>>>>>>> depends upon? If the commons-httpclient library (or
>>>>>>> commons-codec, or commons-io, or guava, or...) has a security
>>>>>>> vulnerability, we need to know about it so that we can update
>>>>>>> our dependency to a fixed version. This case doesn't seem
>>>>>>> materially different from that.
>>>>>>>
>>>>>>>
>>>>>>>> > It may be good to keep it at a generalized place (as in the
>>>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>>>
>>>>>>>> Apache Commons is a collection of *Java* projects, so Chimera
>>>>>>>> as a JNI-based library isn't a natural fit.
>>>>>>>>
>>>>>>>
>>>>>>> Could very well be that Apache Commons's charter would preclude
>>>>>>> Chimera. You probably know better than I do about that.
>>>>>>>
>>>>>>>
>>>>>>>> Furthermore, Apache Commons doesn't have its own security list,
>>>>>>>> so problems will go to the generic secur...@apache.org.
>>>>>>>>
>>>>>>>
>>>>>>> That seems easy enough to remedy, if they wanted to, and besides,
>>>>>>> I'm not sure why that would influence this discussion. In my
>>>>>>> experience, projects that don't have a separate
>>>>>>> security@project.a.o mailing list tend to just handle security
>>>>>>> issues on their private@project.a.o mailing list, which seems
>>>>>>> fine to me.
>>>>>>>
>>>>>>>
>>>>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>>>>>
>>>>>>>
>>>>>>> I'm certainly not at all wedded to Apache Commons; that just
>>>>>>> seemed like a natural place to put it to me. Could be that a
>>>>>>> brand new TLP might make more sense.
>>>>>>>
>>>>>>> I *do* think that if other non-Hadoop projects want to make use
>>>>>>> of Chimera, which as I understand it is the goal which started
>>>>>>> this thread, then Chimera should exist outside of Hadoop so that:
>>>>>>>
>>>>>>> a) Projects that have nothing to do with Hadoop can just depend
>>>>>>> directly on Chimera, which has nothing Hadoop-specific in it.
>>>>>>>
>>>>>>> b) The Hadoop project doesn't have to export/maintain/concern
>>>>>>> itself with yet another publicly-consumed interface.
>>>>>>>
>>>>>>> c) Chimera can have its own (presumably much faster) release
>>>>>>> cadence, completely separate from Hadoop.
>>>>>>>
>>>>>>> --
>>>>>>> Aaron T. Myers
>>>>>>> Software Engineer, Cloudera
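For readers of the archive: the "crypto stream implementations" the thread keeps referring to are roughly of the following shape. This is a JDK-only sketch with hypothetical names, not Chimera's actual API; Chimera's value is in providing optimized, openssl-backed (JNI) versions of such routines:

    import java.io.OutputStream;
    import java.security.GeneralSecurityException;
    import javax.crypto.Cipher;
    import javax.crypto.CipherOutputStream;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    final class CryptoStreams {
      // AES/CTR keeps the ciphertext the same length as the plaintext
      // and permits random-access reads by recomputing the counter from
      // the byte offset, which is why HDFS encryption uses it for
      // filesystem-style, seekable workloads.
      static OutputStream wrap(OutputStream out, byte[] key, byte[] iv)
          throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE,
            new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return new CipherOutputStream(out, cipher);
      }
    }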