Joep,

You raise a number of points: (1) Ozone vs. other object stores ("Some users would choose Ozone as that layer, some might use S3, others GCS, or Azure, or something else"). (2) How HDSL/Ozone fits into Hadoop and whether it is necessary. (3) The release issue, to which we will respond in a separate email.
Let me respond to 1 & 2:

***Wrt to (1) Ozone vs other object stores***

Neither HDFS nor Ozone has any real role in the cloud except for temp data. The cost of local disk or EBS is so high that long-term data storage on HDFS or even Ozone is prohibitive. So why the hell create the KV namespace? We need to stabilize HDSL, where the data is stored.

- We are targeting Hive and Spark apps to stabilize HDSL, using real Hadoop apps over OzoneFS.
- But HDSL/Ozone is not feature-compatible with HDFS, so how will users use it for real, to the point of stability? Users can run HDFS and Ozone side by side in the same cluster, with two namespaces (just as in Federation), and run apps on both: some Hive and Spark apps on Ozone, and others that need full HDFS features (e.g. encryption) on HDFS. As it becomes stable, they can start using HDSL/Ozone in production for a portion of their data.

***Wrt to (2) HDSL/Ozone fitting into Hadoop and why the same repository***

The Ozone KV namespace is a temporary step. The real goal is to put the NN on top of HDSL; we have shown how to do that in the roadmap that Konstantine and Chris D asked for. Milestone 1 is feasible and doesn't require removal of the FSN lock. We have also shown several cases of sharing other code in the future (the protocol engine). This co-development will be easier if it is in the same repo. Over time, HDSL plus the ported NN will create a new HDFS and become feature-compatible: some features will come for free because they are in the NN and will port over to the new NN; others are in the block layer (erasure coding) and will have to be added to HDSL.

You compare with YARN, HDFS, and Common. HDFS and YARN are independent, but both depend on Hadoop Common (e.g. HBase runs on HDFS without YARN). HDSL and Ozone will depend on Hadoop Common; indeed, the new protocol engine of HDSL might move to Hadoop Common or HDFS. We have made sure that there are currently no dependencies of HDFS on HDSL or Ozone.
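To make the side-by-side idea above concrete, a cluster could keep HDFS as the default filesystem and register OzoneFS as a second namespace in core-site.xml. This is a rough sketch only; the scheme (`o3fs`), implementation class, and host names here are assumptions for illustration, not confirmed names:

```xml
<!-- core-site.xml: HDFS remains the default namespace -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>

<!-- Register the OzoneFS Hadoop-compatible filesystem (class name assumed) -->
<property>
  <name>fs.o3fs.impl</name>
  <value>org.apache.hadoop.fs.ozone.OzoneFileSystem</value>
</property>
```

With both registered, a Hive or Spark job can read `o3fs://...` paths while datasets that need encryption stay on `hdfs://...` paths, much like running two Federation namespaces.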
***The Repo issue and conclusion***

The HDFS community will need to work together as we evolve old HDFS to use HDSL, the new protocol engine, and Raft, and together evolve toward a newer, more powerful set of sub-components. It is important that they are in the same repo so that we can share code through private interfaces. We are not trying to build a competing object store but to improve HDFS; fixing scalability fundamentally is hard, and we are asking for an environment where that can happen easily over the next year while heeding the stability concerns of HDFS developers (e.g. we removed the compile-time dependency and added a Maven profile). This work is not being done by members of a foreign project trying to insert code into Hadoop, but by Hadoop/HDFS developers with proven track records and active participation in Hadoop and HDFS. Our jobs depend on HDFS/Hadoop stability; destabilizing it is the last thing we want to do, and we have responded to every piece of constructive feedback.

sanjay

> On Mar 6, 2018, at 6:50 PM, J. Rottinghuis <jrottingh...@gmail.com> wrote:
>
> Sorry for jumping in late into the fray of this discussion.
>
> It seems Ozone is a large feature. I appreciate the development effort and
> the desire to get this into the hands of users.
> I understand the need to iterate quickly and to reduce overhead for
> development.
> I also agree that Hadoop can benefit from a quicker release cycle. For our
> part, this is a challenge as we have a large installation with multiple
> clusters and thousands of users. It is a constant balance between jumping
> to the newest release and the cost of integration and testing at our
> scale, especially when things aren't backwards compatible. We try to be
> good citizens and upstream our changes and contribute back.
>
> The point was made that splitting out projects such as Common and YARN
> didn't work and had to be reverted. That was painful and a lot of work for
> those involved, for sure.
> This project may be slightly different in that
> hadoop-common, YARN, and HDFS made for one consistent whole. One couldn't
> run a project without the other.
>
> Having a separate block management layer with possibly multiple block
> implementations pluggable under the covers would be a good future
> development for HDFS. Some users would choose Ozone as that layer, some
> might use S3, others GCS, or Azure, or something else.
> If the argument is made that nobody will be able to run Hadoop as a
> consistent stack without Ozone, then that would be a strong case to keep
> things in the same repo.
>
> Obviously when people do want to use Ozone, then having it in the same repo
> is easier. The flip side is that, separate top-level project in the same
> repo or not, it adds to the Hadoop releases. If there is a change in Ozone
> and a new release is needed, it would have to wait for a Hadoop release. Ditto
> if there is a Hadoop release and there is an issue with Ozone. The case
> that one could turn off Ozone through a Maven profile works only to some
> extent.
> If we have done a 3.x release with Ozone in it, would it make sense to do a
> 3.y release with y>x without Ozone in it? That would be weird.
>
> This does sound like a Hadoop 4 feature. Compatibility with lots of new
> features in Hadoop 3 needs to be worked out. We're still working on jumping
> to a Hadoop 2.9 release and then working on getting a stepping-stone release
> to 3.0 to bridge compatibility issues. I'm afraid that adding a very large new
> feature into trunk now essentially makes going to Hadoop 3 not viable for
> quite a while. That would be a bummer for all the feature work that has
> gone into Hadoop 3. Encryption and erasure coding are very appealing
> features, especially in light of meeting GDPR requirements.
>
> I'd argue to pull out those pieces that make sense in Hadoop 3, merge those
> in, and keep the rest in a separate project.
> Iterate quickly in that
> separate project; you can have a separate set of committers, you can do a
> separate release cycle. If that develops Ozone into _the_ new block layer
> for all use cases (even when people want to give up on encryption and erasure
> coding, or feature parity is reached), then we can jump off that bridge
> when we reach it. I think adding a very large chunk of code that relatively
> few people in the community are familiar with isn't necessarily going to
> help Hadoop at this time.
>
> Cheers,
>
> Joep
>
> On Tue, Mar 6, 2018 at 2:32 PM, Jitendra Pandey <jiten...@hortonworks.com>
> wrote:
>
>> Hi Andrew,
>>
>> I think we can eliminate the maintenance costs even in the same repo. We
>> can make the following changes, which incorporate suggestions from Daryn
>> and Owen as well.
>> 1. Hadoop-hdsl-project will be at the root of the hadoop repo, in a
>> separate directory.
>> 2. There will be no dependencies from common, yarn, and hdfs to hdsl/ozone.
>> 3. Based on Daryn's suggestion, hdsl can optionally (via config) be
>> loaded in the DN as a pluggable module.
>> If not loaded, there will be absolutely no code path through hdsl or
>> ozone.
>> 4. To further make it easier for folks building hadoop, we can support a
>> maven profile for hdsl/ozone. If the profile is deactivated, hdsl/ozone
>> will not be built.
>> For example, Cloudera can choose to skip even compiling/building
>> hdsl/ozone and therefore incur no maintenance overhead whatsoever.
>> HADOOP-14453 has a patch that shows how it can be done.
>>
>> Arguably, there are two kinds of maintenance costs: costs for developers
>> and costs for users.
>> - Developers: A maven profile as noted in points (3) and (4) above
>> completely addresses the concern for developers,
>> as there are no compile-time dependencies,
>> and further, they can choose not to build ozone/hdsl.
>> - Users: Cost to users will be completely alleviated if ozone/hdsl is not
>> loaded, as mentioned in point (3) above.
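Jitendra's point 4 above (a deactivatable Maven profile) could look roughly like the following fragment in the root pom.xml. The profile id and module name here are assumptions for illustration; HADOOP-14453 carries the actual patch:

```xml
<!-- Root pom.xml sketch: hdsl/ozone modules build only when the profile is enabled -->
<profiles>
  <profile>
    <id>hdsl</id>
    <activation>
      <activeByDefault>false</activeByDefault>
    </activation>
    <modules>
      <module>hadoop-hdsl-project</module>
    </modules>
  </profile>
</profiles>
```

A vendor who wants no hdsl/ozone code at all would build with plain `mvn package`; `mvn package -Phdsl` opts in, so the compile-time cost is strictly opt-in.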
>>
>> jitendra
>>
>> From: Andrew Wang <andrew.w...@cloudera.com>
>> Date: Monday, March 5, 2018 at 3:54 PM
>> To: Wangda Tan <wheele...@gmail.com>
>> Cc: Owen O'Malley <owen.omal...@gmail.com>, Daryn Sharp
>> <da...@oath.com.invalid>, Jitendra Pandey <jiten...@hortonworks.com>,
>> hdfs-dev <hdfs-...@hadoop.apache.org>, "firstname.lastname@example.org" <
>> email@example.com>, "yarn-...@hadoop.apache.org" <
>> yarn-...@hadoop.apache.org>, "mapreduce-...@hadoop.apache.org" <
>> mapreduce-...@hadoop.apache.org>
>> Subject: Re: [VOTE] Merging branch HDFS-7240 to trunk
>>
>> Hi Owen, Wangda,
>>
>> Thanks for clearly laying out the subproject options; that helps the
>> discussion.
>>
>> I'm all onboard with the idea of regular releases, and it's something I
>> tried to do with the 3.0 alphas and betas. The problem though isn't a lack
>> of commitment from feature developers like Sanjay or Jitendra; far from it!
>> I think every feature developer makes a reasonable effort to test their
>> code before it's merged. Yet my experience as an RM is that more code
>> comes with more risk. I don't believe that Ozone is special or different in
>> this regard. It comes with a maintenance cost, not a maintenance benefit.
>>
>> I'm advocating for #3: separate source, separate release. Since HDSL
>> stability and FSN/BM refactoring are still a ways out, I don't want to
>> incur a maintenance cost now. I sympathize with the sentiment that working
>> cross-repo is harder than within the same repo, but the right tooling can
>> make this a lot easier (e.g. git submodule, Google's repo tool). We have
>> experience doing this internally here at Cloudera, and I'm happy to share
>> knowledge and possibly code.
>>
>> Best,
>> Andrew
>>
>> On Fri, Mar 2, 2018 at 4:41 PM, Wangda Tan <wheele...@gmail.com> wrote:
>> I like the idea of same source / same release and putting Ozone's source
>> under a different directory.
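Andrew's cross-repo tooling suggestion above (git submodule) would amount to an entry like this in a `.gitmodules` file at the Hadoop root; the repository URL and path are invented for illustration:

```
[submodule "hadoop-ozone"]
	path = hadoop-ozone
	url = https://gitbox.apache.org/repos/asf/hadoop-ozone.git
```

Hadoop's tree would then pin a specific Ozone commit, fetched with `git submodule update --init`, while releases of the two projects stay decoupled.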
>>
>> Like Owen mentioned, it's going to be important for all parties to keep a
>> regular and shorter release cycle for Hadoop, e.g. 3-4 months between minor
>> releases. Users can try features and give feedback to stabilize features
>> earlier; developers can be happier since their efforts will be consumed by
>> users soon after features get merged. In addition, if features are merged
>> to trunk after reasonable tests/review, Andrew's concern may not be a
>> problem anymore:
>>
>> bq. Finally, I earnestly believe that Ozone/HDSL itself would benefit from
>> being a separate project. Ozone could release faster and iterate more
>> quickly if it wasn't hampered by Hadoop's release schedule and security and
>> compatibility requirements.
>>
>> Thanks,
>> Wangda
>>
>> On Fri, Mar 2, 2018 at 4:24 PM, Owen O'Malley <owen.omal...@gmail.com>
>> wrote:
>> On Thu, Mar 1, 2018 at 11:03 PM, Andrew Wang <andrew.w...@cloudera.com>
>> wrote:
>>
>>> Owen mentioned making a Hadoop subproject; we'd have to
>>> hash out what exactly this means (I assume a separate repo still managed
>>> by the Hadoop project), but I think we could make this work if it's more
>>> attractive than incubation or a new TLP.
>>
>> Ok, there are multiple levels of sub-projects that all make sense:
>>
>> - Same source tree, same releases - examples like HDFS & YARN
>> - Same master branch, separate releases and release branches - Hive's
>> Storage API vs Hive. It is in the source tree for the master branch, but
>> has distinct releases and release branches.
>> - Separate source, separate release - Apache Commons.
>>
>> There are advantages and disadvantages to each. I'd propose that we use the
>> same source, same release pattern for Ozone. Note that we tried and later
>> reverted doing Common, HDFS, and YARN as separate source, separate release
>> because it was too much trouble.
>> I like Daryn's idea of putting it as a top-level
>> directory in Hadoop and making sure that nothing in Common, HDFS, or
>> YARN depends on it. That way, if a Release Manager doesn't think it is
>> ready for release, it can be trivially removed before the release.
>>
>> One thing about using the same releases: Sanjay and Jitendra are signing up
>> to make much more regular bugfix and minor releases in the near future. For
>> example, they'll need to make 3.2 relatively soon to get it released and
>> then 3.3 somewhere in the next 3 to 6 months. That would be good for the
>> project. Hadoop needs more regular releases and fewer big-bang releases.
>>
>> .. Owen

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org