Hi Dave,

Your opinion is very much appreciated.

Thanks,
--Konstantin

On Wed, Mar 21, 2012 at 5:36 AM, Dave Shine <dave.sh...@channelintelligence.com> wrote:
> I am not a contributor to this project, so I don't know how much weight my
> opinion carries. But I have been hoping to see append become stable soon.
> We are constantly dealing with the "small file problem", and I have written
> M/R jobs to periodically roll up lots of small files into a few small ones.
> Having append would prevent me from needing to use up cluster resources
> performing these tasks.
>
> Therefore, all things being equal I +1 making append work. However, if the
> level of complexity is as bad as Eli implies below, then I can understand
> that perhaps it is not worth the effort. If it will cause too much technical
> debt, then removing it makes sense. But don't just remove it because you
> don't believe there is a need for it.
>
> Thanks,
> Dave Shine
>
>
> -----Original Message-----
> From: Eli Collins [mailto:e...@cloudera.com]
> Sent: Tuesday, March 20, 2012 8:38 PM
> To: hdfs-dev@hadoop.apache.org
> Subject: [DISCUSS] Remove append?
>
> Hey gang,
>
> I'd like to get people's thoughts on the following proposal. I think we
> should consider removing append from HDFS.
>
> Where we are today.. append was added in the 0.17-19 releases
> (HADOOP-1700) and subsequently disabled (HADOOP-5224) due to quality issues.
> It and sync were re-designed, re-implemented, and shipped in
> 21.0 (HDFS-265). To my knowledge, there has been no real production use.
> Anecdotally people who worked on branch-20-append have told me they think the
> new trunk code is substantially less well-tested than the branch-20-append
> code (at least for sync, append was never well tested). It has certainly
> gotten way less pounding from HBase users.
> The design however, is much improved, and people think we can get hsync (and
> append) stabilized in trunk (mostly testing and bug fixing).
>
> Rationale follows..
>
> Append does not seem to be an important requirement, hflush was. There has
> not been much demand for append, from users or downstream projects. Because
> Hadoop 1.x does not have a working append implementation (see HDFS-3120, the
> branch-20-append work was focused on sync not getting append working) which
> is not enabled by default and downstream projects will want to support Hadoop
> 1.x releases for years, most will not introduce dependencies on append
> anyway. This is not to say demand does not exist, just that if it does, it's
> been much smaller than security, sync, HA, backwards compatible RPC, etc.
> This probably explains why, over 5 years after the original implementation
> started, we don't have a stable release with append.
>
> Append introduces non-trivial design and code complexity, which is not worth
> the cost if we don't have real users. Removing append means we have the
> property that HDFS blocks, when finalized, are immutable.
> This significantly simplifies the design and code, which significantly
> simplifies the implementation of other features like snapshots, HDFS-level
> caching, dedupe, etc.
>
> The vast majority of the HDFS-265 effort is still leveraged w/o append. The
> new data durability and read consistency behavior was the key part.
>
> GFS, which HDFS' design is based on, has append (and atomic record
> append) so obviously a workable design does not preclude append.
> However we also should not ape the GFS feature set simply because it exists.
> I've had conversations with people who worked on GFS that regret adding
> record append (see also http://queue.acm.org/detail.cfm?id=1594206). In
> short, unless append is a real priority for our users I think we should focus
> our energy elsewhere.
>
> Thanks,
> Eli