On Wed, Sep 24, 2014 at 4:03 PM, Arpit Agarwal <aagar...@hortonworks.com> wrote:
>> I would appreciate it in the future if benchmarks were posted a day or
>> two before a merge vote on a performance improvement is called.
>> I feel that it would have been better to float the idea of a merge on
>> the JIRA before actually calling it,
>
> Colin, I noted my intention to start the merge vote on the Jira last week.
> In response you asked for verification that read performance will not
> regress. Read results were posted prior to starting the vote. I agree
> that the write numbers were not posted until one day after the merge vote
> started. If we delay the vote expiration by a day, will it help address
> any remaining timing concerns?
Thank you for the offer. I am OK with the current vote expiration time,
now that we've seen more benchmarks and discussed more potential issues.
I am -0 on the merge vote, but I do want to note that I consider
HDFS-7141 to be a blocker for merging to branch-2.6.

>> There are two areas where I think we need more clarification. The
>> first is whether other eviction strategies can be implemented besides
>> LRU.
>
> If 2Q or another scheme requires more hooks, we can certainly add to the
> pluggable interface. It is not a public interface that is set in stone.
> It is similar to BlockPlacementPolicy and VolumeChoosingPolicy. Both are
> HDFS-internal interfaces with alternative implementations that we rev
> frequently.

I implemented a modified 2Q today on HDFS-7142... I'd appreciate a
review. Although I haven't had time to do a lot of testing on it, I
think this removes it as a blocker in my mind.

>> The other area is how the memory management for HDFS-6581 fits in with
>> the memory management for HDFS-4949. I am very concerned that this
>> has been handwaved away as future work. If we create a configuration
>> mess for system administrators as a result, I will be sad.
>
> You asked about memory management on 8/21. I responded the same day,
> stating that we are not introducing any configuration settings since
> mounting the disk is done by an administrator. Your response did not
> indicate any concern with this approach. You can see the comment history
> on HDFS-6581. So I am surprised you raise it as a blocker one month
> later. We have not introduced new limits as part of this change, so
> there is no concern of future config incompatibilities. The Cache
> Executor and Lazy Writer can be updated to be aware of each other's
> memory usage in a compatible way.
>
> What we are voting on here is merging to trunk. If you have additional
> reasonable concerns that you would like to see addressed prior to 2.6,
> we can have a separate discussion about 2.6 blockers.
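[Editorial note: for readers unfamiliar with the 2Q scheme discussed above: the idea is to admit a block on first access into a small probationary FIFO queue and promote it to the main LRU queue only if it is touched again, so a one-time scan churns the probationary queue without flushing the hot set. The toy sketch below is a simplified 2Q in plain Python, illustrative only; it omits 2Q's ghost (A1out) queue, is not the HDFS-7142 patch, and the queue names are ours.]

```python
from collections import OrderedDict

class TwoQCache:
    """Toy simplified-2Q cache: first-touch keys enter a small FIFO
    probationary queue (a1); a key touched again while resident is
    promoted to the main LRU queue (am). A one-time scan therefore
    only churns a1 and cannot flush the hot set held in am."""

    def __init__(self, a1_size, am_size):
        self.a1 = OrderedDict()  # probationary FIFO
        self.am = OrderedDict()  # main LRU
        self.a1_size, self.am_size = a1_size, am_size

    def access(self, key):
        if key in self.am:            # hot key: refresh its LRU position
            self.am.move_to_end(key)
        elif key in self.a1:          # second touch: promote to am
            del self.a1[key]
            self.am[key] = True
            if len(self.am) > self.am_size:
                self.am.popitem(last=False)   # evict LRU entry of am
        else:                         # first touch: probation only
            self.a1[key] = True
            if len(self.a1) > self.a1_size:
                self.a1.popitem(last=False)   # evict oldest probationer

    def __contains__(self, key):
        return key in self.am or key in self.a1

cache = TwoQCache(a1_size=4, am_size=4)
for k in ("hot1", "hot2"):       # touch twice to promote into am
    cache.access(k)
    cache.access(k)
for i in range(1000):            # one-time sequential scan
    cache.access(("scan", i))
# hot entries survive the scan; under plain LRU they would not
```

The scan keys are each touched once, so they never leave the probationary queue and the promoted hot entries are untouched.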
I think it's fine to address HDFS-7141 after the merge to trunk. But as
I noted above, I think we absolutely need to address it before the merge
to 2.6. We are starting to see a lot of users of HDFS-4949, and I want
to make sure that there is a reasonable story for using both features at
the same time. Let's continue this discussion on HDFS-6919 and HDFS-6988
and see if we can come up with a solution that works for everyone.

best,
Colin

>
> Regards,
> Arpit
>
>
> On Wed, Sep 24, 2014 at 2:19 PM, Colin McCabe <cmcc...@alumni.cmu.edu> wrote:
>
>> On Wed, Sep 24, 2014 at 11:12 AM, Suresh Srinivas
>> <sur...@hortonworks.com> wrote:
>> > Features are done in a separate branch for two reasons: 1) During
>> > feature development the branch may not be functional. 2) The
>> > high-level approach and design are not very clear, and development
>> > can continue while that is being sorted out.
>>
>> Up until the last two days, we had no benchmarks at all, so we didn't
>> have any way to evaluate whether this performance improvement was
>> functional. Andrew commented on this as well in this thread, and we
>> also raised the issue on the JIRA. I am glad that a few benchmarks
>> have finally been posted. I would appreciate it in the future if
>> benchmarks were posted a day or two before a merge vote on a
>> performance improvement is called. As it is, it feels like we are
>> racing the clock right now to figure out how well this works, and it
>> puts the reviewers in an unpleasant position.
>>
>> > In case of this feature, clearly (2) is not an issue. We have had
>> > enough discussion about the approach. I also think this branch is
>> > ready for merge without rendering trunk non-functional.
>>
>> I agree that this can be merged without rendering trunk
>> non-functional. I don't agree that we have achieved consensus on all
>> of the high-level approach and design.
>>
>> There are two areas where I think we need more clarification.
>> The first is whether other eviction strategies can be implemented
>> besides LRU. Yes, there is a pluggable interface, but I am concerned
>> that it may not be flexible enough to implement anything besides LRU.
>> I am working on a patch now that implements 2Q. I will find out for
>> myself, I guess.
>>
>> The other area is how the memory management for HDFS-6581 fits in
>> with the memory management for HDFS-4949. I am very concerned that
>> this has been handwaved away as future work. If we create a
>> configuration mess for system administrators as a result, I will be
>> sad. This is the kind of design that I would have liked to see done a
>> while ago. I would certainly veto any merge to 2.6 until we have
>> figured this out.
>>
>> I commented about this a month ago:
>> https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14106025&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14106025
>> But my concerns were not addressed in the branch.
>>
>> If these two design clarifications can be made in time for the merge,
>> I will be -0 on this. I have opened HDFS-7141 and HDFS-7142 to
>> continue the discussion on these points.
>>
>> > That leaves us with one objection that is being raised: the content
>> > is not complete. This did not prevent the recently merged crypto
>> > file system work from being merged to trunk. We discussed and
>> > decided that it could be merged to trunk and we would finish the
>> > remaining must-have work in trunk before merging to branch-2. That
>> > is a reasonable approach for this feature as well. In fact there
>> > are a lot of bug fixes and improvements still happening on the
>> > crypto work even after the branch-2 merge (which is a good thing in
>> > my opinion; it shows that the testing is continuing and the feature
>> > is still being improved).
>>
>> I am OK with merging this before it is complete.
>> I was just pointing out that this could have benefited from more time
>> in development. It would have been a much smoother merge and less
>> stressful for everyone.
>>
>> > As regards the eviction policy, we have had discussion in the jira.
>> > I disagree that there exists a "usable" or "better" strategy that
>> > works well for all workloads. I also disagree that an additional
>> > policy beyond the existing LRU is needed, let alone that it is
>> > needed for the trunk merge. New eviction policies need to be
>> > developed as people experiment with the feature. Hence making the
>> > eviction policy pluggable is a great way to approach it.
>>
>> LRU is a poor policy for scan workloads. For example, see here:
>> http://www.cse.iitd.ernet.in/~sbansal/os/lec/l30.html
>> "In general, LRU works well, but there are certain workloads where
>> LRU performs very poorly. One well-known workload is a "scan" wherein
>> a large memory region (larger than physical memory) is accessed in
>> quick succession. e.g... Such scans are often quite common, as some
>> thread may want to go through all its data. And so, plain LRU (or
>> CLOCK) does not work as well in practice. Instead a variant of LRU
>> that considers both recency and frequency of access to a page is
>> typically used in an OS (e.g., 2Q, CLOCK with Adaptive Replacement,
>> etc.)."
>>
>> So is your argument that we are not using this for scanning
>> workloads? That seems tough to believe, and as I said earlier, you
>> have posted no system-level benchmarks, so how can I evaluate that
>> claim?
>>
>> > There is some testing information posted on the jira, such as
>> > performance. But I think we need more details on how this feature
>> > is being tested. Arpit, can you please post test details? As for
>> > me, given how much work (both discussion and coding) has gone into
>> > it, and given it is a very important feature that allows
>> > experimentation with the memory tier, I would like to see this
>> > available in release 2.6.
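[Editorial note: the scan problem in the quoted lecture notes is easy to reproduce: with plain LRU, one sequential pass over more distinct blocks than the cache holds evicts every hot entry, no matter how frequently those entries were accessed before. A toy illustration in plain Python, not HDFS code:]

```python
from collections import OrderedDict

def lru_access(cache, key, capacity):
    """Touch `key` in an OrderedDict used as a plain LRU cache,
    evicting the least-recently-used entry when over capacity."""
    if key in cache:
        cache.move_to_end(key)          # refresh recency
    else:
        cache[key] = True
        if len(cache) > capacity:
            cache.popitem(last=False)   # evict LRU entry

cache = OrderedDict()
for k in ("hot1", "hot2", "hot3"):      # frequently used working set
    for _ in range(100):
        lru_access(cache, k, capacity=8)
for i in range(1000):                   # one sequential scan, each block once
    lru_access(cache, ("scan", i), capacity=8)
# the scan flushed every hot entry despite their 100x access frequency;
# the cache now holds only the tail of the one-time scan keys
```

Scan-resistant policies (2Q, ARC, and friends) avoid exactly this by weighing frequency as well as recency.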
>> Just to reiterate, I am -0 provided we can address HDFS-7141 and
>> HDFS-7142 before this gets set in stone. I would hate to -1 this,
>> because it would mean that you could not call another vote for a
>> week. But I feel that it would have been better to float the idea of
>> a merge on the JIRA before actually calling it, to avoid having
>> discussions like this where we are racing the clock.
>>
>> thanks,
>> Colin
>>
>> > On Tue, Sep 23, 2014 at 6:09 PM, Colin McCabe <cmcc...@alumni.cmu.edu> wrote:
>> >
>> >> This seems like a really aggressive timeframe for a merge. We
>> >> still haven't implemented:
>> >>
>> >> * Checksum skipping on read and write from lazy-persisted replicas.
>> >> * Allowing mmapped reads from the lazy-persisted data.
>> >> * Any eviction strategy other than LRU.
>> >> * Integration with cache pool limits (how do HDFS-4949 and lazy
>> >>   persist replicas share memory?)
>> >> * Eviction from RAM disk via truncation (HDFS-6918).
>> >> * Metrics.
>> >> * System testing to find out how useful this is, and what the best
>> >>   eviction strategy is.
>> >>
>> >> I see why we might want to defer checksum skipping, metrics,
>> >> allowing mmap, eviction via truncation, and so forth until later.
>> >> But I feel like we need to figure out how this will integrate with
>> >> the memory used by HDFS-4949 before we merge. I also would like to
>> >> see another eviction strategy besides LRU, which is a very poor
>> >> eviction strategy for scanning workloads. I mentioned this a few
>> >> times on the JIRA.
>> >>
>> >> I'd also like to get some idea of how much testing this has
>> >> received in a multi-node cluster. What makes us confident that
>> >> this is the right time to merge, rather than in a week or two?
>> >>
>> >> best,
>> >> Colin
>> >>
>> >>
>> >> On Tue, Sep 23, 2014 at 4:55 PM, Arpit Agarwal <aagar...@hortonworks.com> wrote:
>> >> > I have posted write benchmark results to the Jira.
>> >> >
>> >> > On Tue, Sep 23, 2014 at 3:41 PM, Arpit Agarwal <aagar...@hortonworks.com> wrote:
>> >> >
>> >> >> Hi Andrew, I said "it is not going to be a substantial fraction
>> >> >> of memory bandwidth". That is certainly not the same as saying
>> >> >> it won't be good or there won't be any improvement.
>> >> >>
>> >> >> Any time you have transfers over RPC or the network stack, you
>> >> >> will not get close to the memory bandwidth, even for intra-host
>> >> >> transfers.
>> >> >>
>> >> >> I'll add some micro-benchmark results to the Jira shortly.
>> >> >>
>> >> >> Thanks,
>> >> >> Arpit
>> >> >>
>> >> >> On Tue, Sep 23, 2014 at 2:33 PM, Andrew Wang <andrew.w...@cloudera.com> wrote:
>> >> >>
>> >> >>> Hi Arpit,
>> >> >>>
>> >> >>> Here is the comment. It was certainly not my intention to
>> >> >>> misquote anyone.
>> >> >>>
>> >> >>> https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14138223&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14138223
>> >> >>>
>> >> >>> Quote:
>> >> >>>
>> >> >>> It would be nice to see that we could get a substantial
>> >> >>> fraction of memory bandwidth when writing to a single replica
>> >> >>> in-memory.
>> >> >>>
>> >> >>> The comparison will be interesting but I can tell you without
>> >> >>> measurement it is not going to be a substantial fraction of
>> >> >>> memory bandwidth. We are still going through
>> >> >>> DataTransferProtocol with all the copies and overhead that
>> >> >>> involves.
>> >> >>>
>> >> >>> When the goal is in-memory writes and we are unable to achieve
>> >> >>> a substantial fraction of memory bandwidth, to me that is "not
>> >> >>> good performance."
>> >> >>>
>> >> >>> I also looked through the subtasks, and AFAICT the only one
>> >> >>> related to improving this is deferring checksum computation.
>> >> >>> The benchmarking we did on HDFS-4949 showed that this only
>> >> >>> really helps when you're down to a single copy or zero copies
>> >> >>> with SCR/ZCR. DTP reads didn't see much of an improvement, so
>> >> >>> I'd guess the same would be true for DTP writes.
>> >> >>>
>> >> >>> I think my above three questions are still open, as well as my
>> >> >>> question about why we're merging now, as opposed to when the
>> >> >>> performance of the branch is proven out.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Andrew
>> >> >>>
>> >> >>> On Tue, Sep 23, 2014 at 2:10 PM, Arpit Agarwal <aagar...@hortonworks.com> wrote:
>> >> >>>
>> >> >>> > Andrew, don't misquote me. Can you link the comment where I
>> >> >>> > said performance wasn't going to be good?
>> >> >>> >
>> >> >>> > I will add some preliminary write results to the Jira later
>> >> >>> > today.
>> >> >>> >
>> >> >>> > > What's the plan to improve write performance?
>> >> >>> > I described this in response to your and Colin's comments on
>> >> >>> > the Jira.
>> >> >>> >
>> >> >>> > For the benefit of folks not following the Jira, the
>> >> >>> > immediate task we'd like to get done post-merge is moving
>> >> >>> > checksum computation off the write path. Also see the open
>> >> >>> > subtasks of HDFS-6581 for other planned perf improvements.
>> >> >>> >
>> >> >>> > Thanks,
>> >> >>> > Arpit
>> >> >>> >
>> >> >>> >
>> >> >>> > On Tue, Sep 23, 2014 at 1:07 PM, Andrew Wang <andrew.w...@cloudera.com> wrote:
>> >> >>> >
>> >> >>> > > Hi Arpit,
>> >> >>> > >
>> >> >>> > > On HDFS-6581, I asked for write benchmarks on Sep 19th, and
>> >> >>> > > you responded that the performance wasn't going to be good.
>> >> >>> > > However, I thought the primary goal of this JIRA was to
>> >> >>> > > improve write performance, and write performance is listed
>> >> >>> > > as the first feature requirement in the design doc.
>> >> >>> > >
>> >> >>> > > So, this leads me to a few questions, which I also asked
>> >> >>> > > last week on the JIRA (I believe still unanswered):
>> >> >>> > >
>> >> >>> > > - What's the plan to improve write performance?
>> >> >>> > > - What kind of performance can we expect after the plan is
>> >> >>> > >   completed?
>> >> >>> > > - Can this expected performance be validated with a
>> >> >>> > >   prototype?
>> >> >>> > >
>> >> >>> > > Even with these questions answered, I don't understand the
>> >> >>> > > need to merge this before the write optimization work is
>> >> >>> > > completed. Write perf is listed as a feature requirement,
>> >> >>> > > so the branch can reasonably be called not feature complete
>> >> >>> > > until it's shown to be faster.
>> >> >>> > >
>> >> >>> > > Thanks,
>> >> >>> > > Andrew
>> >> >>> > >
>> >> >>> > > On Tue, Sep 23, 2014 at 11:47 AM, Jitendra Pandey <jiten...@hortonworks.com> wrote:
>> >> >>> > >
>> >> >>> > > > +1. I have reviewed most of the code in the branch, and I
>> >> >>> > > > think it's ready to be merged to trunk.
>> >> >>> > > >
>> >> >>> > > > On Mon, Sep 22, 2014 at 5:24 PM, Arpit Agarwal <aagar...@hortonworks.com> wrote:
>> >> >>> > > >
>> >> >>> > > > > HDFS Devs,
>> >> >>> > > > >
>> >> >>> > > > > We propose merging the HDFS-6581 development branch to
>> >> >>> > > > > trunk.
>> >> >>> > > > >
>> >> >>> > > > > The work adds support for writing HDFS blocks in
>> >> >>> > > > > memory. The target use case covers applications writing
>> >> >>> > > > > relatively small, intermediate data sets with low
>> >> >>> > > > > latency. We introduce a new CreateFlag for the existing
>> >> >>> > > > > CreateFile API.
>> >> >>> > > > > HDFS will subsequently attempt to place replicas of
>> >> >>> > > > > file blocks in local memory, with disk writes occurring
>> >> >>> > > > > off the hot path. The current design is a
>> >> >>> > > > > simplification of original ideas from Sanjay Radia on
>> >> >>> > > > > HDFS-5851.
>> >> >>> > > > >
>> >> >>> > > > > Key goals of the feature were minimal API changes, to
>> >> >>> > > > > reduce application burden, and best-effort data
>> >> >>> > > > > durability. The feature is optional and requires
>> >> >>> > > > > appropriate DN configuration from administrators.
>> >> >>> > > > >
>> >> >>> > > > > Design doc:
>> >> >>> > > > > https://issues.apache.org/jira/secure/attachment/12661926/HDFSWriteableReplicasInMemory.pdf
>> >> >>> > > > >
>> >> >>> > > > > Test plan:
>> >> >>> > > > > https://issues.apache.org/jira/secure/attachment/12669452/Test-Plan-for-HDFS-6581-Memory-Storage.pdf
>> >> >>> > > > >
>> >> >>> > > > > There are 28 resolved sub-tasks under HDFS-6581, 3 open
>> >> >>> > > > > tasks for tests and Jenkins issues, and 7 open subtasks
>> >> >>> > > > > tracking planned improvements. The latest merge patch
>> >> >>> > > > > is 3300 lines of changed code, of which 1300 lines are
>> >> >>> > > > > new and updated tests. Merging the branch to trunk will
>> >> >>> > > > > allow HDFS applications to start evaluating the
>> >> >>> > > > > feature. We will continue work on documentation,
>> >> >>> > > > > performance tuning and metrics in parallel with the
>> >> >>> > > > > vote and post-merge.
>> >> >>> > > > >
>> >> >>> > > > > Contributors to the design and code include Xiaoyu Yao,
>> >> >>> > > > > Sanjay Radia, Jitendra Pandey, Tassapol Athiapinya,
>> >> >>> > > > > Gopal V, Bikas Saha, Vikram Dixit, Suresh Srinivas and
>> >> >>> > > > > Chris Nauroth.
>> >> >>> > > > >
>> >> >>> > > > > Thanks to Haohui Mai, Colin Patrick McCabe, Andrew
>> >> >>> > > > > Wang, Todd Lipcon, Eric Baldeschwieler and Vinayakumar
>> >> >>> > > > > B for providing useful feedback on HDFS-6581, HDFS-5851
>> >> >>> > > > > and sub-tasks.
>> >> >>> > > > >
>> >> >>> > > > > The vote runs for the usual 7 days and will expire at
>> >> >>> > > > > 12am PDT on Sep 30. Here is my +1 for the merge.
>> >> >>> > > > >
>> >> >>> > > > > Regards,
>> >> >>> > > > > Arpit
>> >> >>> > > > >
>> >> >>> > > > > --
>> >> >>> > > > > CONFIDENTIALITY NOTICE
>> >> >>> > > > > NOTICE: This message is intended for the use of the
>> >> >>> > > > > individual or entity to which it is addressed and may
>> >> >>> > > > > contain information that is confidential, privileged
>> >> >>> > > > > and exempt from disclosure under applicable law. If the
>> >> >>> > > > > reader of this message is not the intended recipient,
>> >> >>> > > > > you are hereby notified that any printing, copying,
>> >> >>> > > > > dissemination, distribution, disclosure or forwarding
>> >> >>> > > > > of this communication is strictly prohibited. If you
>> >> >>> > > > > have received this communication in error, please
>> >> >>> > > > > contact the sender immediately and delete it from your
>> >> >>> > > > > system. Thank You.