Marcello, thanks for the feedback. As for whether a particular feature is
"worth the effort", I can't answer that: this is just a proposal, and you
can assign different priorities to different tasks, or drop them
completely. I have updated the document (it's now at version 3) based on
your and Josh's comments, and tried to make clearer what the proposed
changes are (based on our previous discussions and email threads).
I should note that:

- I don't know how to set up a test scenario for multicast messages,
and I don't believe that we discussed that with Dan and Way. I didn't know
about the "slow reader" problem.
- In "Sender vs Receiver", you say "In short one cannot assume that rate
limiting a sender solves the problems of slow readers. The system would
certainly be better if both were implemented, but we cannot replace the
existing implementation with its complement". If you look at the wording
of the original ASACORE-2946, which originated this whole discussion, it says
"I believe that in this case the RN should punish/disconnect the noisy LN2
and others, not the innocent but slowish LN1." So it looks like it is
asking for "rate limiting a sender". We might be able to preserve existing
behavior (disconnecting RNs) while also adding disconnect conditions for RN.
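
To illustrate how the two disconnect conditions could coexist, here is a
minimal sketch. All names (SenderBudget, EndpointState, Evaluate, and the
thresholds) are hypothetical and do not come from the existing
RemoteEndpoint code. It only assumes that the RN can observe, per endpoint,
the age of the oldest undelivered message and the arrival times of incoming
messages.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical names throughout; nothing here is taken from the AllJoyn code.
using Clock = std::chrono::steady_clock;
using Millis = std::chrono::milliseconds;

enum class Verdict { Keep, DisconnectSlowReader, DisconnectNoisySender };

// Token bucket that tracks how fast a single sender injects messages.
class SenderBudget {
  public:
    SenderBudget(double ratePerSec, double burst)
        : rate_(ratePerSec), burst_(burst), tokens_(burst) {
        assert(ratePerSec > 0.0 && burst >= 1.0);
    }

    // Call once per incoming message. Returns false if the sender is over budget.
    bool Consume(Clock::time_point now) {
        if (!started_) {
            last_ = now;
            started_ = true;
        }
        std::chrono::duration<double> elapsed = now - last_;
        last_ = now;
        // Refill in proportion to elapsed time, capped at the burst size.
        tokens_ = std::min(burst_, tokens_ + elapsed.count() * rate_);
        if (tokens_ < 1.0) {
            return false;
        }
        tokens_ -= 1.0;
        return true;
    }

  private:
    double rate_;
    double burst_;
    double tokens_;
    Clock::time_point last_{};
    bool started_ = false;
};

// What the RN is assumed to know about one connected endpoint.
struct EndpointState {
    Millis oldestQueuedAge;       // reader side: age of oldest undelivered message
    std::size_t overBudgetCount;  // sender side: messages rejected by SenderBudget
};

// The slow-reader condition is kept; the noisy-sender condition is added next to it.
Verdict Evaluate(const EndpointState& ep, Millis maxQueuedAge, std::size_t maxOverBudget) {
    if (ep.overBudgetCount > maxOverBudget) {
        return Verdict::DisconnectNoisySender;
    }
    if (ep.oldestQueuedAge > maxQueuedAge) {
        return Verdict::DisconnectSlowReader;
    }
    return Verdict::Keep;
}
```

In this sketch the slow-reader check is unchanged, and the token bucket only
adds a second, independent reason to disconnect. The threshold values would
still need to be defined.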

I had to add unresolved questions as "TBD" (to be defined later) in the
document. This is partly due to my knowledge gaps. If you had previous
discussions and documents regarding these issues, you understand them much
better, and can replace the TBD sections with concrete proposals. Also,
please feel free to make changes where you feel I was not reflecting your
comments.

Let me know if you would like me to place this updated document in the Wiki.

                   - Andrey


On Mon, Jul 18, 2016 at 12:36 PM, Lioy, Marcello <[email protected]>
wrote:

> Apologies for the delay on providing feedback.  Comments below:
>
> *Memory Usage*
>
> I thought we had discussed this in the past and decided it wasn’t worth
> the effort.  Has that changed?  From a technical perspective this seems a
> reasonable approach, particularly if it is being used elsewhere in the
> codebase as the code snippets seem to imply.  If implemented, this should
> be used in the routing node ranking calculations: i.e. the desirability
> as an RN for TC nodes should drop as less memory is available.
>
>
>
> *Refactoring*
>
> Completely agree. I suspect the reason those are separate methods was more
> risk mitigation than anything else.  My question on this is: is it
> worth the effort?  I assume the main goal there is simply maintainability
> in that there is a common code path rather than two (largely) duplicate
> code paths?
>
>
>
> *Message Queues*
>
> There was a lot of work done to try and make sure that control messages
> could flow even in the presence of heavy traffic.  I believe that was the
> reason for splitting out the handling of data vs. control plane messaging.
> Todd, Sheshambika and to a lesser extent myself spent a lot of time
> discussing system behavior and how to avoid “congestive collapse”, which is
> something that plagued early alliance releases; these changes were made to
> try and address those issues.  An example of congestive collapse was when
> you had a slow reader (i.e. a node that dequeued messages very slowly): this
> would back traffic up.  Unfortunately, for any multicast messages, that meant
> other nodes would also not get their traffic, and even worse, this
> particular RN would stop sending and receiving traffic, creating another
> slow reader.  This would spread through the system until no traffic moved.
> Incidentally, this problem is also why things like the UDP keep-alive are
> built into ARDP rather than a higher layer.
>
>
>
> In general there is no specific proposal about what to change; it mainly
> describes the current behavior and implies there is a problem: e.g.
> suggesting that the PushXXXNode() messages have the removal mechanism
> reworked, but it isn’t clear to me what the proposed change is.
>
>
>
> Before anything is changed here I would like to understand exactly what
> problem is being solved by touching this code, and that we understand what
> the implications would be for a system under load (like the slow reader
> scenario above).  This code is at the center of how the system works, and
> while I cannot dispute that things could be done better there, we need to
> be really careful about what those changes are and how they will impact the
> system.
>
>
>
> *Timeouts*
>
> Again, this is very touchy code, and before changing it we need to
> understand what problem is being solved.  It is important to characterize
> the existing behavior, and what the new behavior would be.
>
>
>
> *Sender vs Receiver*
>
> The problem being solved here is what we saw in practice: we had nodes
> that crashed, or were CPU bound and so were not draining the queues: so the
> corresponding RN would back up traffic, which would then cause it to become
> less responsive and so on.  Yes we could add some mechanisms to flow
> control applications that tend to “spew” traffic onto the network.  This
> however will not solve the problem of excessively slow readers, though it
> might help mitigate the issue.  In short one cannot assume that rate
> limiting a sender solves the problems of slow readers.  We solved the slow
> reader problem, not the aggressive sender problem.  The system would
> certainly be better if both were implemented, but we cannot replace the
> existing implementation with its complement.
>
>
>
> *From:* [email protected] [mailto:
> [email protected]] *On Behalf Of *Lioy,
> Marcello
> *Sent:* Thursday, July 14, 2016 4:09 PM
> *To:* Andrey Krokhin <[email protected]>; Daniel Mihai <
> [email protected]>; Way Vadhanasin <
> [email protected]>; Josh Spain <[email protected]>;
> Arvind Padole <[email protected]>
> *Cc:* '[email protected]' <
> [email protected]>
> *Subject:* Re: [Allseen-core] ASACORE-2946 Notes and Scope of Changes
>
>
>
> + working group
>
>
>
> I have uploaded the proposal to the wiki in the Technical Proposals
> <https://wiki.allseenalliance.org/core/overview?&#technical_proposals>
> section.  I will review and provide feedback before the end of the week.
>
>
>
> *From:* Andrey Krokhin [mailto:[email protected]]
> *Sent:* Monday, July 11, 2016 11:22 AM
> *To:* Daniel Mihai <[email protected]>; Way Vadhanasin <
> [email protected]>; Josh Spain <[email protected]>;
> Lioy, Marcello <[email protected]>; Arvind Padole <
> [email protected]>
> *Subject:* ASACORE-2946 Notes and Scope of Changes
>
>
>
> Hello Marcello and Arvind,
>
>
>
> Please take a look at the attached document, containing an outline of
> proposed changes to the RemoteEndpoint class. It originated from
> ASACORE-2946, but I think it needs to be split into several JIRA tickets
> due to scope of changes.
>
>
>
> Please provide your feedback on the document, and let me know if we need
> to have a separate conference call to discuss it.
>
>
>
>                        Thanks,
>
>                        Andrey
>
>
>
>
>
> ---------- Forwarded message ----------
> From: *Daniel Mihai* <[email protected]>
> Date: Fri, Jul 8, 2016 at 8:14 PM
> Subject: RE: ASACORE-2946 Notes and Scope of Changes
> To: Andrey Krokhin <[email protected]>
> Cc: Way Vadhanasin <[email protected]>, Josh Spain <
> [email protected]>
>
> Nice work Andrey! This doc looks like a good starting point for further
> discussions with Marcello and others.
>
>
>
> *I think you should check with Josh about the next step.*
>
>
>
> If this was my doc, I would:
>
> 1. Send out the doc to more people, including Marcello and Arvind,
> asking for feedback
>
> 2. Ask the same folks if we should have an “ad-hoc technical
> meeting” to discuss these proposals and gather feedback
>
>
>
> I expect that Marcello will have useful feedback – either in writing or in
> the ad-hoc meeting.
>
>
>
> Thanks.
>
>
>
> *From:* Andrey Krokhin [mailto:[email protected]]
> *Sent:* Thursday, July 7, 2016 5:33 AM
> *To:* Daniel Mihai <[email protected]>
>
>
> *Cc:* Way Vadhanasin <[email protected]>; Josh Spain <
> [email protected]>
> *Subject:* Re: ASACORE-2946 Notes and Scope of Changes
>
>
>
> Updated document attached. Let me know if corrections are needed.
>
>                   - Andrey
>
>
>
> On Wed, Jul 6, 2016 at 6:27 PM, Daniel Mihai <[email protected]>
> wrote:
>
> I think the next step is to send out the latest doc – either just to us
> here, or include Marcello & Arvind too – I have no preference.
>
>
>
> There is no template.
>
>
>
> Thanks!
>
>
>
> *From:* Andrey Krokhin [mailto:[email protected]]
> *Sent:* Wednesday, July 6, 2016 4:24 PM
>
>
> *To:* Daniel Mihai <[email protected]>
> *Cc:* Way Vadhanasin <[email protected]>; Josh Spain <
> [email protected]>
> *Subject:* Re: ASACORE-2946 Notes and Scope of Changes
>
>
>
> I have prepared an updated document that incorporates most of the points
> we discussed, and removed alternatives that we agreed not to implement.
>
> However, it doesn't have Dan's or Way's comments inside the document (edit
> history).
>
> Should I send you the updated document? As I understand it, eventually this
> needs to be sent to Marcello. Should I rewrite it according to a specific
> template?
>
> What are the next steps?
>
>
>
>
>
>
> *Andrey Krokhin, Software Engineer*
>
> Affinegy
>
> 1705 S. Capital of Texas Hwy, Ste. 310, Austin, TX, 78746
>
> 512.535.1700
>
> [email protected]   http://affinegy.com
>



-- 


*Andrey Krokhin, Software Engineer*

Affinegy

1705 S. Capital of Texas Hwy, Ste. 310, Austin, TX, 78746

512.535.1700

[email protected]   http://affinegy.com

Attachment: ASACORE-2946_Specs_v3.rtf
Description: RTF file

_______________________________________________
Allseen-core mailing list
[email protected]
https://lists.allseenalliance.org/mailman/listinfo/allseen-core
