[
https://issues.apache.org/jira/browse/S4-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198035#comment-13198035
]
Matthieu Morel commented on S4-7:
---------------------------------
I had a look at the patch and unfortunately I cannot apply it in its current
state...
Did all tests pass in your environment, including those from s4-core? (for
instance, tcp.partition.queue_size is required but missing in s4-core's
org.apache.s4.deploy.s4.properties)
The regression tests did not pass for me, but since I would like to integrate
the patch, I took a deeper look and found some issues.
* There seems to be a synchronization issue in the TCPEmitter. On most runs on
my machine, the test never finishes because some messages are not sent. You
should be able to reproduce this quite easily.
* relying on flag variables and synchronized blocks or methods in TCPEmitter
looks quite tedious (and error-prone). There might be a cleaner way, for
instance using a separate queue for in-transit messages.
* removing an arbitrary element from a blocking queue is generally inefficient
(see the javadoc for BlockingQueue), but TCPEmitter does this systematically
* it is probably better to add timeouts in the tests instead of waiting
indefinitely upon thread joins: if there is a failure somewhere, the test
never ends (actually, a timeout can be set on the test itself with a JUnit
annotation)
* in UDPEmitter, sending should not throw a runtime exception but rather log an
error and return false (not your code, but this way a retry will be performed,
otherwise the system just gets broken).
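
To illustrate the point about test timeouts, here is a minimal sketch of a
bounded join using only the JDK. The class name BoundedJoinSketch and the
helper joinWithTimeout are made up for this example and are not part of S4.

```java
import java.util.concurrent.CountDownLatch;

public class BoundedJoinSketch {

    /**
     * Waits at most timeoutMillis for the worker to finish.
     * Returns true if it finished, false if it is still running.
     * timeoutMillis must be > 0, because join(0) waits forever.
     */
    public static boolean joinWithTimeout(Thread worker, long timeoutMillis)
            throws InterruptedException {
        worker.join(timeoutMillis);
        return !worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulates a sender thread that never completes (e.g. lost messages).
        CountDownLatch never = new CountDownLatch(1);
        Thread stuck = new Thread(() -> {
            try {
                never.await();
            } catch (InterruptedException e) {
                // interrupted by the caller: just exit
            }
        });
        stuck.setDaemon(true);
        stuck.start();

        // A test would fail here instead of hanging.
        boolean finished = joinWithTimeout(stuck, 200);
        System.out.println("finished=" + finished);
        stuck.interrupt();
    }
}
```

With JUnit 4, annotating the test method with @Test(timeout = 5000) gives a
similar bound for the whole test without touching the joins.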
I uploaded some fixes to a new branch named S4-7 in the git repository. But
the most important issue remains: please have a look at the synchronization
issue in the TCPEmitter code (you should base your updates on the S4-7 branch).
Thanks!
(We might have to take a different approach in the end. It's not a trivial
issue!)
> Netty to tolerate network glitches and connection loss
> ------------------------------------------------------
>
> Key: S4-7
> URL: https://issues.apache.org/jira/browse/S4-7
> Project: Apache S4
> Issue Type: Bug
> Reporter: Leo Neumeyer
> Assignee: Karthik Kambatla
> Fix For: 0.5
>
> Attachments: S4-7-Robust-TCPEmitter-asynchronous-ordered.patch,
> s4-7.patch, s4-7.patch
>
>
> NettyEmitter connects to different partitions and creates channels over which
> it communicates to other listeners.
> It suffers from the following issues --
> 1. If the underlying topology changes, the channels and the associated
> connections are not updated.
> 2. If a connection gets disconnected, it stays disconnected.
> 3. If for any reason, a connection can't be made, send() drops the message to
> be sent.
> The solution is to -
> 1. Maintain a bounded messageQueue for each destination partition - if a
> connection does not exist, the message should be queued.
> 2. Maintain a map of the channel used for each destination partition - update
> this map on changes to topology, or on send() in case of disconnections.
> 3. Every time a (re-)connection is made, send the queued messages first.
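
The three steps proposed in the issue description could be sketched roughly as
follows. This is only an illustration under assumptions: QueueingEmitterSketch
and its Channel interface are hypothetical stand-ins, not the S4 or Netty
types, and the locking is deliberately coarse.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueingEmitterSketch {

    /** Hypothetical stand-in for a connection; write returns false on failure. */
    public interface Channel {
        boolean write(byte[] message);
    }

    private final int capacity;
    // Step 1: one bounded queue per destination partition.
    private final Map<Integer, Queue<byte[]>> pending = new ConcurrentHashMap<>();
    // Step 2: the channel currently used for each destination partition.
    private final Map<Integer, Channel> channels = new ConcurrentHashMap<>();

    public QueueingEmitterSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Queues the message, then sends whatever can be sent. False if the queue is full. */
    public synchronized boolean send(int partition, byte[] message) {
        Queue<byte[]> queue = pending.computeIfAbsent(
                partition, p -> new LinkedBlockingQueue<byte[]>(capacity));
        if (!queue.offer(message)) {
            return false;
        }
        drain(partition);
        return true;
    }

    /** Step 3: on every (re-)connection, the queued messages go out first. */
    public synchronized void onConnected(int partition, Channel channel) {
        channels.put(partition, channel);
        drain(partition);
    }

    public synchronized void onDisconnected(int partition) {
        channels.remove(partition);
    }

    // Sends from the head of the queue only, preserving order. A message is
    // removed only after a successful write, so a failed write keeps it queued.
    private void drain(int partition) {
        Channel channel = channels.get(partition);
        Queue<byte[]> queue = pending.get(partition);
        if (channel == null || queue == null) {
            return;
        }
        byte[] head;
        while ((head = queue.peek()) != null) {
            if (!channel.write(head)) {
                channels.remove(partition); // treat as disconnected
                return;
            }
            queue.poll();
        }
    }
}
```

Because messages are only ever taken from the head of the queue, this sketch
never removes an arbitrary element from the middle of a queue.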