Hi Philip,

OK, I see there is a subtle issue here in the interplay between IKC and POE. You alluded to this in your previous email when you said "keeping a request is hard". I thought about this a bit and I think I see your point.
It seems like this is a problem which may be naturally solved by having a tiny bit of control over a POE queue. We could solve it like this:

1. On an error, push the event back onto the queue, so that it gets retried (note we are putting it back in order, so we can't just post it again; it has to go back to the front of the line so it gets retried in sequence).

2. We'd need an event which halts/unhalts a POE queue's progress. This would be triggered by the success of our reconnect. Really, we just need a way to make a queue pause until some other thing (in this case reconnection) happens. The event is a "user defined" event, so we'd need a primitive to let us signal to POE to restart the queue.

Are there things in POE that let me do 1-2 above? Or perhaps I'm missing a far simpler solution that doesn't require these primitives? (I've put a rough sketch of the idea, using an ordinary buffering session rather than new primitives, after the quoted message below.)

I think the race you mention would be solved by simply having two places where the queue can be halted. One is the existing mechanism in the monitor's detection routine. The other is when we try to send and get the dead-socket error. Detecting the dead socket at the point of sending is unavoidable, since this is how TCP/IP detects dead sockets, and it is also the lowest-latency method. Using the monitor to detect problems (via ping polling) is useful too, but seems most valuable when there are very few IKC requests flowing through the system, so the two methods, polling and detect-on-send, are complementary.

As you mention, testing is hard, but this is always the case with highly concurrent things. In this case, I think testing could be done simply by turning off the monitor polling, to see that the "detect on send" path works.

Zero

On Sat, Mar 20, 2010 at 8:56 AM, Philip Gwyn <phi...@awale.qc.ca> wrote:
>
> On 19-Mar-2010 roger bush wrote:
> > Is it similarly possible to reconnect on a failed call (at the cost
> > of some latency)? Or are there other side effects from this that
> > are unwanted?
>
> If by "failed call" you mean "IKC request to a disconnected remote kernel"
> then currently, no. But it would be relatively trivial to add a monitor
> for that.
>
> > This seems to be the standard way to do things over TCP/IP, since we
> > discover sockets are dead via an attempt to read them.
>
> With IKC, you can know beforehand when something has disconnected. But
> there is a race condition between posting a request and getting the
> monitored disconnect event. In fact, the "monitor for failed request"
> would also have a race condition unless I can be very clever about
> detecting failure where the disconnect happens after the Responder has
> done its work.
>
> This might not be clear. Here's a list of the steps a request goes
> through before "hitting the wire":
>
> 1- Your session
> 2- IKC::Responder
> 3- IKC::Channel
> 4- Wheel::ReadWrite
> 5- Driver::SysRW
>
> If the disconnect happens before step 2, then adding a monitor for it is
> trivial. If the disconnect happens during 3, 4 or 5, then it will be
> trickier to detect. Especially writing test cases for it.
>
> Thinking further, if you don't trust an IKC connection to stay open, you
> need some sort of 'acknowledge' from the other side. And that is currently
> beyond the scope of IKC.
>
> -Philip
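Here is the rough sketch I mentioned above. It is only an illustration of the pause/requeue idea, not an existing POE or IKC feature: the "gatekeeper" alias, the send_request/link_down/link_up/requeue_front event names, and the shape of the request hash are all made up for this example, and I am assuming the usual Responder-style `$kernel->post(IKC => 'post', ...)` call for the actual send. The point is that an ordinary session holding a heap array can act as the halted/unhalted queue, so no new POE primitive is strictly required.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POE;   # exports KERNEL, HEAP, ARG0, POE::Kernel, POE::Session

    # Hypothetical gatekeeper session: buffers outgoing IKC requests while
    # the link is down and flushes them, in order, once reconnect succeeds.
    POE::Session->create(
        inline_states => {
            _start => sub {
                my ($kernel, $heap) = @_[KERNEL, HEAP];
                $kernel->alias_set('gatekeeper');
                $heap->{paused}  = 0;    # 1 while the link is known dead
                $heap->{pending} = [];   # requests held back, in order
            },

            # All outgoing requests come through here instead of going to
            # IKC directly.
            send_request => sub {
                my ($kernel, $heap, $req) = @_[KERNEL, HEAP, ARG0];
                if ($heap->{paused}) {
                    push @{ $heap->{pending} }, $req;    # hold it, in order
                }
                else {
                    $kernel->post(IKC => 'post', $req->{spec}, $req->{args});
                }
            },

            # Called from the monitor's disconnect handler OR from the
            # send-time error path; either one halts the flow (point 2).
            link_down => sub {
                $_[HEAP]->{paused} = 1;
            },

            # Push a failed request back to the front of the line so the
            # original ordering is preserved (point 1).
            requeue_front => sub {
                my ($heap, $req) = @_[HEAP, ARG0];
                unshift @{ $heap->{pending} }, $req;
            },

            # Reconnect succeeded: unhalt and flush everything in order.
            link_up => sub {
                my ($kernel, $heap) = @_[KERNEL, HEAP];
                $heap->{paused} = 0;
                while (my $req = shift @{ $heap->{pending} }) {
                    $kernel->post(IKC => 'post', $req->{spec}, $req->{args});
                }
            },
        },
    );

    # In a real program the IKC client/responder sessions would be set up
    # before this, then:
    POE::Kernel->run();

A caller would then do something like `$kernel->post(gatekeeper => send_request => { spec => 'poe://Remote/session/event', args => $args })` instead of posting to IKC directly, and the monitor or the send-error path would post link_down / requeue_front / link_up as appropriate. Whether that covers the race you describe at steps 3-5 is exactly the open question.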