Garrick, good day.

Mon, Jun 23, 2008 at 04:57:39PM -0700, Garrick Staples wrote:
> But instead of adding a new client function, why not just change the existing
> client lib to simply close the socket on timeout? Any further attempts to use
> the socket would return an error (or we could even get really nifty and
> auto-reconnect).

I tried to make the API calls explicit: if a client wants to close the connection prematurely, it should do so with pbs_abort_connection(). Closing the socket on timeout makes sense, but it can break some consumers of the Torque API: think Maui. I looked over Maui's code (moab/MPBSI.c), and it appears to open an initial connection to the PBS server and then make multiple queries over it in some places. On the other hand, it will detect errors and will hopefully handle them properly, so the breakage may not happen.

Auto-reconnect will be fine if the performance overhead is acceptable. With lazy reconnect, you need to check whether the socket is closed, and reopen it if so, every time the client wants to write something. Another way is to reconnect immediately after the close, which eliminates the need for those checks but puts an additional burden on the Torque server: it will need to handle slightly more connections in situations where the client's sequence is

  1. open connection;
  2. issue request;
  3. disconnect.

Reconnecting immediately after a failure at step 2 is inefficient: the client will disconnect almost immediately, and the server will have been asked for one more connection unnecessarily. On the other hand, timeouts shouldn't happen often: if they do, then something should be done to prevent them in the first place.

Perhaps the most radical change to prevent timeouts is to make the Torque server's calls to the MOMs non-blocking and event-driven, but that is a lot of work. In principle, Torque already uses poll()/select(), so such functionality is already partly present, but as my case showed, connect() poses some trouble too.

I think some consensus should be reached prior to any modifications. I will be happy to work on this problem and test the result on our resources, but probably not for another two weeks: I am a bit busy now and don't want to put the production cluster into test mode just yet.

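To make the trade-off between the reconnect strategies concrete, here is a toy model in C. It is only a sketch and not Torque code: every name in it (sim_conn, sim_connect, sim_request, sim_disconnect, the RECONNECT_* policies) is hypothetical, and no real sockets are involved. It just counts how many connections the server has to accept, and how many "is the socket closed?" checks the client performs, under each policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policies, invented for this sketch (not part of any Torque API). */
enum reconnect_policy {
    RECONNECT_NONE,   /* close on timeout; later calls just fail            */
    RECONNECT_LAZY,   /* close on timeout; reopen before the next write     */
    RECONNECT_EAGER   /* close on timeout; reopen immediately afterwards    */
};

/* Simulated client connection; counters stand in for real server load. */
struct sim_conn {
    bool open;                     /* is the (simulated) socket open?        */
    enum reconnect_policy policy;
    int server_accepts;            /* connections the server had to accept   */
    int open_checks;               /* closed-socket checks done before writes */
};

/* Open the simulated socket; each open costs the server one accept. */
static void sim_open(struct sim_conn *c)
{
    c->open = true;
    c->server_accepts++;
}

/* Initial connect: reset counters and open the first connection. */
void sim_connect(struct sim_conn *c, enum reconnect_policy policy)
{
    assert(c != NULL);
    c->policy = policy;
    c->server_accepts = 0;
    c->open_checks = 0;
    sim_open(c);
}

/* Issue one request. `times_out` simulates a timeout on this request.
 * Returns 0 on success, -1 on timeout or when the socket is closed. */
int sim_request(struct sim_conn *c, bool times_out)
{
    assert(c != NULL);

    if (c->policy == RECONNECT_LAZY) {
        /* Lazy reconnect pays for a check on every write. */
        c->open_checks++;
        if (!c->open)
            sim_open(c);
    }

    if (!c->open)
        return -1;              /* closed by an earlier timeout */

    if (times_out) {
        c->open = false;        /* close the socket on timeout */
        if (c->policy == RECONNECT_EAGER)
            sim_open(c);        /* reopen right away, needed or not */
        return -1;
    }
    return 0;
}

/* Client-initiated disconnect. */
void sim_disconnect(struct sim_conn *c)
{
    assert(c != NULL);
    c->open = false;
}
```

In this model, the sequence "connect, request that times out, disconnect" costs the server two accepts under RECONNECT_EAGER but only one under RECONNECT_LAZY, at the price of one check per write. Under RECONNECT_NONE a client that keeps using the connection after a timeout simply gets errors, which is the behaviour a consumer like Maui would have to cope with.
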
Since the two major consumers of the Torque API are Torque itself and Maui, and I am talking to both communities, could other people comment on the possible solutions?

Thanks a lot!
-- 
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"
