Re: [Patch 09/12] tabled: drop double prefixing
On 04/18/2010 12:42 AM, Pete Zaitcev wrote:
> On Fedora 14, the following is seen in syslog:
>
> Apr 17 19:58:52 niphredil tabled: tabled: connecting to site hitlain.zaitcev.lan:8083: No route to host
> Apr 17 19:58:56 niphredil tabled: tabled: DB_ENV->rep_elect:WARNING: nvotes (1) is sub-majority with nsites (2)
>
> Drop the extra prefix, it only wastes screen space.
>
> Signed-off-by: Pete Zaitcev
>
> ---
>  lib/tdb.c |7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)

applied 9-12

--
To unsubscribe from this list: send the line "unsubscribe hail-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch 01/12] CLD: fix crash in retransmissions
On Sun, 18 Apr 2010 23:46:07 -0400 Jeff Garzik wrote:

> [tabled patch 1/1] update Makefile.am
> or
> [cld patch 1/1] libcldc: add nncld API, the new new CLD API

That looks simple enough.

-- Pete
Re: [Patch 05/12] Chunk: Use CLD timers
On 04/18/2010 12:41 AM, Pete Zaitcev wrote:
> Since ncld uses CLD timers and thus we had to have them in libcldc,
> we may as well use them in Chunk. This gives us an automatic
> importation of bugfixes.
>
> Signed-off-by: Pete Zaitcev
>
> ---
>  server/chunkd.h | 27 ++
>  server/cldu.c |4 +-
>  server/util.c | 84 +-
>  3 files changed, 17 insertions(+), 98 deletions(-)

applied 5-8
Re: [Patch 07/12] Chunk: retry initial CLD session open
On 04/18/2010 12:41 AM, Pete Zaitcev wrote:
> This was an error in the conversion to ncld. In the cldc code, we kick
> the state machine and the natural retries do the rest. Any failures
> occur there. But in ncld the original kick can fail too.
>
> Five retries give the CLD server time to reboot. If it's down, then
> clients refuse to start. This may be a bad idea, or maybe not. We may
> yet change the retries to be infinite, but for now it's better if
> builds terminate somehow in case of unexpected problems.
>
> Signed-off-by: Pete Zaitcev
>
> ---
>  server/cldu.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)

commit 44cdb98d2cceb2f4e081db2ee38ec60f1c1a8d8d
Author: Master
Date: Sat Apr 17 19:50:06 2010 -0600

    Retry the initial connection to the CLD server.

In the short term, this is acceptable.

In the medium term, this is a protocol detail that should be handled
somewhere in libcldc. We want all applications to behave the same way,
including the method by which they attempt to contact a master.

Because there could be multiple CLD servers, you cannot think of retries
in the context of a single server. This is crucial WRT work on the
#replica branch, but it is also somewhat relevant to #master, because we
might have multiple servers listed in SRV records as fallbacks from
which to choose.

You don't want each application implementing this logic, because we want
to enforce some level of predictability in master-seeking behavior, and
in making decisions about when contact attempts for -all- servers should
cease, as opposed to contact attempts for a -single- server.

You don't want it to take 30 minutes to try all servers in a cluster,
retrying a number of times on server A, then moving on to server B, etc.

Jeff
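[Editorial sketch] The distinction Jeff draws, between giving up on a single server and giving up on all servers, can be illustrated roughly as below. This is only a sketch under stated assumptions: `sk_open_fn`, `sk_open_any`, `sk_fake_open`, `sk_demo_rounds` and the host names are hypothetical and are not part of libcldc or ncld. It tries every listed server once per round and bounds the number of rounds (five here, echoing the five retries in the patch), instead of exhausting all retries on server A before moving to server B.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: try to open a session to one host.
 * Returns 0 on success, nonzero on failure. Not a libcldc API. */
typedef int (*sk_open_fn)(const char *host, void *ctx);

/* Try each server once per round, for at most max_rounds rounds.
 * Returns the index of the server that accepted, or -1 if every
 * attempt on every server failed. */
int sk_open_any(const char *const *hosts, size_t n_hosts,
                int max_rounds, sk_open_fn try_open, void *ctx)
{
    assert(hosts != NULL && try_open != NULL);

    for (int round = 0; round < max_rounds; round++) {
        for (size_t i = 0; i < n_hosts; i++) {
            if (try_open(hosts[i], ctx) == 0)
                return (int)i;
        }
        /* A real client would sleep or back off here between rounds. */
    }
    return -1;
}

/* Fake transport for the demo: succeeds only on the Nth call overall. */
struct sk_fake {
    int calls;
    int succeed_on;
};

static int sk_fake_open(const char *host, void *ctx)
{
    struct sk_fake *f = ctx;

    (void)host;
    return (++f->calls == f->succeed_on) ? 0 : -1;
}

/* Demo: two made-up servers, five rounds, so ten attempts in total. */
int sk_demo_rounds(int succeed_on)
{
    static const char *const hosts[] = { "cld-a.example", "cld-b.example" };
    struct sk_fake f = { 0, succeed_on };

    return sk_open_any(hosts, 2, 5, sk_fake_open, &f);
}
```

With this shape, the "when do we stop for all servers" decision lives in one place (`max_rounds`), which is the kind of policy Jeff suggests belongs in the library rather than in each application.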
Re: [Patch 05/12] Chunk: Use CLD timers
On 04/18/2010 12:41 AM, Pete Zaitcev wrote:
> Since ncld uses CLD timers and thus we had to have them in libcldc,
> we may as well use them in Chunk. This gives us an automatic
> importation of bugfixes.
>
> Signed-off-by: Pete Zaitcev
>
> ---
>  server/chunkd.h | 27 ++
>  server/cldu.c |4 +-
>  server/util.c | 84 +-
>  3 files changed, 17 insertions(+), 98 deletions(-)

I think the need for a "libhail" is becoming clear...
Re: [Patch 01/12] CLD: fix crash in retransmissions
On 04/18/2010 12:37 AM, Pete Zaitcev wrote:
> For the longest time I was plagued by (very infrequent) crashes like this:
>
> Program received signal SIGSEGV, Segmentation fault.
> sess_retry_output (timer=0x92070c0) at session.c:532
> 532 if (!next_retry || (op->next_retry< next_retry))
> (gdb) info threads
> * 1 Thread 0xb72f96c0 (LWP 22417) sess_retry_output (timer=0x92070c0) at session.c:532
> (gdb) where
> #0 sess_retry_output (timer=0x92070c0) at session.c:532
> #1 session_retry (timer=0x92070c0) at session.c:565
> #2 0x08049aee in cld_timers_run (tlist=0x8056630) at ../lib/libtimer.c:95
> #3 0x0804e9cc in main_loop (argc=5, argv=0xbff70bd4) at server.c:983
> #4 main (argc=5, argv=0xbff70bd4) at server.c:1138
>
> The crash happens because op is NULL. As it turned out, this happens
> if a packet retransmit and a session expiration occur simultaneously
> (in the same pass of timers_run). The scenario is:
>
> - timers_run collects expired timers at exec list
> - timers_run expires session
> - two timer_del are called, but one of them is on exec list already,
>   so it's ineffective
> - session is freed, this zeroes ->data in lists (later op)
> - timers_run continues along the exec list, invokes the retransmission
>   callback, and that crashes with NULL op.
>
> The proposed solution is to rework timers_run, again. But this time,
> we'll make it simpler by observing that timers are ordered by
> expiration time. Therefore, we can pull the next timer off the list,
> expire it, and loop until the expiration time is greater than the
> current time. No execution list is kept. The integrity of the main
> list is assured by never walking it and always referring to the head
> anew at each iteration.
>
> This patch appears to fix the problem and stands up to use that
> crashed the old code.
>
> Signed-off-by: Pete Zaitcev
>
> ---
>  include/cld_common.h | 10 ++
>  lib/libtimer.c | 41 -
>  2 files changed, 26 insertions(+), 25 deletions(-)

applied 1-4

Note that I change the email subject line prefix (which is normally
copied by automated tools directly into git) from "CLD" to "libcldc"
when committing to git.

I am trying to follow (and encourage others to follow) the kernel's
method of using a prefix to indicate the subsystem or section within the
current git repo to which a change applies.

To be fully friendly with automated tools, an ideal Project Hail subject
line might read

    [cld patch 1/1] update Makefile.am
or
    [tabled patch 1/1] update Makefile.am
or
    [cld patch 1/1] libcldc: add nncld API, the new new CLD API

Not a big deal, just noting what is most friendly to the git automated
tools, when I'm importing each Hail patch.

Jeff
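[Editorial sketch] The reworked timers_run loop that Pete describes (keep the list ordered by expiration, take the head anew on each iteration, keep no separate exec list) can be sketched as below. This is an illustration only and not the code in lib/libtimer.c: `sk_timer`, `sk_timer_list`, `sk_timer_add`, `sk_timer_del`, `sk_timers_run` and the demo session are all hypothetical names. The demo arms a session-expiry timer and a retransmit timer for the same instant; the expiry callback deletes the retransmit timer, and because the loop re-reads the head each time, that deletion takes effect and the retransmit callback never runs against a dead session.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical timer types for illustration; not the libcldc ones. */
struct sk_timer {
    struct sk_timer *next;
    time_t expires;
    int armed;
    void (*cb)(struct sk_timer *);
    void *userdata;
};

struct sk_timer_list {
    struct sk_timer *head;      /* sorted by ascending expiration */
};

/* Insert in expiration order; equal times keep insertion order. */
void sk_timer_add(struct sk_timer_list *tl, struct sk_timer *t, time_t expires)
{
    struct sk_timer **pp = &tl->head;

    assert(!t->armed);
    t->expires = expires;
    while (*pp && (*pp)->expires <= expires)
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;
    t->armed = 1;
}

/* Unlink a timer; always effective because there is only one list. */
void sk_timer_del(struct sk_timer_list *tl, struct sk_timer *t)
{
    struct sk_timer **pp;

    if (!t->armed)
        return;
    for (pp = &tl->head; *pp; pp = &(*pp)->next) {
        if (*pp == t) {
            *pp = t->next;
            break;
        }
    }
    t->next = NULL;
    t->armed = 0;
}

/* Pop and fire the head until the head expires in the future.
 * The list is never walked while callbacks run. */
void sk_timers_run(struct sk_timer_list *tl, time_t now)
{
    struct sk_timer *t;

    while ((t = tl->head) != NULL && t->expires <= now) {
        tl->head = t->next;
        t->next = NULL;
        t->armed = 0;
        t->cb(t);       /* may add or delete other timers safely */
    }
}

/* Demo session with a retransmit timer and an expiry timer. */
struct sk_session {
    struct sk_timer_list *tl;
    struct sk_timer retry;
    struct sk_timer expire;
    int retry_fired;
    int expired;
};

static void sk_retry_cb(struct sk_timer *t)
{
    struct sk_session *s = t->userdata;

    s->retry_fired++;
}

static void sk_expire_cb(struct sk_timer *t)
{
    struct sk_session *s = t->userdata;

    sk_timer_del(s->tl, &s->retry);   /* session teardown cancels retries */
    s->expired = 1;
}

/* Returns how many times the retransmit callback fired. */
int sk_demo_timers(void)
{
    struct sk_timer_list tl = { NULL };
    struct sk_session s = { 0 };

    s.tl = &tl;
    s.retry.cb = sk_retry_cb;
    s.retry.userdata = &s;
    s.expire.cb = sk_expire_cb;
    s.expire.userdata = &s;

    sk_timer_add(&tl, &s.expire, 100);
    sk_timer_add(&tl, &s.retry, 100);   /* same instant as the expiry */
    sk_timers_run(&tl, 100);

    assert(s.expired == 1);
    return s.retry_fired;
}
```

The linear insert is chosen for brevity; the point of the sketch is only the run loop, which trades a private exec list for a fresh look at the head on every iteration.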