Re: [Patch 09/12] tabled: drop double prefixing

2010-04-18 Thread Jeff Garzik

On 04/18/2010 12:42 AM, Pete Zaitcev wrote:

On Fedora 14, the following is seen in syslog:

Apr 17 19:58:52 niphredil tabled: tabled: connecting to site
  hitlain.zaitcev.lan:8083: No route to host
Apr 17 19:58:56 niphredil tabled: tabled: DB_ENV->rep_elect:WARNING:
  nvotes (1) is sub-majority with nsites (2)

Drop the extra prefix, it only wastes screen space.

Signed-off-by: Pete Zaitcev

---
  lib/tdb.c |7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)


applied 9-12


--
To unsubscribe from this list: send the line "unsubscribe hail-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Patch 01/12] CLD: fix crash in retransmissions

2010-04-18 Thread Pete Zaitcev
On Sun, 18 Apr 2010 23:46:07 -0400
Jeff Garzik wrote:

>   [tabled patch 1/1] update Makefile.am
>   or
>   [cld patch 1/1] libcldc: add nncld API, the new new CLD API

That looks simple enough.

-- Pete


Re: [Patch 05/12] Chunk: Use CLD timers

2010-04-18 Thread Jeff Garzik

On 04/18/2010 12:41 AM, Pete Zaitcev wrote:

Since ncld uses CLD timers and thus we had to have them in libcldc,
we may as well use them in Chunk. This gives us an automatic importation
of bugfixes.

Signed-off-by: Pete Zaitcev

---
  server/chunkd.h |   27 ++
  server/cldu.c   |4 +-
  server/util.c   |   84 +-
  3 files changed, 17 insertions(+), 98 deletions(-)


applied 5-8





Re: [Patch 07/12] Chunk: retry initial CLD session open

2010-04-18 Thread Jeff Garzik

On 04/18/2010 12:41 AM, Pete Zaitcev wrote:

This was an error in the conversion to ncld. In the cldc code, we
kick the state machine and the natural retries do the rest. Any
failures occur there. But in ncld the original kick can fail too.

Five retries give the CLD server time to reboot. If it stays down, then
clients refuse to start. This may or may not be a bad idea.
We may yet change the retries to be infinite, but for now it's
better if builds terminate somehow in case of unexpected problems.

Signed-off-by: Pete Zaitcev

---
  server/cldu.c |   12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

commit 44cdb98d2cceb2f4e081db2ee38ec60f1c1a8d8d
Author: Master
Date:   Sat Apr 17 19:50:06 2010 -0600

 Retry the initial connection to the CLD server.


In the short term, this is acceptable.

In the medium term, this is a protocol detail that should be handled 
somewhere in libcldc.  We want all applications to behave the same way, 
including the method by which they attempt to contact a master.


Because there could be multiple CLD servers, you cannot think of retries 
in the context of a single server.  This is crucial WRT work on #replica 
branch, but it is also somewhat relevant to #master, because we might 
have multiple servers listed in SRV records as fallbacks from which to 
choose.


You don't want each application implementing this logic, because we want 
to enforce some level of predictability in master-seeking behavior, and 
in making decisions about when contact attempts for -all- servers 
should cease, as opposed to contact attempts for a -single- server.  You 
don't want it to take 30 minutes to try all servers in a cluster, 
retrying a number of times on server A, then moving on to server B, etc.
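One way to picture the policy described above is a round-robin search with a
single overall bound: each server gets one attempt per round, so a dead
server A cannot consume all the retries before server B is tried. This is only
a sketch under stated assumptions. find_master(), try_connect_fn, and the demo
hook are hypothetical names for illustration and are not part of libcldc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical connect hook: returns 0 on success, nonzero on failure. */
typedef int (*try_connect_fn)(const char *host, void *ctx);

/*
 * Try servers in round-robin order, one attempt per server per round.
 * Stops at the first success or after max_rounds full rounds, so the
 * bound applies to the whole cluster rather than to a single server.
 * Returns the index of the server that answered, or -1 if none did.
 */
int find_master(const char *const *hosts, size_t n_hosts,
		unsigned max_rounds, try_connect_fn try_connect, void *ctx)
{
	unsigned round;
	size_t i;

	assert(try_connect != NULL);
	for (round = 0; round < max_rounds; round++) {
		for (i = 0; i < n_hosts; i++) {
			if (try_connect(hosts[i], ctx) == 0)
				return (int)i;
		}
		/* A real client would sleep or back off between rounds. */
	}
	return -1;
}

/* Demo hook: counts calls and succeeds only on call number succeed_on. */
struct demo_ctx {
	int calls;
	int succeed_on;
};

int demo_connect(const char *host, void *ctx)
{
	struct demo_ctx *d = ctx;

	(void)host;	/* the demo ignores the host name */
	return ++d->calls == d->succeed_on ? 0 : -1;
}
```

Keeping the loop in one shared function is the point of the argument above:
every application then gives up on the whole cluster at the same moment,
instead of each one choosing its own per-server retry count.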


Jeff





Re: [Patch 05/12] Chunk: Use CLD timers

2010-04-18 Thread Jeff Garzik

On 04/18/2010 12:41 AM, Pete Zaitcev wrote:

Since ncld uses CLD timers and thus we had to have them in libcldc,
we may as well use them in Chunk. This gives us an automatic importation
of bugfixes.

Signed-off-by: Pete Zaitcev

---
  server/chunkd.h |   27 ++
  server/cldu.c   |4 +-
  server/util.c   |   84 +-
  3 files changed, 17 insertions(+), 98 deletions(-)


I think the need for a "libhail" is becoming clear...




Re: [Patch 01/12] CLD: fix crash in retransmissions

2010-04-18 Thread Jeff Garzik

On 04/18/2010 12:37 AM, Pete Zaitcev wrote:

For the longest time I was plagued by (very infrequent) crashes like this:

Program received signal SIGSEGV, Segmentation fault.
sess_retry_output (timer=0x92070c0) at session.c:532
532 if (!next_retry || (op->next_retry < next_retry))
(gdb) info threads
* 1 Thread 0xb72f96c0 (LWP 22417)  sess_retry_output (timer=0x92070c0) at session.c:532
(gdb) where
#0  sess_retry_output (timer=0x92070c0) at session.c:532
#1  session_retry (timer=0x92070c0) at session.c:565
#2  0x08049aee in cld_timers_run (tlist=0x8056630) at ../lib/libtimer.c:95
#3  0x0804e9cc in main_loop (argc=5, argv=0xbff70bd4) at server.c:983
#4  main (argc=5, argv=0xbff70bd4) at server.c:1138

The crash happens because op is NULL. As it turned out, this happens
if a packet retransmit and a session expiration occur simultaneously
(in the same pass of timers_run). The scenario is:
  - timers_run collects expired timers onto an exec list
  - timers_run expires the session
  - two timer_del calls are made, but one of the timers is on the exec
    list already, so that call is ineffective
  - the session is freed, which zeroes ->data in lists (later op)
  - timers_run continues along the exec list, invokes the retransmission
    callback, and that crashes with NULL op.

The proposed solution is to rework the timers_run, again. But this
time, we'll make it simpler by observing that timers are ordered by
expiration time. Therefore, we can pull next timer off the list,
expire it, and loop until expiration time is greater than the current
time. No execution list is kept. The integrity of the main list
is assured by never walking it and always referring to the head
anew at each iteration.
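The loop described above can be sketched roughly as follows. This is not the
code from lib/libtimer.c. The struct layout and every sk_* name are
assumptions made for illustration. The demo callbacks model the crash
scenario: the session-expiry callback deletes the retransmit timer, and
because the run loop re-reads the list head on each pass, the deleted timer
is never invoked.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical timer record; the real struct may differ. */
struct sk_timer {
	struct sk_timer *next;
	time_t expires;
	void (*cb)(struct sk_timer *);
	void *userdata;
	int queued;
};

struct sk_timer_list {
	struct sk_timer *head;	/* sorted by expiration, soonest first */
};

/* Insert a timer, keeping the list ordered by expiration time. */
void sk_timer_add(struct sk_timer_list *tl, struct sk_timer *t, time_t expires)
{
	struct sk_timer **pp = &tl->head;

	assert(!t->queued);
	t->expires = expires;
	while (*pp && (*pp)->expires <= expires)
		pp = &(*pp)->next;
	t->next = *pp;
	*pp = t;
	t->queued = 1;
}

/* Unlink a timer if it is queued; safe to call from inside a callback. */
void sk_timer_del(struct sk_timer_list *tl, struct sk_timer *t)
{
	struct sk_timer **pp;

	if (!t->queued)
		return;
	for (pp = &tl->head; *pp; pp = &(*pp)->next) {
		if (*pp == t) {
			*pp = t->next;
			break;
		}
	}
	t->next = NULL;
	t->queued = 0;
}

/* Run expired timers; returns the number of callbacks invoked. */
int sk_timers_run(struct sk_timer_list *tl, time_t now)
{
	struct sk_timer *t;
	int fired = 0;

	/* No exec list: look at the head anew on every iteration. */
	while ((t = tl->head) != NULL && t->expires <= now) {
		tl->head = t->next;
		t->next = NULL;
		t->queued = 0;
		t->cb(t);	/* may add or delete other timers */
		fired++;
	}
	return fired;
}

/* Demo: a session whose expiry cancels its own retransmit timer. */
struct sk_session {
	struct sk_timer_list *tl;
	struct sk_timer retry;
	int retry_fired;
};

void sk_session_expire(struct sk_timer *t)
{
	struct sk_session *s = t->userdata;

	sk_timer_del(s->tl, &s->retry);
}

void sk_retry_cb(struct sk_timer *t)
{
	struct sk_session *s = t->userdata;

	s->retry_fired++;
}
```

One caveat with this shape: a callback that re-arms itself with an expiration
at or before the current time would keep the loop spinning, so callers of a
loop like this need to re-arm strictly in the future.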

This patch appears to fix the problem and stands up to use that
crashed the old code.

Signed-off-by: Pete Zaitcev

---
  include/cld_common.h |   10 ++
  lib/libtimer.c   |   41 -
  2 files changed, 26 insertions(+), 25 deletions(-)


applied 1-4

Note that I change the email subject line prefix (which is normally 
copied by automated tools directly into git) from "CLD" to "libcldc", 
when committing to git.  I am trying to follow (and encourage others to) 
the kernel's method of using a prefix to indicate the subsystem or 
section within the current git repo to which a change applies.


To be fully friendly with automated tools, an ideal Project Hail subject 
line might read


[cld patch 1/1] update Makefile.am
or
[tabled patch 1/1] update Makefile.am
or
[cld patch 1/1] libcldc: add nncld API, the new new CLD API

Not a big deal, just noting what is most friendly to the git automated 
tools, when I'm importing each Hail patch.
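To make the suggested subject shape concrete, here is a small sketch that
splits a line of the form "[<repo> patch N/M] <summary>" into its parts. It
is purely illustrative: parse_patch_subject() and struct patch_subject are
hypothetical names, and this is not how git itself processes subject lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Parts of a subject such as "[cld patch 1/1] libcldc: add ...". */
struct patch_subject {
	char repo[32];		/* e.g. "cld" or "tabled" */
	int seq;		/* N in N/M */
	int total;		/* M in N/M */
	const char *summary;	/* points into the caller's line */
};

/*
 * Returns 0 on success, -1 if the line does not match
 * "[<repo> patch N/M] <summary>".
 */
int parse_patch_subject(const char *line, struct patch_subject *out)
{
	int consumed = 0;

	assert(line != NULL && out != NULL);
	/* %n records how much was consumed; it stays 0 if "]" is missing. */
	if (sscanf(line, "[%31s patch %d/%d] %n",
		   out->repo, &out->seq, &out->total, &consumed) != 3 ||
	    consumed == 0)
		return -1;
	out->summary = line + consumed;
	return 0;
}
```

With this sketch, a subject in the older "[Patch 01/12] CLD: ..." style is
rejected, because the repo name is missing from the bracketed prefix.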


Jeff



