Re: [Owfs-developers] understanding owserver TCP connection behavior

Paul Davis Thu, 11 Jan 2007 09:33:43 -0800

Trying to get caught up here, sorry for the delayed reply...

You may well be right. My comment is based on direct experience with as few as 2 clients going after the same sensors. Let me describe my little network to you. I have 10 devices total spread across 3 hub ports. There are also the DS2401 used for wind direction in my original Dallas weather station, but they are not on the net all the time. There is usually a mixture of cached and uncached access, with uncached being used to retrieve the current counter from the anemometer and the current sensor list to retrieve the wind direction.

In 'normal' operation it seems to run pretty well. However, in a more strenuous test, I set my script to request all uncached reads. As soon as I start up a second process to read the sensors, I begin to get errors. That is to say, the calls to owserver time out (payload_len -1). The last time I tried it a few minutes ago, owserver bit the dust, leaving only the parent process and a zombie. The clients were still connected to the owserver sockets, and blocking on receives. owserver had to be restarted. So...
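A note on that -1, as a minimal sketch: assuming the payload length travels as a 32-bit signed integer in network byte order (an assumption about the wire format, not something confirmed in this thread), -1 is four 0xFF bytes, which strace prints in octal as \377.

```python
import struct

# Assumption: a 32-bit signed field in network (big-endian) byte order.
raw = b"\xff\xff\xff\xff"  # four bytes of 255, i.e. "\377\377\377\377" in octal

(as_signed,) = struct.unpack(">i", raw)    # signed interpretation
(as_unsigned,) = struct.unpack(">I", raw)  # unsigned interpretation

print(as_signed, as_unsigned)
```

Read as signed, the four bytes are -1; read as unsigned, they are 4294967295.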

I spent some time looking at this. owserver stayed up a while longer this time. It spawned 404(!) child processes, and was still going. These child processes were not terminating, but were stuck in a loop. Running strace shows some interesting behavior:


3413  --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
3413  getpid()                          = 3413
3413  rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
3413  gettimeofday({1168535069, 927295}, NULL) = 0
3413  rt_sigprocmask(SIG_SETMASK, NULL, [RTMIN], 8) = 0

3413 write(5, "\234\277\374\0\0\0\0\0\0\0\0\0\0\0\226\34\234\277\373\0"..., 148) = 148

3413  rt_sigprocmask(SIG_SETMASK, NULL, [RTMIN], 8) = 0
3413  rt_sigsuspend([] <unfinished ...>
3413  --- SIGRTMIN (Real-time signal 0) @ 0 (0) ---
3413  <... rt_sigsuspend resumed> )     = 32
3413  sigreturn()                       = ? (mask now [RTMIN])
.
.
bunch of nanosleeps...
.
.
3413  gettimeofday({1168535071, 478213}, NULL) = 0

3413 writev(156, [{"\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 24}], 1) = 24

3413  gettimeofday({1168535071, 480470}, NULL) = 0
3413  nanosleep({0, 100000000},  <unfinished ...>
.
couple of successful writes like above, then the problem:
.
3413  <... gettimeofday resumed> {1168535074, 528564}, NULL) = 0

3413 writev(156, [{"\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 24}], 1) = -1 EPIPE (Broken pipe)

3413  --- SIGPIPE (Broken pipe) @ 0 (0) ---
3413  gettimeofday({1168535074, 539566}, NULL) = 0
.
more nanosleeps..
.

3413 writev(156, [{"\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 24}], 1) = -1 EPIPE (Broken pipe)

3413  --- SIGPIPE (Broken pipe) @ 0 (0) ---
3413  gettimeofday({1168535097, 59694}, NULL) = 0
3413  nanosleep({0, 100000000},  <unfinished ...>

and so on, ad infinitum. I can't tell what fd 156 is or was, and the four 255s (the \377 bytes) don't mean anything to me. lsof shows "can't identify protocol". I'm guessing it was a client connection which was closed by the client when owserver still had something it wanted to write, but I'm not sure. In any case, it looks like there is no SIGPIPE handler and the error isn't caught elsewhere, so it doesn't gracefully die.
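To illustrate the behavior I would expect instead, here is a minimal sketch in Python (not owserver's actual code; the function name and parameters are hypothetical). It sends a message every 100 ms, like the writev/nanosleep pattern in the trace, but treats a broken pipe as final and returns instead of retrying. Python ignores SIGPIPE by default, so the failed write surfaces as BrokenPipeError (EPIPE).

```python
import socket
import time


def send_until_peer_gone(sock, message, interval=0.1, max_sends=50):
    """Send `message` every `interval` seconds; stop once the peer is gone.

    Hypothetical helper for illustration. Returns the number of
    successful sends.
    """
    sent = 0
    for _ in range(max_sends):
        try:
            sock.sendall(message)
        except (BrokenPipeError, ConnectionResetError):
            # EPIPE / ECONNRESET: the client closed its end. Retrying can
            # never succeed, so release the socket and stop the loop
            # rather than sleeping and writing again.
            sock.close()
            break
        sent += 1
        time.sleep(interval)
    return sent


if __name__ == "__main__":
    server_end, client_end = socket.socketpair()
    client_end.close()  # simulate a client that hung up early
    print(send_until_peer_gone(server_end, b"\x00" * 24))
```

The point of the design is only that the write error ends the loop, so a worker in this situation would exit instead of lingering.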

It was the timeouts that made me make my original comment. But these hanging child processes and the large number of them concern me. I'm surprised my little NSLU2 didn't raise its little ARM and surrender! Let me know if there's some other info you'd like me to collect. I don't have a build environment set up for the NSLU2, so if someone else doesn't build the packages, I can't readily retest.


Thanks!

Paul


On Jan 10, 2007, at 11:50 PM, Paul Alfille wrote:

On 1/9/07, ziggy <[EMAIL PROTECTED]> wrote:
2. I don't believe this is a significant issue. While there may be no hard limits on the number of concurrent connections now, the practical limit is 1. The 1-wire is single access and can not be shared. Trying to use multiple connections simultaneously winds up making them all slow, with frequent timeouts.
This may be a reflection of your style of use. I can envision high frequency control processes, and low frequency logging and display/monitoring processes all attacking the same 1-wire bus. Particularly if some of the processes can use cached results.
Paul
-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Owfs-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/owfs-developers
