On Wednesday 04:26 PM 7/15/2009, Tom Jackson wrote:
> Your SF bug report says that you put in a 300 millisecond delay.
> Where? Even if you think that such a fix is not good, it would be
> helpful to at least know what works.
I've done a massive amount of debugging on this that isn't included in
the bug report, for reasons of brevity. But I did state that the
workaround is to "insert a delay before the data starts being read by
ns_https{post,get}", or in other words, immediately before the loops
commented with "Read the content" in ns_httpspost/ns_httpsget:
----- 8< ----------------------------------------------------------
#
# Read the content.
#
while 1 {
    set buf [_ns_https_read $timeout $rfd $length]
    append page $buf
    [...]
----- 8< ----------------------------------------------------------
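The "after X" statement, where X is the delay in milliseconds (e.g.
"after 300" for the 300 millisecond delay mentioned in the bug report),
would go immediately before this while loop.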
> You also talk about truncation, but then the truncation stops if the
> received data goes above 81000.
> It might be a good idea to narrow down when the bug appears (what byte
> value) and when it goes away again. This might suggest something.
I tried that, and it was suggestive but ultimately not much help in
debugging the problem. For one thing, the byte values vary by platform,
and aren't even consistent on the same platform (i.e., a given byte size
might work or fail depending on the run). It's a timing issue, as I said
in the bug report. However, if you're curious, this is an analysis of the
errors at various byte values taken from our internal bug report for this
issue:
----- 8< ----------------------------------------------------------
The error shows up consistently (99.9+% of the time) at 74000 through
81000 bytes (counting by 1000), so I've been using the range of
70000-83000 for testing. Also, some specific testing showed that the
errors actually kick in reliably at 73729 bytes; note that 73728=8192*9.
And in all the succeeding sizes until the errors stop again, the socket
returns exactly 73728 bytes of data regardless of the request size. This
particular run of consistent errors stops at 81884 bytes (though there are
a few rare successes in that range), which doesn't have any suggestive
powers of 2.
So it seems clear that the buffer size affects the reliability in at least
two ways: 1) larger sizes are more likely to fail, and 2) certain
multiples of 8192 are particularly significant in that they're the last
working size before a long stretch of failing sizes (all of which return
that last working size). In addition to 73728=8192*9, I verified that this
happens at 90112=8192*11 and 106496=8192*13, and that it does NOT happen
at 81920=8192*10 or 57344=8192*7. So it would appear that odd multiples of
8192 where the multiplier is >= 9 are the ones that typically start
lengthy failure sequences.
----- 8< ----------------------------------------------------------
Note that this analysis only applies to RHEL4 (the byte-size analysis for
Mac OS X is similar, but the multipliers and trigger levels are different,
though I didn't record the actual values). And even on RHEL4 these aren't
the only values that fail: other smaller and larger buffer sizes will fail
too, just not as consistently.
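For anyone who wants to double-check the arithmetic in the analysis above,
here is a minimal sketch in Python. It is not part of the AOLserver code,
and the names BLOCK, multiplier, starts_failures and no_failures are made
up purely for illustration. It only confirms which of the quoted sizes are
exact multiples of 8192 and what the multipliers are; it says nothing
about why the reads fail.

```python
# Sanity check of the byte values quoted in the RHEL4 analysis.
# All names here are illustrative; none come from AOLserver.
BLOCK = 8192  # the 8192-byte unit that keeps showing up in the analysis

# Sizes reported as the last working size before a long run of failures.
starts_failures = [73728, 90112, 106496]
# Exact multiples of 8192 reported NOT to start a failure run.
no_failures = [81920, 57344]


def multiplier(size, block=BLOCK):
    """Return size / block, raising if size is not an exact multiple."""
    quotient, remainder = divmod(size, block)
    if remainder:
        raise ValueError(f"{size} is not a multiple of {block}")
    return quotient


for size in starts_failures + no_failures:
    print(f"{size} = {BLOCK} * {multiplier(size)}")

# The size at which the consistent errors stop is not a multiple of 8192:
# divmod gives the whole multiplier and the leftover bytes.
print(divmod(81884, BLOCK))
```

The sizes that start a failure run come out with multipliers 9, 11 and 13,
and the two that don't with 10 and 7, which matches the "odd multiples
where the multiplier is >= 9" observation.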
- John