I can only speak for what is happening on OSX.

The original issue (before the patches):
After reading all the data in the for(;;) loop, we would read a
SYS_CALL buffer, determine that POLLHUP was set, and throw out the
entire message when we set d->dead=1 and did a "goto take_a_break;".

Thus we were not getting any indication of an error; gmetad just
would not work correctly on OSX.


With the RC release with the patch:
A) As we go into the if (struct_poll.revents & POLLIN) and do a
SYS_CALL on 1023 bytes, we get back X bytes_read.
B) Then, doing an 'if' on POLLHUP, we find that POLLHUP is set and
do a 'break', which takes us out of the for(;;) loop to process
the XML data.  However, with this patch we get XML parser errors, and
the incomplete messages are thrown out.  (The warning shows up when
running in debug mode, and gmetad still does not work correctly on
OSX.)

To test the theory:
If we do another SYS_CALL for another 1023 bytes AFTER the check for
POLLHUP (and before the 'break;'), there is an additional Y bytes
read from the system socket buffer.  Thus, most of the time, we never
receive the entire message before we hit the POLLHUP break, and so we
lose the entire message.
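
For anyone who wants to try the same experiment outside of gmetad, here
is a rough sketch of it in C.  It is not taken from data_thread.c: the
function name extra_read_after_first is made up, and an AF_UNIX
socketpair stands in for the real TCP connection.  The writer sends the
whole message and closes right away, the reader does one read of up to
1023 bytes (the X bytes above), and then one extra read to see how many
Y bytes were still sitting in the socket buffer.

```c
#include <assert.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 1023  /* read size used in the description above */

/* Hypothetical experiment: returns Y, the number of bytes obtained by one
 * extra read() issued after the first read() of the poll cycle.
 * Returns -1 if the setup or the first read fails. */
static ssize_t extra_read_after_first(size_t msg_len)
{
    static char msg[4096];
    char buf[CHUNK];
    int sv[2];
    struct pollfd pfd;
    ssize_t x, y = -1;

    assert(msg_len <= sizeof(msg));
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    memset(msg, 'x', msg_len);
    if (write(sv[1], msg, msg_len) != (ssize_t)msg_len) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    close(sv[1]);  /* the peer hangs up while its data is still queued */

    pfd.fd = sv[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, 1000) == 1 && (pfd.revents & POLLIN)) {
        /* Whether POLLHUP is also set here depends on the platform. */
        x = read(sv[0], buf, CHUNK);      /* step A: X bytes */
        if (x >= 0)
            y = read(sv[0], buf, CHUNK);  /* the extra read: Y bytes */
    }
    close(sv[0]);
    return y;
}
```

If Y comes back greater than zero, a 'break' taken right after the first
read would have cut the message short.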

I have only done code inspection of the OSX kernel (haven't compiled
the kernel in debug), but it appears to set POLLHUP not when the
application is done reading (as this 'if' statement assumes), nor on
a lost connection as suggested in the standard, but at some other
time, well before we are done reading valid data off the socket
buffer.

Thus, at this point, I would not even attempt to test for POLLHUP on
OSX.
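
To make that concrete, here is a minimal sketch of a read loop that
never uses POLLHUP to decide it is finished and instead keeps reading
until read() returns 0.  This is only an illustration under my own
assumptions, not the actual gmetad patch: drain_until_eof and
demo_drain are hypothetical names, the socket is assumed to be
blocking, and the demo uses an AF_UNIX socketpair rather than a real
gmond connection.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read from fd until read() reports EOF (returns 0).
 * POLLHUP is deliberately ignored; only read() decides when we are done.
 * Returns the number of bytes stored in buf, or -1 on error, timeout,
 * or when buf fills up before EOF is seen. */
static ssize_t drain_until_eof(int fd, char *buf, size_t cap, int timeout_ms)
{
    size_t total = 0;

    for (;;) {
        struct pollfd pfd;
        size_t want;
        ssize_t n;
        int rc;

        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;

        rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (rc == 0 || (pfd.revents & POLLNVAL))
            return -1;              /* timed out or bad descriptor */

        want = cap - total;
        if (want > 1023)
            want = 1023;            /* same chunk size as described above */
        if (want == 0)
            return -1;              /* buffer full before EOF */

        n = read(fd, buf + total, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return (ssize_t)total;  /* EOF: the whole message is in buf */
        total += (size_t)n;
    }
}

/* Hypothetical demo: the writer sends len bytes and closes at once, so the
 * reader may see POLLHUP while data is still queued. */
static ssize_t demo_drain(size_t len)
{
    static char out[8192], in[8192];
    int sv[2];
    ssize_t w, got;

    assert(len < sizeof(in));
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    memset(out, 'x', len);
    w = write(sv[1], out, len);
    close(sv[1]);
    got = (w == (ssize_t)len) ? drain_until_eof(sv[0], in, sizeof(in), 1000) : -1;
    close(sv[0]);
    return got;
}
```

With this shape, demo_drain(3000) should hand back all 3000 bytes whether
or not the platform raises POLLHUP early, since the loop only stops on
bytes_read == 0.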

Does that explain what we are seeing on OSX?

Mike


On Sep 18, 2007, at 7:47 AM, Brad Nicholes wrote:

>
>>>> On 9/17/2007 at 9:23 PM, in message
> <[EMAIL PROTECTED]>, Mike Walker
> <[EMAIL PROTECTED]> wrote:
>> Bernard,
>>      No go.  This doesn't have the patch that I sent to work the OSX
>> issues in gmetad.  It does have the suggestion by Brad, of putting
>> an if statement in the read loop to test for the POLLHUP.  However,
>> from the previous beta (3.0.5 on ~ Sept 10th) testing cycle and my
>> email response back to the list after that beta, his suggestion
>> doesn't work on OSX.
>>
>> The reason is that the kernel is done reading off the socket and sets
>> the POLLHUP flag BEFORE gmetad finishes reading the entire buffer.
>> Thus, by breaking out of the read loop before the entire buffer is
>> read, we get an incomplete message, and thus the messages are
>> discarded by the XML parser.   The discarded messages result in
>> incorrect display in the ganglia PHP, by stating that machines are
>> down, gaps in monitoring, etc.
>>
>
>    I am sure that you are correct, so help me understand what is  
> going on here.  From what I could get from Google searches,  
> different platforms indicate an EOF in different ways.  Some set  
> just POLLIN and then indicate EOF by checking bytes_read == 0 after  
> a read().  In this case an revents of POLLHUP only indicates a  
> broken connection.  However other platforms send a POLLIN | POLLHUP  
> with the POLLHUP indicating the EOF.  In this way an extra read()  
> looking for bytes_read==0 would be unnecessary.  A final read() can  
> be done and EOF determined all in the same operation.  In the  
> data_thread.c code as it was originally, a POLLIN with  
> bytes_read==0 would have functioned as expected.  But a POLLIN |  
> POLLHUP with bytes_read==<anything> would have resulted in aborting  
> the connection altogether without processing any of the data that  
> had already been read.  By adding a check for POLLHUP within the  
> POLLIN handling, aborting the connection is avoided and the data is  
> processed normally.
>    Are you saying that even if POLLIN | POLLHUP is received and all  
> of the data is read from the socket, there is still more data on  
> the socket and a subsequent read must still be done until  
> bytes_read==0?  I guess the Curl guy just decided to treat POLLIN  
> == POLLHUP.  Does that seem safe for all platforms?  If my  
> assumptions are incorrect, which it looks like they are, then it  
> seems to me that going back to your original patch would be the  
> best solution.  Thoughts?
>
> Brad
>
>
> ---------------------------------------------------------------------- 
> ---
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2005.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> _______________________________________________
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general

