On Tue, 18 May 2010, Rolf vandeVaart wrote:

I think we are almost saying the same thing. But to be sure, I will restate. 
The call to opal_pointer_array_add() can return either an index (which I assume 
is a positive
integer, maybe also 0?) or OPAL_ERR_OUT_OF_RESOURCE (which is a -2) if it 
cannot malloc anymore space in the table.  So, I guess I agree that the 
original code was wrong as
it never would have detected the error since OMPI_ERROR != 
OPAL_ERR_OUT_OF_RESOURCE.  (-1 != -2)

Since we are overloading the return value, it seems like the only thing we 
could do is something like this:

if (new_group->grp_f_to_c_index < 0)
   error();


Yes, that looks like the right thing to do.

But that does not follow the SOS logic as the key is that we want to compare to 
OMPI_SUCCESS (I think).  Also, for what it is worth, the setting of the 
grp_f_to_c_index
happens in the group constructor, so we cannot get at the return value of 
opal_pointer_array_add() except by looking at the grp_f_to_c value after the 
object is
constructed.  I am not sure the correct way to handle this.


The only reason we replace the OMPI_ERROR checks with OMPI_SUCCESS is because when SOS logs an error in its internal data structures it returns a new reference to the error (an encoded error-code which SOS can use to locate the error). So, OMPI_ERROR is not OMPI_ERROR anymore but an SOS encoded OMPI_ERROR. We could always wrap the code to be checked with a call to extract its native error code and then perform the check like

  if (0 > OPAL_SOS_GET_ERROR_CODE(new_group->grp_f_to_c_index)) {
     error();
  }

In a lot of places (where functions return a boolean OMPI_SUCCESS or OMPI_ERROR), it was perfectly legit to just switch the way it's done but for the opal_pointer_array_add() and mca_base_param_* functions which return an index or an error, the above transformation seems to be the way to go.

I'll send in a patch with these changes.

Abhishek

Rolf

On 05/18/10 14:02, Jeff Squyres wrote:

Looks like the comparison to OMPI_ERROR worked by accident -- just because it happened to have a value of -1.
The *_f_to_c_index values are the return value from a call to 
opal_point_array_add().  This value will either be non-negative or -1.  -1 
indicates a failure.  It's not an *_
ERR_* code -- it's a -1 index value.  So the comparisons should really have 
been to -1 in the first place.

Rolf / Abhishek -- can you fix?  Did this happen in other *_f_to_c_index member 
variable comparisons elsewhere?



On May 18, 2010, at 1:25 PM, Rolf vandeVaart wrote:



I am getting SEGVs while running the IMB-MPI1 tests.  I believe the
problem has to do with changes made to the group_init.c file last
night.  The error occurs in the call to MPI_Comm_split.

 There were 4 changes in the file that look like this:
OLD:
if (OMPI_ERROR == new_group->grp_f_to_c_index)

NEW:
if (OMPI_SUCCESS != new_group->grp_f_to_c_index)

If I change it back, things work.  I understand the idea of changing the
logic, but maybe that does not apply in this case?    When running with
np=2, the value of new_group->grp_f_to_c_index is 4, thereby not
equaling OMPI_SUCCESS and we end up in an error condition which results
in a null pointer later on.

Am I the only that has run into this?

Rolf


_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel








Reply via email to