On Dec 15, 2013, at 12:08 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> On Dec 15, 2013, at 14:36 , Ralph Castain <r...@open-mpi.org> wrote:
> 
>> Sure you can - just find the ompi_proc_t without setting the thread lock 
>> since (as you point out) it is already being held.
> 
> This is hardly thread-safe for all the cases where this function is not 
> called while owning the lock (aka all the modules).

Fair enough - we can use the other option.
> 
>> Or we could pass the ompi_proc_t into the call, if necessary. Either way 
>> should work.
> 
> There is already the orte_process_name_t on the call, as it is the only 
> information needed.
> 
> To be extremely clear, there is no need to alter the current code, it is 
> correct by all means. The first thing we do after the modex_exchange is going 
> in the second stage of the ompi_proc setup (ompi_proc_complete_init) where 
> the first call is for retrieving the proc_hostname from the database. The 
> code I removed was 1) totally superfluous and 2) completely broken on 
> multi-threaded scenarios.

Not true, George - look more closely at the code. We only retrieve the hostname 
if the number of procs is low. Otherwise, we do *not* retrieve it until we do a 
modex_recv, and thus the debug is now broken at scale. This was required for 
scalable launch, which is something I know is important to you as well.

Modifying the API isn't a big deal, so why the fuss? Let's just change it and 
get the debug working again.


> 
>  George.
> 
> 
>> 
>> On Dec 15, 2013, at 11:30 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> 
>>> The current code is correct. If the goal is to continue to retrieve the 
>>> proc_name in a call that is asking for something else, I don’t see how you 
>>> can remove this circular dependency between setting and using the 
>>> ompi_procs.
>>> 
>>> George.
>>> 
>>> On Dec 15, 2013, at 13:52 , Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>>> Okay - then let's correct the code rather than lose the debug. I'll fix 
>>>> that problem.
>>>> 
>>>> Thanks
>>>> Ralph
>>>> 
>>>> On Dec 15, 2013, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>> 
>>>>> I understand your reasons but the code as it was in the trunk is not 
>>>>> correct. In most of the cases when you reach one of the ompi_rte_db_fetch 
>>>>> calls, you are setting up an ompi_proc … which means you own the 
>>>>> ompi_proc_lock mutex. As the ompi_rte_db_fetch was calling back into the 
>>>>> proc infrastructure to find a proc, it was deadlocking on acquiring the 
>>>>> ompi_proc_lock mutex as locking this mutex it is the first thing 
>>>>> ompi_proc_find is doing.
>>>>> 
>>>>> A quick grep indicates that most places where the proc_hostname is used 
>>>>> are capable of handling NULL, so avoiding a deadlock in exchange for few 
>>>>> hostname replaced by NULL in the output seemed like a acceptable approach 
>>>>> to me.
>>>>> 
>>>>> George.
>>>>> 
>>>>> On Dec 15, 2013, at 12:18 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>> 
>>>>>> This actually creates a bit of a problem. The reason we did this was 
>>>>>> because the OMPI-layer "show-help" calls want to report the hostname of 
>>>>>> the proc. Since we don't retrieve that info by default, the show-help 
>>>>>> calls all fail due to a NULL pointer.
>>>>>> 
>>>>>> Nathan tried wrapping all the show-help calls with a modex-fetch of 
>>>>>> hostname, but that isn't a good solution as the fetch might fail 
>>>>>> depending on the problem we are trying to report.
>>>>>> 
>>>>>> We also noted that the modex recv's current implemented all fetched the 
>>>>>> complete RTE-level info whenever any info was requested for that proc. 
>>>>>> So the fetch of the hostname was a very low cost operation.
>>>>>> 
>>>>>> So we decided to always ensure we load the hostname info if it hasn't 
>>>>>> already been done, thus keeping the show-help operations functional.
>>>>>> 
>>>>>> Make sense? Or do you have an alternative method?
>>>>>> Ralph
>>>>>> 
>>>>>> 
>>>>>> On Dec 15, 2013, at 8:54 AM, svn-commit-mai...@open-mpi.org wrote:
>>>>>> 
>>>>>>> Author: bosilca (George Bosilca)
>>>>>>> Date: 2013-12-15 11:54:01 EST (Sun, 15 Dec 2013)
>>>>>>> New Revision: 29917
>>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29917
>>>>>>> 
>>>>>>> Log:
>>>>>>> Don't be greedy, just do what we asked for.
>>>>>>> 
>>>>>>> Text files modified: 
>>>>>>> trunk/ompi/mca/rte/orte/rte_orte_module.c |    15 ---------------       
>>>>>>>                   
>>>>>>> 1 files changed, 0 insertions(+), 15 deletions(-)
>>>>>>> 
>>>>>>> Modified: trunk/ompi/mca/rte/orte/rte_orte_module.c
>>>>>>> ==============================================================================
>>>>>>> --- trunk/ompi/mca/rte/orte/rte_orte_module.c   Sun Dec 15 11:49:27 
>>>>>>> 2013        (r29916)
>>>>>>> +++ trunk/ompi/mca/rte/orte/rte_orte_module.c   2013-12-15 11:54:01 EST 
>>>>>>> (Sun, 15 Dec 2013)      (r29917)
>>>>>>> @@ -153,11 +153,6 @@
>>>>>>> if (OPAL_SUCCESS != (rc = opal_db.fetch((opal_identifier_t*)nm, key, 
>>>>>>> data, type))) {
>>>>>>>   return rc;
>>>>>>> }
>>>>>>> -    /* update the hostname */
>>>>>>> -    proct = ompi_proc_find(nm);
>>>>>>> -    if (NULL == proct->proc_hostname) {
>>>>>>> -        opal_db.fetch_pointer((opal_identifier_t*)nm, 
>>>>>>> ORTE_DB_HOSTNAME, (void**)&proct->proc_hostname, OPAL_STRING);
>>>>>>> -    }
>>>>>>> return OMPI_SUCCESS;
>>>>>>> }
>>>>>>> 
>>>>>>> @@ -171,11 +166,6 @@
>>>>>>> if (OPAL_SUCCESS != (rc = opal_db.fetch_pointer((opal_identifier_t*)nm, 
>>>>>>> key, data, type))) {
>>>>>>>   return rc;
>>>>>>> }
>>>>>>> -    /* update the hostname */
>>>>>>> -    proct = ompi_proc_find(nm);
>>>>>>> -    if (NULL == proct->proc_hostname) {
>>>>>>> -        opal_db.fetch_pointer((opal_identifier_t*)nm, 
>>>>>>> ORTE_DB_HOSTNAME, (void**)&proct->proc_hostname, OPAL_STRING);
>>>>>>> -    }
>>>>>>> return OMPI_SUCCESS;
>>>>>>> }
>>>>>>> 
>>>>>>> @@ -191,11 +181,6 @@
>>>>>>>                                                OPAL_SCOPE_GLOBAL, key, 
>>>>>>> kvs))) {
>>>>>>>   return rc;
>>>>>>> }
>>>>>>> -    /* update the hostname */
>>>>>>> -    proct = ompi_proc_find(nm);
>>>>>>> -    if (NULL == proct->proc_hostname) {
>>>>>>> -        opal_db.fetch_pointer((opal_identifier_t*)nm, 
>>>>>>> ORTE_DB_HOSTNAME, (void**)&proct->proc_hostname, OPAL_STRING);
>>>>>>> -    }
>>>>>>> return OMPI_SUCCESS;
>>>>>>> }
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> svn mailing list
>>>>>>> s...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>>>>> 
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> 
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> 
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

Reply via email to