On Dec 15, 2013, at 12:08 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> On Dec 15, 2013, at 14:36 , Ralph Castain <r...@open-mpi.org> wrote: > >> Sure you can - just find the ompi_proc_t without setting the thread lock >> since (as you point out) it is already being held. > > This is hardly thread-safe for all the cases where this function is not > called while owning the lock (aka all the modules). Fair enough - we can use the other option. > >> Or we could pass the ompi_proc_t into the call, if necessary. Either way >> should work. > > There is already the orte_process_name_t on the call, as it is the only > information needed. > > To be extremely clear, there is no need to alter the current code, it is > correct by all means. The first thing we do after the modex_exchange is going > in the second stage of the ompi_proc setup (ompi_proc_complete_init) where > the first call is for retrieving the proc_hostname from the database. The > code I removed was 1) totally superfluous and 2) completely broken on > multi-threaded scenarios. Not true, George - look more closely at the code. We only retrieve the hostname if the number of procs is low. Otherwise, we do *not* retrieve it until we do a modex_recv, and thus the debug is now broken at scale. This was required for scalable launch, which is something I know is important to you as well. Modifying the API isn't a big deal, so why the fuss? Let's just change it and get the debug working again. > > George. > > >> >> On Dec 15, 2013, at 11:30 AM, George Bosilca <bosi...@icl.utk.edu> wrote: >> >>> The current code is correct. If the goal is to continue to retrieve the >>> proc_name in a call that is asking for something else, I don’t see how you >>> can remove this circular dependency between setting and using the >>> ompi_procs. >>> >>> George. >>> >>> On Dec 15, 2013, at 13:52 , Ralph Castain <r...@open-mpi.org> wrote: >>> >>>> Okay - then let's correct the code rather than lose the debug. I'll fix >>>> that problem. >>>> >>>> Thanks >>>> Ralph >>>> >>>> On Dec 15, 2013, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote: >>>> >>>>> I understand your reasons but the code as it was in the trunk is not >>>>> correct. In most of the cases when you reach one of the ompi_rte_db_fetch >>>>> calls, you are setting up an ompi_proc … which means you own the >>>>> ompi_proc_lock mutex. As the ompi_rte_db_fetch was calling back into the >>>>> proc infrastructure to find a proc, it was deadlocking on acquiring the >>>>> ompi_proc_lock mutex as locking this mutex it is the first thing >>>>> ompi_proc_find is doing. >>>>> >>>>> A quick grep indicates that most places where the proc_hostname is used >>>>> are capable of handling NULL, so avoiding a deadlock in exchange for few >>>>> hostname replaced by NULL in the output seemed like a acceptable approach >>>>> to me. >>>>> >>>>> George. >>>>> >>>>> On Dec 15, 2013, at 12:18 , Ralph Castain <r...@open-mpi.org> wrote: >>>>> >>>>>> This actually creates a bit of a problem. The reason we did this was >>>>>> because the OMPI-layer "show-help" calls want to report the hostname of >>>>>> the proc. Since we don't retrieve that info by default, the show-help >>>>>> calls all fail due to a NULL pointer. >>>>>> >>>>>> Nathan tried wrapping all the show-help calls with a modex-fetch of >>>>>> hostname, but that isn't a good solution as the fetch might fail >>>>>> depending on the problem we are trying to report. >>>>>> >>>>>> We also noted that the modex recv's current implemented all fetched the >>>>>> complete RTE-level info whenever any info was requested for that proc. >>>>>> So the fetch of the hostname was a very low cost operation. >>>>>> >>>>>> So we decided to always ensure we load the hostname info if it hasn't >>>>>> already been done, thus keeping the show-help operations functional. >>>>>> >>>>>> Make sense? Or do you have an alternative method? >>>>>> Ralph >>>>>> >>>>>> >>>>>> On Dec 15, 2013, at 8:54 AM, svn-commit-mai...@open-mpi.org wrote: >>>>>> >>>>>>> Author: bosilca (George Bosilca) >>>>>>> Date: 2013-12-15 11:54:01 EST (Sun, 15 Dec 2013) >>>>>>> New Revision: 29917 >>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/29917 >>>>>>> >>>>>>> Log: >>>>>>> Don't be greedy, just do what we asked for. >>>>>>> >>>>>>> Text files modified: >>>>>>> trunk/ompi/mca/rte/orte/rte_orte_module.c | 15 --------------- >>>>>>> >>>>>>> 1 files changed, 0 insertions(+), 15 deletions(-) >>>>>>> >>>>>>> Modified: trunk/ompi/mca/rte/orte/rte_orte_module.c >>>>>>> ============================================================================== >>>>>>> --- trunk/ompi/mca/rte/orte/rte_orte_module.c Sun Dec 15 11:49:27 >>>>>>> 2013 (r29916) >>>>>>> +++ trunk/ompi/mca/rte/orte/rte_orte_module.c 2013-12-15 11:54:01 EST >>>>>>> (Sun, 15 Dec 2013) (r29917) >>>>>>> @@ -153,11 +153,6 @@ >>>>>>> if (OPAL_SUCCESS != (rc = opal_db.fetch((opal_identifier_t*)nm, key, >>>>>>> data, type))) { >>>>>>> return rc; >>>>>>> } >>>>>>> - /* update the hostname */ >>>>>>> - proct = ompi_proc_find(nm); >>>>>>> - if (NULL == proct->proc_hostname) { >>>>>>> - opal_db.fetch_pointer((opal_identifier_t*)nm, >>>>>>> ORTE_DB_HOSTNAME, (void**)&proct->proc_hostname, OPAL_STRING); >>>>>>> - } >>>>>>> return OMPI_SUCCESS; >>>>>>> } >>>>>>> >>>>>>> @@ -171,11 +166,6 @@ >>>>>>> if (OPAL_SUCCESS != (rc = opal_db.fetch_pointer((opal_identifier_t*)nm, >>>>>>> key, data, type))) { >>>>>>> return rc; >>>>>>> } >>>>>>> - /* update the hostname */ >>>>>>> - proct = ompi_proc_find(nm); >>>>>>> - if (NULL == proct->proc_hostname) { >>>>>>> - opal_db.fetch_pointer((opal_identifier_t*)nm, >>>>>>> ORTE_DB_HOSTNAME, (void**)&proct->proc_hostname, OPAL_STRING); >>>>>>> - } >>>>>>> return OMPI_SUCCESS; >>>>>>> } >>>>>>> >>>>>>> @@ -191,11 +181,6 @@ >>>>>>> OPAL_SCOPE_GLOBAL, key, >>>>>>> kvs))) { >>>>>>> return rc; >>>>>>> } >>>>>>> - /* update the hostname */ >>>>>>> - proct = ompi_proc_find(nm); >>>>>>> - if (NULL == proct->proc_hostname) { >>>>>>> - opal_db.fetch_pointer((opal_identifier_t*)nm, >>>>>>> ORTE_DB_HOSTNAME, (void**)&proct->proc_hostname, OPAL_STRING); >>>>>>> - } >>>>>>> return OMPI_SUCCESS; >>>>>>> } >>>>>>> >>>>>>> _______________________________________________ >>>>>>> svn mailing list >>>>>>> s...@open-mpi.org >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn >>>>>> >>>>>> _______________________________________________ >>>>>> devel mailing list >>>>>> de...@open-mpi.org >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>>> >>>>> _______________________________________________ >>>>> devel mailing list >>>>> de...@open-mpi.org >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>> >>>> _______________________________________________ >>>> devel mailing list >>>> de...@open-mpi.org >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >>> _______________________________________________ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> _______________________________________________ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > _______________________________________________ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel