Re: [OMPI devel] orte question
Ralph, Looking good so far. I did notice that ompi-ps always seems to have an exit code of 243. Is that on purpose? Greg On Jul 25, 2011, at 4:44 PM, Ralph Castain wrote: > r24944 - let me know how it works! > > > On Jul 25, 2011, at 1:01 PM, Greg Watson wrote: > >> That would probably be more intuitive. >> >> Thanks, >> Greg >> >> On Jul 25, 2011, at 2:28 PM, Ralph Castain wrote: >> >>> job 0 is mpirun and its daemons - I can have it ignore that job as I doubt >>> users care :-) >>> >>> On Jul 25, 2011, at 12:25 PM, Greg Watson wrote: >>> >>>> Ralph, >>>> >>>> The output format looks good, but I'm not sure it's quite correct. If I >>>> run the mpirun command, I see the following: >>>> >>>> mpirun:47520:num nodes:1:num jobs:2 >>>> jobid:0:state:RUNNING:slots:0:num procs:0 >>>> jobid:1:state:RUNNING:slots:1:num procs:4 >>>> process:x:rank:0:pid:47522:node:greg.local:state:SYNC REGISTERED >>>> process:x:rank:1:pid:47523:node:greg.local:state:SYNC REGISTERED >>>> process:x:rank:2:pid:47524:node:greg.local:state:SYNC REGISTERED >>>> process:x:rank:3:pid:47525:node:greg.local:state:SYNC REGISTERED >>>> >>>> Seems to indicate there are two jobs, but one of them has 0 procs. Is that >>>> expected? Not a huge problem, since I can just ignore the job with 0 procs. >>>> >>>> Greg >>>> >>>> >>>> On Jul 23, 2011, at 6:24 PM, Ralph Castain wrote: >>>> >>>>> Okay, you should have it in r24929. Use: >>>>> >>>>> orte-ps --parseable >>>>> >>>>> to get the new output. >>>>> >>>>> >>>>> On Jul 23, 2011, at 11:43 AM, Ralph Castain wrote: >>>>> >>>>>> Gar - have to eat my words a bit. The jobid requested by orte-ps is just >>>>>> the "local" jobid - i.e., it is expecting you to provide a number from >>>>>> 0-N, as I described below (copied here): >>>>>> >>>>>>> A jobid of 1 indicates the primary application, 2 and above would >>>>>>> specify comm_spawned jobs. 
>>>>>> >>>>>> Not providing the jobid at all corresponds to wildcard and returns the >>>>>> status of all jobs under that mpirun. >>>>>> >>>>>> To specify which mpirun you want info on, you use the --pid option. It >>>>>> is this option that isn't working properly - orte-ps returns info from >>>>>> all mpiruns and doesn't check to provide only data from the given pid. >>>>>> >>>>>> I'll fix that part, and implement the parsable output. >>>>>> >>>>>> >>>>>> On Jul 22, 2011, at 8:55 PM, Ralph Castain wrote: >>>>>> >>>>>>> >>>>>>> On Jul 22, 2011, at 3:57 PM, Greg Watson wrote: >>>>>>> >>>>>>>> Hi Ralph, >>>>>>>> >>>>>>>> I'd like three things :-) >>>>>>>> >>>>>>>> a) A --report-jobid option that prints the jobid on the first line in >>>>>>>> a form that can be passed to the -jobid option on ompi-ps. Probably >>>>>>>> tagging it in the output if -tag-output is enabled (e.g. >>>>>>>> jobid:) would be a good idea. >>>>>>>> >>>>>>>> b) The orte-ps command output to use the same jobid format. >>>>>>> >>>>>>> I started looking at the above, and found that orte-ps is just plain >>>>>>> wrong in the way it handles jobid. The jobid consists of two fields: a >>>>>>> 16-bit number indicating the mpirun, and a 16-bit number indicating the >>>>>>> job within that mpirun. Unfortunately, orte-ps sends a data request to >>>>>>> every mpirun out there instead of only to the one corresponding to that >>>>>>> jobid. >>>>>>> >>>>>>> What we probably should do is have you indicate the mpirun of interest >>>>>>> via the -pid option, and then let jobid tell us which job you want >>>>>>> within that mpirun. A jobid of 1 indicates the primary applicatio
Re: [OMPI devel] orte question
That would probably be more intuitive. Thanks, Greg On Jul 25, 2011, at 2:28 PM, Ralph Castain wrote: > job 0 is mpirun and its daemons - I can have it ignore that job as I doubt > users care :-) > > On Jul 25, 2011, at 12:25 PM, Greg Watson wrote: > >> Ralph, >> >> The output format looks good, but I'm not sure it's quite correct. If I run >> the mpirun command, I see the following: >> >> mpirun:47520:num nodes:1:num jobs:2 >> jobid:0:state:RUNNING:slots:0:num procs:0 >> jobid:1:state:RUNNING:slots:1:num procs:4 >> process:x:rank:0:pid:47522:node:greg.local:state:SYNC REGISTERED >> process:x:rank:1:pid:47523:node:greg.local:state:SYNC REGISTERED >> process:x:rank:2:pid:47524:node:greg.local:state:SYNC REGISTERED >> process:x:rank:3:pid:47525:node:greg.local:state:SYNC REGISTERED >> >> Seems to indicate there are two jobs, but one of them has 0 procs. Is that >> expected? Not a huge problem, since I can just ignore the job with 0 procs. >> >> Greg >> >> >> On Jul 23, 2011, at 6:24 PM, Ralph Castain wrote: >> >>> Okay, you should have it in r24929. Use: >>> >>> orte-ps --parseable >>> >>> to get the new output. >>> >>> >>> On Jul 23, 2011, at 11:43 AM, Ralph Castain wrote: >>> >>>> Gar - have to eat my words a bit. The jobid requested by orte-ps is just >>>> the "local" jobid - i.e., it is expecting you to provide a number from >>>> 0-N, as I described below (copied here): >>>> >>>>> A jobid of 1 indicates the primary application, 2 and above would specify >>>>> comm_spawned jobs. >>>> >>>> Not providing the jobid at all corresponds to wildcard and returns the >>>> status of all jobs under that mpirun. >>>> >>>> To specify which mpirun you want info on, you use the --pid option. It is >>>> this option that isn't working properly - orte-ps returns info from all >>>> mpiruns and doesn't check to provide only data from the given pid. >>>> >>>> I'll fix that part, and implement the parsable output. 
>>>> >>>> >>>> On Jul 22, 2011, at 8:55 PM, Ralph Castain wrote: >>>> >>>>> >>>>> On Jul 22, 2011, at 3:57 PM, Greg Watson wrote: >>>>> >>>>>> Hi Ralph, >>>>>> >>>>>> I'd like three things :-) >>>>>> >>>>>> a) A --report-jobid option that prints the jobid on the first line in a >>>>>> form that can be passed to the -jobid option on ompi-ps. Probably >>>>>> tagging it in the output if -tag-output is enabled (e.g. jobid:) >>>>>> would be a good idea. >>>>>> >>>>>> b) The orte-ps command output to use the same jobid format. >>>>> >>>>> I started looking at the above, and found that orte-ps is just plain >>>>> wrong in the way it handles jobid. The jobid consists of two fields: a >>>>> 16-bit number indicating the mpirun, and a 16-bit number indicating the >>>>> job within that mpirun. Unfortunately, orte-ps sends a data request to >>>>> every mpirun out there instead of only to the one corresponding to that >>>>> jobid. >>>>> >>>>> What we probably should do is have you indicate the mpirun of interest >>>>> via the -pid option, and then let jobid tell us which job you want within >>>>> that mpirun. A jobid of 1 indicates the primary application, 2 and above >>>>> would specify comm_spawned jobs. A jobid of -1 would return the status of >>>>> all jobs under that mpirun. >>>>> >>>>> If multiple mpiruns are being reported, then the "jobid" in the report >>>>> should again be the "local" jobid within that mpirun. >>>>> >>>>> After all, you don't really care what the orte-internal 16-bit identifier >>>>> is for that mpirun. >>>>> >>>>>> >>>>>> c) A more easily parsable output format from ompi-ps. It doesn't need to >>>>>> be a full blown XML format, just something like the following would >>>>>> suffice: >>>>>> >>>>>> jobid:719585280:state:Running:slots:1:num procs:4 >>>>>> process_name:./x:rank:0:pid:3082:
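Ralph's description above pins down the jobid layout: two 16-bit fields packed into one 32-bit value, one identifying the mpirun and one identifying the job within that mpirun. A minimal sketch of splitting and rebuilding such a value; which half holds which field is an assumption here (mpirun id in the high half), chosen because Greg's sample jobid 719585280 then decodes to an exact 16-bit boundary:

```python
# Sketch of the two-field jobid Ralph describes: a 32-bit value holding
# a 16-bit mpirun identifier and a 16-bit job number within that mpirun.
# Putting the mpirun id in the high half is an assumption based on the
# sample value in this thread, not a documented ORTE guarantee.

def split_jobid(jobid):
    mpirun_id = (jobid >> 16) & 0xFFFF   # which mpirun
    local_job = jobid & 0xFFFF           # job within that mpirun
    return mpirun_id, local_job

def join_jobid(mpirun_id, local_job):
    return ((mpirun_id & 0xFFFF) << 16) | (local_job & 0xFFFF)
```

Under this assumption, `split_jobid(719585280)` gives `(10980, 0)`, i.e. the "local" jobid Greg would pass to orte-ps is the low half.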
Re: [OMPI devel] orte question
Ralph, The output format looks good, but I'm not sure it's quite correct. If I run the mpirun command, I see the following: mpirun:47520:num nodes:1:num jobs:2 jobid:0:state:RUNNING:slots:0:num procs:0 jobid:1:state:RUNNING:slots:1:num procs:4 process:x:rank:0:pid:47522:node:greg.local:state:SYNC REGISTERED process:x:rank:1:pid:47523:node:greg.local:state:SYNC REGISTERED process:x:rank:2:pid:47524:node:greg.local:state:SYNC REGISTERED process:x:rank:3:pid:47525:node:greg.local:state:SYNC REGISTERED Seems to indicate there are two jobs, but one of them has 0 procs. Is that expected? Not a huge problem, since I can just ignore the job with 0 procs. Greg On Jul 23, 2011, at 6:24 PM, Ralph Castain wrote: > Okay, you should have it in r24929. Use: > > orte-ps --parseable > > to get the new output. > > > On Jul 23, 2011, at 11:43 AM, Ralph Castain wrote: > >> Gar - have to eat my words a bit. The jobid requested by orte-ps is just the >> "local" jobid - i.e., it is expecting you to provide a number from 0-N, as I >> described below (copied here): >> >>> A jobid of 1 indicates the primary application, 2 and above would specify >>> comm_spawned jobs. >> >> Not providing the jobid at all corresponds to wildcard and returns the >> status of all jobs under that mpirun. >> >> To specify which mpirun you want info on, you use the --pid option. It is >> this option that isn't working properly - orte-ps returns info from all >> mpiruns and doesn't check to provide only data from the given pid. >> >> I'll fix that part, and implement the parsable output. >> >> >> On Jul 22, 2011, at 8:55 PM, Ralph Castain wrote: >> >>> >>> On Jul 22, 2011, at 3:57 PM, Greg Watson wrote: >>> >>>> Hi Ralph, >>>> >>>> I'd like three things :-) >>>> >>>> a) A --report-jobid option that prints the jobid on the first line in a >>>> form that can be passed to the -jobid option on ompi-ps. Probably tagging >>>> it in the output if -tag-output is enabled (e.g. jobid:) would be a >>>> good idea. 
>>>> >>>> b) The orte-ps command output to use the same jobid format. >>> >>> I started looking at the above, and found that orte-ps is just plain wrong >>> in the way it handles jobid. The jobid consists of two fields: a 16-bit >>> number indicating the mpirun, and a 16-bit number indicating the job within >>> that mpirun. Unfortunately, orte-ps sends a data request to every mpirun >>> out there instead of only to the one corresponding to that jobid. >>> >>> What we probably should do is have you indicate the mpirun of interest via >>> the -pid option, and then let jobid tell us which job you want within that >>> mpirun. A jobid of 1 indicates the primary application, 2 and above would >>> specify comm_spawned jobs. A jobid of -1 would return the status of all >>> jobs under that mpirun. >>> >>> If multiple mpiruns are being reported, then the "jobid" in the report >>> should again be the "local" jobid within that mpirun. >>> >>> After all, you don't really care what the orte-internal 16-bit identifier >>> is for that mpirun. >>> >>>> >>>> c) A more easily parsable output format from ompi-ps. It doesn't need to >>>> be a full blown XML format, just something like the following would >>>> suffice: >>>> >>>> jobid:719585280:state:Running:slots:1:num procs:4 >>>> process_name:./x:rank:0:pid:3082:node:node1.com:state:Running >>>> process_name:./x:rank:1:pid:4567:node:node5.com:state:Running >>>> process_name:./x:rank:2:pid:2343:node:node4.com:state:Running >>>> process_name:./x:rank:3:pid:3422:node:node7.com:state:Running >>>> jobid:345346663:state:running:slots:1:num procs:2 >>>> process_name:./x:rank:0:pid:5563:node:node2.com:state:Running >>>> process_name:./x:rank:1:pid:6677:node:node3.com:state:Running >>> >>> Shouldn't be too hard to do - bunch of if-then-else statements required, >>> though. >>> >>>> >>>> I'd be happy to help with any or all of these. >>> >>> Appreciate the offer - let me see how hard this proves to be... 
>>> >>>> >>>> Cheers, >>>> Greg >>>> >>>> On Jul 22, 2011, at 10:18 AM, Ralph Castain wrote: >>>> >>>>> Hmmm..
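The colon-delimited records quoted in this thread (`mpirun:...`, `jobid:...`, `process:...`) split cleanly into key/value pairs. A minimal parsing sketch, assuming the exact field layout shown in Greg's sample output; field names in other OMPI versions may differ:

```python
# Minimal sketch of parsing orte-ps --parseable output, assuming the
# colon-delimited key:value layout shown in this thread; field names
# may differ between OMPI versions.

def parse_orte_ps(text):
    """Group the per-process records under their enclosing job record."""
    jobs = []
    for line in text.strip().splitlines():
        fields = line.split(":")
        record = dict(zip(fields[0::2], fields[1::2]))
        if fields[0] == "jobid":
            record["procs"] = []          # processes listed after this job
            jobs.append(record)
        elif fields[0] == "process" and jobs:
            jobs[-1]["procs"].append(record)
        # the leading "mpirun:..." header line is ignored in this sketch
    return jobs
```

Run against Greg's sample, the job with zero procs (job 0, mpirun and its daemons) comes back as a record with an empty `procs` list, so a caller can simply skip it.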
Re: [OMPI devel] orte question
Hi Ralph, I'd like three things :-) a) A --report-jobid option that prints the jobid on the first line in a form that can be passed to the -jobid option on ompi-ps. Probably tagging it in the output if -tag-output is enabled (e.g. jobid:) would be a good idea. b) The orte-ps command output to use the same jobid format. c) A more easily parsable output format from ompi-ps. It doesn't need to be a full blown XML format, just something like the following would suffice: jobid:719585280:state:Running:slots:1:num procs:4 process_name:./x:rank:0:pid:3082:node:node1.com:state:Running process_name:./x:rank:1:pid:4567:node:node5.com:state:Running process_name:./x:rank:2:pid:2343:node:node4.com:state:Running process_name:./x:rank:3:pid:3422:node:node7.com:state:Running jobid:345346663:state:running:slots:1:num procs:2 process_name:./x:rank:0:pid:5563:node:node2.com:state:Running process_name:./x:rank:1:pid:6677:node:node3.com:state:Running I'd be happy to help with any or all of these. Cheers, Greg On Jul 22, 2011, at 10:18 AM, Ralph Castain wrote: > Hmmm...well, it looks like we could have made this nicer than we did :-/ > > If you add --report-uri to the mpirun command line, you'll get back the uri > for that mpirun. This has the form of :. As the -h option > indicates: > > -report-uri | --report-uri > Printout URI on stdout [-], stderr [+], or a file > [anything else] > > The "jobid" required by the orte-ps command is the one reported there. We > could easily add a --report-jobid option if that makes things easier. > > As to the difference in how orte-ps shows the jobid...well, that's probably > historical. orte-ps uses an orte utility function to print the jobid, and > that utility always shows the jobid in component form. Again, could add or > just use the integer version. > > > On Jul 22, 2011, at 7:01 AM, Greg Watson wrote: > >> Hi all, >> >> Does anyone know if it's possible to get the orte jobid from the mpirun >> command? 
If not, how are you supposed to get it to use with orte-ps? Also, >> orte-ps reports the jobid in [x,y] notation, but the jobid argument seems to >> be an integer. How does that work? >> >> Thanks, >> Greg >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] orte question
Hi all, Does anyone know if it's possible to get the orte jobid from the mpirun command? If not, how are you supposed to get it to use with orte-ps? Also, orte-ps reports the jobid in [x,y] notation, but the jobid argument seems to be an integer. How does that work? Thanks, Greg
Re: [OMPI devel] question about pids
Ralph, Adding a pid attribute to the process element would be great. Thanks, Greg On Feb 25, 2010, at 9:07 PM, Ralph Castain wrote: > Easy to do. I'll dump all the pids at the same time when the launch completes > - effectively, it will be at the same point used by other debuggers to attach. > > Have it for you in the trunk this weekend. Can you suggest an xml format you > would like? Otherwise, I'll just use the current proc output (used in the map > output) and add a "pid" field to it. > > On Thu, Feb 25, 2010 at 10:43 AM, Greg Watson wrote: > Ralph, > > We'd like this to be able to support attaching a debugger to the application. > Would it be difficult to provide? We don't need the information all at once, > each PID could be sent as the process launches (as long as the XML is > correctly formatted) if that makes it any easier. > > Greg > > On Feb 23, 2010, at 3:58 PM, Ralph Castain wrote: > > I don't see a way to currently do that - the rmaps display comes -before- > > process launch, so the pid will not be displayed. > > > > Do you need to see them? We'd have to add that output somewhere post-launch > > - perhaps when debuggers are initialized. > > > > On Feb 23, 2010, at 12:58 PM, Greg Watson wrote: > > > >> Ralph, > >> > >> I notice that you've got support in the XML output code to display the > >> pids of the processes, but I can't see how to enable them. Can you give me > >> any pointers? > >> > >> Thanks, > >> Greg
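If Ralph adds a pid attribute to each process element as discussed, pulling a rank-to-pid table out of the map XML is a one-liner with any XML parser. A sketch assuming hypothetical element and attribute names (`process`, `rank`, `pid`) based on this thread, not a fixed OMPI schema:

```python
import xml.etree.ElementTree as ET

# Sketch of extracting a rank -> pid table from map-style XML output,
# assuming Ralph's plan of adding a "pid" attribute to each process
# element. Element and attribute names here are illustrative only.

def collect_pids(xml_text):
    root = ET.fromstring(xml_text)
    return {int(p.get("rank")): int(p.get("pid"))
            for p in root.iter("process")}
```

A debugger front end could feed the resulting pids straight to its attach mechanism, which is the use case Greg describes.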
Re: [OMPI devel] question about pids
Ralph, We'd like this to be able to support attaching a debugger to the application. Would it be difficult to provide? We don't need the information all at once, each PID could be sent as the process launches (as long as the XML is correctly formatted) if that makes it any easier. Greg On Feb 23, 2010, at 3:58 PM, Ralph Castain wrote: > I don't see a way to currently do that - the rmaps display comes -before- > process launch, so the pid will not be displayed. > > Do you need to see them? We'd have to add that output somewhere post-launch - > perhaps when debuggers are initialized. > > On Feb 23, 2010, at 12:58 PM, Greg Watson wrote: > >> Ralph, >> >> I notice that you've got support in the XML output code to display the pids >> of the processes, but I can't see how to enable them. Can you give me any >> pointers? >> >> Thanks, >> Greg
[OMPI devel] question about pids
Ralph, I notice that you've got support in the XML output code to display the pids of the processes, but I can't see how to enable them. Can you give me any pointers? Thanks, Greg
Re: [OMPI devel] configure question
The problem seems to be that on SL, gfortran defaults to 32-bit binaries while gcc defaults to 64-bit. If I set FFLAGS=-m64 then configure finishes. Of course, I have no idea if a Fortran MPI program will actually *work*, but at least OMPI builds. That's all that matters, isn't it? :-). Greg On Feb 17, 2010, at 2:01 AM, Ralf Wildenhues wrote: > Hello Greg, > > * Greg Watson wrote on Tue, Feb 16, 2010 at 07:03:30PM CET: >> When I run configure under Snow Leopard (this is OMPI 1.3.4), I get the >> following: >> >> checking if C and Fortran 77 are link compatible... no >> ** >> It appears that your Fortran 77 compiler is unable to link against >> object files created by your C compiler. This typically indicates >> one of a few possibilities: >> >> - A conflict between CFLAGS and FFLAGS >> - A problem with your compiler installation(s) >> - Different default build options between compilers (e.g., C >> building for 32 bit and Fortran building for 64 bit) >> - Incompatible compilers >> >> Such problems can usually be solved by picking compatible compilers >> and/or CFLAGS and FFLAGS. More information (including exactly what >> command was given to the compilers and what error resulted when the >> commands were executed) is available in the config.log file in this >> directory. >> ** >> configure: error: C and Fortran 77 compilers are not link compatible. Can >> not continue. >> >> Anyone know off the top of their head what these options would be, or even if >> it is possible? > > Well, did you take a look at the corresponding bits in the config.log > file? Can you post them? > > Thanks, > Ralf
[OMPI devel] configure question
When I run configure under Snow Leopard (this is OMPI 1.3.4), I get the following: checking if C and Fortran 77 are link compatible... no ** It appears that your Fortran 77 compiler is unable to link against object files created by your C compiler. This typically indicates one of a few possibilities: - A conflict between CFLAGS and FFLAGS - A problem with your compiler installation(s) - Different default build options between compilers (e.g., C building for 32 bit and Fortran building for 64 bit) - Incompatible compilers Such problems can usually be solved by picking compatible compilers and/or CFLAGS and FFLAGS. More information (including exactly what command was given to the compilers and what error resulted when the commands were executed) is available in the config.log file in this directory. ** configure: error: C and Fortran 77 compilers are not link compatible. Can not continue. Anyone know off the top of their head what these options would be, or even if it is possible? Thanks, Greg
Re: [OMPI devel] Snow leopard builds
Rich, Have you updated your developer tools to Xcode 3.2.1? If you still have the old developer tools you were using before upgrading to SL, this may be the problem. Greg On Jan 24, 2010, at 7:33 PM, Paul H. Hargrove wrote: > I build ompi-1.3.3 on Snow Leopard with no problems. > I have not tried any other versions. > > -Paul > > Graham, Richard L. wrote: >> Has someone managed to build ompi on snow leopard ? I am trying to build, >> and it looks like configure does not detect the support for htonl and >> friends, so it adds the definition. >> static inline uint32_t htonl(uint32_t hostvar) { return hostvar; } >> with the compiler proceeding to do a macro substitution for htonl, which >> obviously does not work. I am hoping someone has run into this AND fixed >> the problem and could save me trying to figure out this part of our >> configure script, and how to fix it. >> >> Thanks, >> Rich > > -- > Paul H. Hargrove phhargr...@lbl.gov > Future Technologies Group Tel: +1-510-495-2352 > HPC Research Department Fax: +1-510-486-6900 > Lawrence Berkeley National Laboratory
Re: [OMPI devel] XML request
Hi Jeff, I think that sums up the situation nicely. For item #2, I wonder if it would be better to still use "ssh mpirun ...", but have mpirun fork itself "under the covers"? Not having an extra executable in your distribution would probably make long term maintenance easier. If Ralph can do anything in the 1.3/1.4 timeframe to sort out the few remaining issues, it would be appreciated. Regards, Greg On Sep 10, 2009, at 3:19 PM, Jeff Squyres wrote: Greg and I chatted on the phone about this. I now understand much better about what he is trying to do (short version: Eclipse is running on one machine, it is opening an ssh session to a remote machine and launching mpirun on that remote machine). Results of the phone conversation (for the web archives): - In the short term, there's a few remaining issues to be figured out. Ralph (who is now full-time at Cisco) may or may not have time to fix these in the near term. We (Open MPI) would happily review patches from others in this area if a solution is required before Ralph can get to it. - In the long term, we came up with a "thinking outside the box" solution that seems to be *much* better (think 1.5 and beyond). I'll describe the scheme, but at the same time, I'll indicate that Cisco likely does not have time in the foreseeable future to implement it. Again, we would be happy to provide guidance to anyone who would want to implement it (e.g., IBM) and/or review patches. - 1. Currently, the Eclipse plugin is effectively executing "ssh mpirun ...". This has several advantages: - Use whatever the native OMPI is on - No need for binary compatibility (i.e., version match of Eclipse plugin and remote OMPI installation) 2. The proposal is to change this to "ssh mpirun-proxy ..." 
where mpirun-proxy is a new executable that does the following: - fork/exec the real mpirun, making pipes to mpirun's stdin/stdout/stderr - tell mpirun to not display any IOF output from MPI processes - tell mpirun to not display any show_help messages - register to receive ORTE "events" (more below) via the ORTE comm library - register to receive IOF from all the MPI processes via the ORTE comm library - register to receive show_help messages from MPI processes via the ORTE comm library - upon receipt of specific events (e.g., determination of host/node/process maps), output this data encased in a specific XML schema (e.g., a specific set of XML tags to encase each data item in the nodemap) to ssh's stdout - read output from mpirun's stdout/stderr, output it on ssh's stdout, encased in / (etc.) - read IOF from MPI processes and output them to ssh's stdout, encased in appropriate XML tagging - read show_help messages from MPI processes and output them to ssh's stdout, encased in appropriate XML tagging --> Note that some of the above functionality already exists; it would just need to be marshaled together and used in some new logic. Other parts of the functionality do not exist and would need to be written (e.g., redirecting show_help messages to something other than the HNP). 3. Once #2 is done, remove all the XML processing from mpirun, libopen-rte, libmpi, and all OMPI plugins (since it's now all in mpirun-proxy). - This functionality would accomplish the following: - The code is distributed in Open MPI -- not Eclipse or an Eclipse plugin -- there's no additional compilation or linking step for the Eclipse plugin to talk to OMPI. - The Eclipse plugin, which already checks the output from ompi_info, can know when to use this new functionality (ssh mpirun-proxy instead of mpirun). - All the OMPI XML parsing can be centralized to the mpirun-proxy executable. This is a *huge* improvement over having XML sprinkled all over the OMPI code base, as it is now. 
Additionally, with this method, *all* OMPI output will be encased in XML before it is sent to the Eclipse plugin (via ssh's stdout). Today, we have "XML-lite" functionality in that "most" of OMPI's output is XML-ified, but there's oodles and oodles of corner cases where output is *not* XML-ified. The above proposal seems to be the best idea so far on how to address this issue in a holistic way (rather than adding a bunch more band-aids every time we find another output that isn't XML-ified). On Sep 10, 2009, at 9:23 AM, Greg Watson wrote: The most appealing thing about the XML option is that it just works "out of the box." Using a library API invariably requires compiling an agent or distributing pre-compiled binaries with all the associated complications. We tried that in the dim past and it was pretty unworkable. The other problem was that the API headers were not installed by de
Re: [OMPI devel] XML request
The most appealing thing about the XML option is that it just works "out of the box." Using a library API invariably requires compiling an agent or distributing pre-compiled binaries with all the associated complications. We tried that in the dim past and it was pretty unworkable. The other problem was that the API headers were not installed by default, so users were forced to install local copies of OMPI with development headers enabled. It was not a great end-user experience. Greg On Sep 10, 2009, at 8:45 AM, Jeff Squyres wrote: Thinking about this a little more ... This all seems like Open MPI-specific functionality for Eclipse. If that's the case, don't we have an ORTE tools communication library that could be used? IIRC, it pretty much does exactly what you want and would be far less clumsy than trying to jury-rig sending XML down files/fd's/whatever. I have dim recollections of the ORTE tools communication library API returning the data that you have asked for in data structures -- no parsing of XML at all (and, more importantly to us, no need to add all kinds of special code paths for wrapping our output in XML). If I'm right (and that's a big "if"!), is there a reason that this library is not attractive to you? On Sep 10, 2009, at 8:04 AM, Jeff Squyres wrote: On Sep 9, 2009, at 12:17 PM, Ralph Castain wrote: Hmmm... I never considered the possibility of output-filename being used that way. Interesting idea! That feels way weird to me -- for example, how do you know that you're actually outputting to a tty? FWIW: +1 on the idea of writing to numbered fd's passed on the command line. It just "feels" like a more POSIX-ish way of doing things...? I guess I'm surprised that that would be difficult to do from Java. -- Jeff Squyres jsquy...@cisco.com -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] XML request
Hi Jeff, The problem is that I'm not running the command from java (which has its own issues), but rather the command is started by the ssh shell/exec service. Unfortunately ssh only provides stdin, stdout, and stderr forwarding on fd's 0-2. There is no mechanism to do anything else. It would be possible to use a socket to tunnel over the ssh connection, but this seems overly complicated. Fortunately I know that the shell is connected to /dev/tty, so sending to this device should work consistently. I guess ideally what I need is a -turn-off-all-stdout-and-stderr-but-leave-xml-output-alone option :-) Regards, Greg On Sep 10, 2009, at 8:04 AM, Jeff Squyres wrote: On Sep 9, 2009, at 12:17 PM, Ralph Castain wrote: Hmmm... I never considered the possibility of output-filename being used that way. Interesting idea! That feels way weird to me -- for example, how do you know that you're actually outputting to a tty? FWIW: +1 on the idea of writing to numbered fd's passed on the command line. It just "feels" like a more POSIX-ish way of doing things...? I guess I'm surprised that that would be difficult to do from Java. -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] XML request
Hi Ralph, Looks good so far. The way I want to use this is to use /dev/tty as the xml-file and send any other stdout or stderr to /dev/null. I could use something like 'mpirun -xml-file /dev/tty >/dev/null 2>&1', but the syntax is shell specific which causes a problem for the ssh exec service. I noticed that mpirun has a -output-filename option, but when I try -output-filename /dev/null, I get: [Jarrah.local:01581] opal_os_dirpath_create: Error: Unable to create directory (/dev), unable to set the correct mode [-1] [Jarrah.local:01581] [[22927,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at line 406 Also, I'm not sure if -output-filename redirects both stdout and stderr, or just stdout. Any suggestions would be appreciated. Thanks, Greg On Sep 2, 2009, at 2:04 PM, Ralph Castain wrote: Okay Greg - give r21930 a whirl. It takes a new cmd line arg -xml-file foo as discussed below. You can also specify it as an MCA param: -mca orte_xml_file foo, or OMPI_MCA_orte_xml_file=foo Let me know how it works Ralph On Aug 31, 2009, at 7:26 PM, Greg Watson wrote: Hey Ralph, Unfortunately I don't think this is going to work for us. Most of the time we're starting the mpirun command using the ssh exec or shell service, neither of which provide any mechanism for reading from file descriptors other than 1 or 2. The only alternatives I see are: 1. Provide a separate command that starts mpirun at the end of a pipe that is connected to the fd passed using the -xml-fd argument. This command would need to be part of the OMPI distribution, because the whole purpose of the XML was to provide an out-of-the-box experience when using PTP with OMPI. 2. Implement an -xml-file option, but I could write the code for you. 3. Go back to limiting XML output to the map only. None of these are particularly ideal. If you can think of anything else, let me know. Regards, Greg On Aug 30, 2009, at 10:36 AM, Ralph Castain wrote: What if we instead offered a -xml-fd N option? 
I would rather not create a file myself. However, since you are calling mpirun yourself, this would allow you to create a pipe on your end, and then pass us the write end of the pipe. We would then send all XML output down that pipe. Jeff and I chatted about this and felt this might represent the cleanest solution. Sound okay? On Aug 28, 2009, at 6:33 AM, Greg Watson wrote: Ralph, Would this be doable? If we could guarantee that the only output that went to the file was XML then that would solve the problem. Greg On Aug 28, 2009, at 5:39 AM, Ashley Pittman wrote: On Thu, 2009-08-27 at 23:46 -0400, Greg Watson wrote: I didn't realize it would be such a problem. Unfortunately there is simply no way to reliably parse this kind of output, because it is impossible to know what the error messages are going to be, and presumably they could include XML-like formatting as well. The whole point of the XML was to try and simplify the parsing of the mpirun output, but it now looks like it's actually more difficult. I thought this might be difficult when I saw you were attempting it. Let me tell you about what Valgrind does because they have similar problems. Initially they had just added an --xml=yes option which put most of the valgrind (as distinct from application) output in xml tags. This works for simple cases and if you mix it with --log-file= it keeps the valgrind output separate from the application output. Unfortunately there are lots of places throughout the code where developers have inserted print statements (in the valgrind case these all go to the logfile) which means the xml is interspersed with non-xml output and hence impossible to parse reliably. What they have now done in the current release is to add an extra --xml-file= option as well as the --log-file= option. 
Now in the simple case all output from a normal run goes well formatted to the xml file and the log file remains empty; any tool that wraps around valgrind can parse the xml, which is guaranteed to be well formatted, and it can detect the presence of other messages by looking for output in the standard log file. The onus is then on tool writers to look at the remaining cases and decide if they are common or important enough to wrap in xml, and to propose a patch or removal of the non-formatted message entirely. The above seems to work well; having a separate log file for xml is a huge step forward, as it means that whilst the xml isn't necessarily complete, you can both parse it and tell when it's missing something. Of course, when looking at this level of tool integration it's better to use sockets than files (e.g. --xml-socket=localhost:1234 rather than --xml-file=/tmp/app_.xml) but I'll leave that up to you. I hope this gives you something to think over. Ashley, -- Ashley Pitt
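Ashley's two-file pattern is easy to consume on the tool side. The sketch below is only an illustration (the file names and XML schema are hypothetical, not Valgrind's or OMPI's actual ones): a wrapper parses the XML file and separately checks the plain log for stray, unformatted output.

```python
# Hypothetical sketch of the Valgrind-style two-file pattern:
# parse the XML file, and use the plain log file to detect any
# messages that escaped XML formatting.
import os
import xml.etree.ElementTree as ET


def read_tool_output(xml_path, log_path):
    """Return (tree, stray_text). tree is None if the XML is unparseable."""
    stray = ""
    if os.path.exists(log_path):
        with open(log_path) as f:
            stray = f.read().strip()
    try:
        tree = ET.parse(xml_path)
    except ET.ParseError:
        tree = None
    return tree, stray
```

A wrapper would treat a non-empty stray text as a sign that the XML may be incomplete and surface those messages to the user separately, exactly the "you can tell when it's missing something" property described above.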
Re: [OMPI devel] XML request
Hey Ralph, Unfortunately I don't think this is going to work for us. Most of the time we're starting the mpirun command using the ssh exec or shell service, neither of which provide any mechanism for reading from file descriptors other than 1 or 2. The only alternatives I see are: 1. Provide a separate command that starts mpirun at the end of a pipe that is connected to the fd passed using the -xml-fd argument. This command would need to be part of the OMPI distribution, because the whole purpose of the XML was to provide an out-of-the-box experience when using PTP with OMPI. 2. Implement an -xml-file option, but I could write the code for you. 3. Go back to limiting XML output to the map only. None of these are particularly ideal. If you can think of anything else, let me know. Regards, Greg On Aug 30, 2009, at 10:36 AM, Ralph Castain wrote: What if we instead offered a -xml-fd N option? I would rather not create a file myself. However, since you are calling mpirun yourself, this would allow you to create a pipe on your end, and then pass us the write end of the pipe. We would then send all XML output down that pipe. Jeff and I chatted about this and felt this might represent the cleanest solution. Sound okay? On Aug 28, 2009, at 6:33 AM, Greg Watson wrote: Ralph, Would this be doable? If we could guarantee that the only output that went to the file was XML then that would solve the problem. Greg On Aug 28, 2009, at 5:39 AM, Ashley Pittman wrote: On Thu, 2009-08-27 at 23:46 -0400, Greg Watson wrote: I didn't realize it would be such a problem. Unfortunately there is simply no way to reliably parse this kind of output, because it is impossible to know what the error messages are going to be, and presumably they could include XML-like formatting as well. The whole point of the XML was to try and simplify the parsing of the mpirun output, but it now looks like it's actually more difficult. I thought this might be difficult when I saw you were attempting it. 
Let me tell you about what Valgrind does, because they have similar problems. Initially they just added an --xml=yes option, which put most of the valgrind (as distinct from application) output in xml tags. This works for simple cases, and if you mix it with --log-file= it keeps the valgrind output separate from the application output. Unfortunately there are lots of places throughout the code where developers have inserted print statements (in the valgrind case these all go to the logfile), which means the xml is interspersed with non-xml output and hence impossible to parse reliably. What they have now done in the current release is to add an extra --xml-file= option as well as the --log-file= option. Now in the simple case all output from a normal run goes well formatted to the xml file and the log file remains empty; any tool that wraps around valgrind can parse the xml, which is guaranteed to be well formatted, and it can detect the presence of other messages by looking for output in the standard log file. The onus is then on tool writers to look at the remaining cases and decide if they are common or important enough to wrap in xml, and to propose a patch or removal of the non-formatted message entirely. The above seems to work well; having a separate log file for xml is a huge step forward, as it means that whilst the xml isn't necessarily complete, you can both parse it and tell when it's missing something. Of course, when looking at this level of tool integration it's better to use sockets than files (e.g. --xml-socket=localhost:1234 rather than --xml-file=/tmp/app_.xml) but I'll leave that up to you. I hope this gives you something to think over. Ashley, -- Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
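The pipe-based alternative (option 1 above) is straightforward to sketch from the tool side. This is only an illustration: -xml-fd is the option name under discussion, not an implemented mpirun flag, and the command being launched is a stand-in.

```python
# Sketch of the proposed -xml-fd mechanism (hypothetical flag): the
# tool creates a pipe, hands the launched command the write end, and
# reads XML from the read end, keeping it off stdout/stderr entirely.
import os
import subprocess


def launch_with_xml_fd(cmd):
    r, w = os.pipe()
    proc = subprocess.Popen(
        cmd + ["-xml-fd", str(w)],  # hypothetical option name
        pass_fds=(w,),              # let the child inherit the write end
    )
    os.close(w)                     # parent keeps only the read end
    with os.fdopen(r) as xml_stream:
        xml_text = xml_stream.read()  # EOF when the child closes the fd
    proc.wait()
    return xml_text
```

Closing the parent's copy of the write end right after the fork is what makes the read terminate cleanly when the child exits; this is the usual design for fd-passing launchers.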
Re: [OMPI devel] XML request
Ralph, Would this be doable? If we could guarantee that the only output that went to the file was XML then that would solve the problem. Greg

On Aug 28, 2009, at 5:39 AM, Ashley Pittman wrote: On Thu, 2009-08-27 at 23:46 -0400, Greg Watson wrote: I didn't realize it would be such a problem. Unfortunately there is simply no way to reliably parse this kind of output, because it is impossible to know what the error messages are going to be, and presumably they could include XML-like formatting as well. The whole point of the XML was to try and simplify the parsing of the mpirun output, but it now looks like it's actually more difficult. I thought this might be difficult when I saw you were attempting it. Let me tell you about what Valgrind does, because they have similar problems. Initially they just added an --xml=yes option, which put most of the valgrind (as distinct from application) output in xml tags. This works for simple cases, and if you mix it with --log-file= it keeps the valgrind output separate from the application output. Unfortunately there are lots of places throughout the code where developers have inserted print statements (in the valgrind case these all go to the logfile), which means the xml is interspersed with non-xml output and hence impossible to parse reliably. What they have now done in the current release is to add an extra --xml-file= option as well as the --log-file= option. Now in the simple case all output from a normal run goes well formatted to the xml file and the log file remains empty; any tool that wraps around valgrind can parse the xml, which is guaranteed to be well formatted, and it can detect the presence of other messages by looking for output in the standard log file. The onus is then on tool writers to look at the remaining cases and decide if they are common or important enough to wrap in xml, and to propose a patch or removal of the non-formatted message entirely.
The above seems to work well; having a separate log file for xml is a huge step forward, as it means that whilst the xml isn't necessarily complete, you can both parse it and tell when it's missing something. Of course, when looking at this level of tool integration it's better to use sockets than files (e.g. --xml-socket=localhost:1234 rather than --xml-file=/tmp/app_.xml) but I'll leave that up to you. I hope this gives you something to think over. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
Re: [OMPI devel] XML request
Hi Ralph, I didn't realize it would be such a problem. Unfortunately there is simply no way to reliably parse this kind of output, because it is impossible to know what the error messages are going to be, and presumably they could include XML-like formatting as well. The whole point of the XML was to try and simplify the parsing of the mpirun output, but it now looks like it's actually more difficult. I seem to remember that you said that the XML between <map> and </map> is always correctly formatted. I think the only feasible approach for XML mode now is: 1. Drop the <mpirun> and </mpirun> tags. 2. Keep everything between <map> and </map> as is. 3. Drop the <stdout>, <stderr>, and <stddiag> tags and just use free format for program output and errors. 4. Go back to using stdout for program output, and stderr for errors. I will just ignore everything before <map> and after </map>, and send stdout and stderr (minus the text between <map> and </map>) to a console so the user can see what happened when the job ran. I think this was the situation in an earlier version (1.3.0?) Thanks for your patience, Greg

On Aug 27, 2009, at 10:44 PM, Ralph Castain wrote: Hi Greg I fixed these so they will get properly formatted. However, it is symptomatic of a much broader problem - namely, that developers have inserted print statements throughout the code for reporting errors. There simply isn't any easy way for me to catch them all. Jeff and I have talked about ways of approaching that problem. However, nothing is entirely perfect. For example, an error detected by slurm will generate a message that lies completely outside OMPI's scope, and will be asynchronous with anything we try to report. Thus, you are always going to have to be prepared to deal with improperly formatted messages. For example, you could easily get the following garbled output: mpirun was unable to stSLURM-GENERATED-ERROR-MESSAGE art the job&#010; You get the picture, I'm sure. There is nothing I can do about this, so your system is simply going to have to figure out how to handle it.
The only other solution I can propose is going back to building against the tool library I created, but that has its own issues too... Ralph

Date: August 25, 2009 9:23:00 AM MDT To: Open MPI Developers Subject: Re: [OMPI devel] XML request Reply-To: Open MPI Developers

Ralph, Looks like some messages are taking a different path: $ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np 3 xxx <stderr>-- mpirun was unable to launch the specified application as it could not find an executable: Executable: xxx Node: Jarrah.local while attempting to start process rank 0.</stderr> <stderr>-- 3 total processes failed to start</stderr> Cheers, Greg

On Aug 20, 2009, at 3:24 PM, Ralph Castain wrote: Okay - try r21858. Ralph

On Aug 20, 2009, at 12:36 PM, Greg Watson wrote: Hi Ralph, Cool! Regarding the scope of the tags, I never really thought about output from the command itself. I propose that any output that can't otherwise be classified be sent using the appropriate <stdout> or <stderr> tags with no "rank" attribute. Cheers, Greg

On Aug 20, 2009, at 1:52 PM, Ralph Castain wrote: Hi Greg I can catch most of these and will do so as they flow through a single code path. However, there are places sprinkled throughout the code where people directly output warning and error info - these will be more problematic and represent a degree of change that is probably outside the comfort zone for the 1.3 series. After talking with Jeff about it, we propose that I make the simple change that will catch messages like those below. For the broader problem, we believe that some discussion with you about the degree of granularity exposed through the xml output might help define the overall solution. For example, can we just label all stderr messages with <stderr> tags, or do you need more detailed tagging (e.g., rank, file, line, etc.)? That discussion can occur later - for now, I'll catch these. Will let you know when it is ready to test! Ralph

On Aug 20, 2009, at 11:16 AM, Greg Watson wrote: Ralph, One more thing.
Even with XML enabled, I notice that some error messages are still sent to stderr without XML tags (see below.) Any chance these could be sent to stdout wrapped in <stderr> tags? Thanks, Greg $ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np 1 ./pop pop_in -- MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 0. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see outpu
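Greg's "keep the map, ignore the rest" strategy from earlier in the thread can be sketched on the tool side. This is only an illustration, and the <map> tag name is assumed from the discussion rather than taken from an actual OMPI output spec.

```python
# Minimal sketch of the proposed parsing strategy: extract the
# well-formed <map>...</map> block and treat everything else in the
# mpirun output as free-form console text for the user.
import re

MAP_RE = re.compile(r"<map>.*?</map>", re.DOTALL)


def split_output(text):
    """Return (map_xml_or_None, remaining_console_text)."""
    m = MAP_RE.search(text)
    if not m:
        return None, text
    console = text[:m.start()] + text[m.end():]
    return m.group(0), console
```

Anything outside the map, including the untagged error messages complained about here, ends up in the console text instead of breaking the XML parse.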
Re: [OMPI devel] XML request
Ralph, Looks like some messages are taking a different path: $ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np 3 xxx <stderr>-- mpirun was unable to launch the specified application as it could not find an executable: Executable: xxx Node: Jarrah.local while attempting to start process rank 0.</stderr> <stderr>-- 3 total processes failed to start</stderr> Cheers, Greg

On Aug 20, 2009, at 3:24 PM, Ralph Castain wrote: Okay - try r21858. Ralph

On Aug 20, 2009, at 12:36 PM, Greg Watson wrote: Hi Ralph, Cool! Regarding the scope of the tags, I never really thought about output from the command itself. I propose that any output that can't otherwise be classified be sent using the appropriate <stdout> or <stderr> tags with no "rank" attribute. Cheers, Greg

On Aug 20, 2009, at 1:52 PM, Ralph Castain wrote: Hi Greg I can catch most of these and will do so as they flow through a single code path. However, there are places sprinkled throughout the code where people directly output warning and error info - these will be more problematic and represent a degree of change that is probably outside the comfort zone for the 1.3 series. After talking with Jeff about it, we propose that I make the simple change that will catch messages like those below. For the broader problem, we believe that some discussion with you about the degree of granularity exposed through the xml output might help define the overall solution. For example, can we just label all stderr messages with <stderr> tags, or do you need more detailed tagging (e.g., rank, file, line, etc.)? That discussion can occur later - for now, I'll catch these. Will let you know when it is ready to test! Ralph

On Aug 20, 2009, at 11:16 AM, Greg Watson wrote: Ralph, One more thing. Even with XML enabled, I notice that some error messages are still sent to stderr without XML tags (see below.) Any chance these could be sent to stdout wrapped in <stderr> tags?
Thanks, Greg $ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np 1 ./pop pop_in -- MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 0. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them. -- <stdout rank="0">Parallel Ocean Program (POP) Version 2.0.1 Released 21 Jan 2004</stdout> <stdout rank="0">POP aborting... Input nprocs not same as system request</stdout> -- mpirun has exited due to process rank 0 with PID 15201 on node 4pcnuggets exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here). ------

On Aug 19, 2009, at 10:48 AM, Greg Watson wrote: Ralph, Looks like it's working now. Thanks, Greg

On Aug 18, 2009, at 5:21 PM, Ralph Castain wrote: Give r21836 a try and see if it still gets out of order. Ralph

On Aug 18, 2009, at 2:18 PM, Greg Watson wrote: Ralph, Not sure that's it because all XML output should be via stdout. Greg

On Aug 18, 2009, at 3:53 PM, Ralph Castain wrote: Hmmm... let me try adding a fflush after the output to force it out. Best guess is that you are seeing a little race condition - the map output is coming over stderr, while the tag is coming over stdout.

On Tue, Aug 18, 2009 at 12:53 PM, Greg Watson wrote: Hi Ralph, I'm seeing something strange. When I run "mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... but when I run "ssh localhost mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... Any ideas? Thanks, Greg

On Aug 17, 2009, at 11:16 PM, Ralph Castain wrote: Should be done on trunk
Re: [OMPI devel] XML request
Hi Ralph, Cool! Regarding the scope of the tags, I never really thought about output from the command itself. I propose that any output that can't otherwise be classified be sent using the appropriate <stdout> or <stderr> tags with no "rank" attribute. Cheers, Greg

On Aug 20, 2009, at 1:52 PM, Ralph Castain wrote: Hi Greg I can catch most of these and will do so as they flow through a single code path. However, there are places sprinkled throughout the code where people directly output warning and error info - these will be more problematic and represent a degree of change that is probably outside the comfort zone for the 1.3 series. After talking with Jeff about it, we propose that I make the simple change that will catch messages like those below. For the broader problem, we believe that some discussion with you about the degree of granularity exposed through the xml output might help define the overall solution. For example, can we just label all stderr messages with <stderr> tags, or do you need more detailed tagging (e.g., rank, file, line, etc.)? That discussion can occur later - for now, I'll catch these. Will let you know when it is ready to test! Ralph

On Aug 20, 2009, at 11:16 AM, Greg Watson wrote: Ralph, One more thing. Even with XML enabled, I notice that some error messages are still sent to stderr without XML tags (see below.) Any chance these could be sent to stdout wrapped in <stderr> tags? Thanks, Greg $ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np 1 ./pop pop_in -- MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 0. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them. -- <stdout rank="0">Parallel Ocean Program (POP) Version 2.0.1 Released 21 Jan 2004</stdout> <stdout rank="0">POP aborting... Input nprocs not same as system request</stdout> -- mpirun has exited due to process rank 0 with PID 15201 on node 4pcnuggets exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here). ------

On Aug 19, 2009, at 10:48 AM, Greg Watson wrote: Ralph, Looks like it's working now. Thanks, Greg

On Aug 18, 2009, at 5:21 PM, Ralph Castain wrote: Give r21836 a try and see if it still gets out of order. Ralph

On Aug 18, 2009, at 2:18 PM, Greg Watson wrote: Ralph, Not sure that's it because all XML output should be via stdout. Greg

On Aug 18, 2009, at 3:53 PM, Ralph Castain wrote: Hmmm... let me try adding a fflush after the output to force it out. Best guess is that you are seeing a little race condition - the map output is coming over stderr, while the tag is coming over stdout.

On Tue, Aug 18, 2009 at 12:53 PM, Greg Watson wrote: Hi Ralph, I'm seeing something strange. When I run "mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... but when I run "ssh localhost mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... Any ideas? Thanks, Greg

On Aug 17, 2009, at 11:16 PM, Ralph Castain wrote: Should be done on trunk with r21826 - would you please give it a try and let me know if that meets requirements? If so, I'll move it to 1.3.4. Thanks Ralph

On Aug 17, 2009, at 6:42 AM, Greg Watson wrote: Hi Ralph, Yes, you'd just need to issue the start tag prior to any other XML output, then the end tag when it's guaranteed all other XML output has been sent. Greg

On Aug 17, 2009, at 7:44 AM, Ralph Castain wrote: All things are possible - some just a tad more painful than others. It looks like you want the mpirun tags to flow around all output during the run - i.e., there is only one pair of mpirun tags that surround anything that might come out of the job. True? If so, that would be trivial.
On Aug 14, 2009, at 9:25 AM, Greg Watson wrote: Ralph, Woul
Re: [OMPI devel] XML request
Ralph, One more thing. Even with XML enabled, I notice that some error messages are still sent to stderr without XML tags (see below.) Any chance these could be sent to stdout wrapped in <stderr> tags? Thanks, Greg $ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np 1 ./pop pop_in -- MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 0. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them. -- <stdout rank="0">Parallel Ocean Program (POP) Version 2.0.1 Released 21 Jan 2004</stdout> <stdout rank="0">POP aborting... Input nprocs not same as system request</stdout> -- mpirun has exited due to process rank 0 with PID 15201 on node 4pcnuggets exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here). ------

On Aug 19, 2009, at 10:48 AM, Greg Watson wrote: Ralph, Looks like it's working now. Thanks, Greg

On Aug 18, 2009, at 5:21 PM, Ralph Castain wrote: Give r21836 a try and see if it still gets out of order. Ralph

On Aug 18, 2009, at 2:18 PM, Greg Watson wrote: Ralph, Not sure that's it because all XML output should be via stdout. Greg

On Aug 18, 2009, at 3:53 PM, Ralph Castain wrote: Hmmm... let me try adding a fflush after the output to force it out. Best guess is that you are seeing a little race condition - the map output is coming over stderr, while the tag is coming over stdout.

On Tue, Aug 18, 2009 at 12:53 PM, Greg Watson wrote: Hi Ralph, I'm seeing something strange. When I run "mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... but when I run "ssh localhost mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... Any ideas?
Thanks, Greg

On Aug 17, 2009, at 11:16 PM, Ralph Castain wrote: Should be done on trunk with r21826 - would you please give it a try and let me know if that meets requirements? If so, I'll move it to 1.3.4. Thanks Ralph

On Aug 17, 2009, at 6:42 AM, Greg Watson wrote: Hi Ralph, Yes, you'd just need to issue the start tag prior to any other XML output, then the end tag when it's guaranteed all other XML output has been sent. Greg

On Aug 17, 2009, at 7:44 AM, Ralph Castain wrote: All things are possible - some just a tad more painful than others. It looks like you want the mpirun tags to flow around all output during the run - i.e., there is only one pair of mpirun tags that surround anything that might come out of the job. True? If so, that would be trivial.

On Aug 14, 2009, at 9:25 AM, Greg Watson wrote: Ralph, Would it be possible to get mpirun to issue start and end tags if the -xml option is used? Currently there is no way to determine when the output starts and finishes, which makes parsing the XML tricky, particularly if something else generates output (e.g. the shell). Something like this would be ideal: ... ... ... If we could get it in 1.3.4 even better. :-) Thanks, Greg
Re: [OMPI devel] XML request
Ralph, Looks like it's working now. Thanks, Greg

On Aug 18, 2009, at 5:21 PM, Ralph Castain wrote: Give r21836 a try and see if it still gets out of order. Ralph

On Aug 18, 2009, at 2:18 PM, Greg Watson wrote: Ralph, Not sure that's it because all XML output should be via stdout. Greg

On Aug 18, 2009, at 3:53 PM, Ralph Castain wrote: Hmmm... let me try adding a fflush after the output to force it out. Best guess is that you are seeing a little race condition - the map output is coming over stderr, while the tag is coming over stdout.

On Tue, Aug 18, 2009 at 12:53 PM, Greg Watson wrote: Hi Ralph, I'm seeing something strange. When I run "mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... but when I run "ssh localhost mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... Any ideas? Thanks, Greg

On Aug 17, 2009, at 11:16 PM, Ralph Castain wrote: Should be done on trunk with r21826 - would you please give it a try and let me know if that meets requirements? If so, I'll move it to 1.3.4. Thanks Ralph

On Aug 17, 2009, at 6:42 AM, Greg Watson wrote: Hi Ralph, Yes, you'd just need to issue the start tag prior to any other XML output, then the end tag when it's guaranteed all other XML output has been sent. Greg

On Aug 17, 2009, at 7:44 AM, Ralph Castain wrote: All things are possible - some just a tad more painful than others. It looks like you want the mpirun tags to flow around all output during the run - i.e., there is only one pair of mpirun tags that surround anything that might come out of the job. True? If so, that would be trivial.

On Aug 14, 2009, at 9:25 AM, Greg Watson wrote: Ralph, Would it be possible to get mpirun to issue start and end tags if the -xml option is used? Currently there is no way to determine when the output starts and finishes, which makes parsing the XML tricky, particularly if something else generates output (e.g. the shell). Something like this would be ideal: ... ... ...
If we could get it in 1.3.4 even better. :-) Thanks, Greg
Re: [OMPI devel] XML request
Ralph, Not sure that's it because all XML output should be via stdout. Greg

On Aug 18, 2009, at 3:53 PM, Ralph Castain wrote: Hmmm... let me try adding a fflush after the output to force it out. Best guess is that you are seeing a little race condition - the map output is coming over stderr, while the tag is coming over stdout.

On Tue, Aug 18, 2009 at 12:53 PM, Greg Watson wrote: Hi Ralph, I'm seeing something strange. When I run "mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... but when I run "ssh localhost mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... Any ideas? Thanks, Greg

On Aug 17, 2009, at 11:16 PM, Ralph Castain wrote: Should be done on trunk with r21826 - would you please give it a try and let me know if that meets requirements? If so, I'll move it to 1.3.4. Thanks Ralph

On Aug 17, 2009, at 6:42 AM, Greg Watson wrote: Hi Ralph, Yes, you'd just need to issue the start tag prior to any other XML output, then the end tag when it's guaranteed all other XML output has been sent. Greg

On Aug 17, 2009, at 7:44 AM, Ralph Castain wrote: All things are possible - some just a tad more painful than others. It looks like you want the mpirun tags to flow around all output during the run - i.e., there is only one pair of mpirun tags that surround anything that might come out of the job. True? If so, that would be trivial.

On Aug 14, 2009, at 9:25 AM, Greg Watson wrote: Ralph, Would it be possible to get mpirun to issue start and end tags if the -xml option is used? Currently there is no way to determine when the output starts and finishes, which makes parsing the XML tricky, particularly if something else generates output (e.g. the shell). Something like this would be ideal: ... ... ... If we could get it in 1.3.4 even better.
:-) Thanks, Greg
Re: [OMPI devel] XML request
Hi Ralph, I'm seeing something strange. When I run "mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... but when I run "ssh localhost mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map...", I see: ... Any ideas? Thanks, Greg

On Aug 17, 2009, at 11:16 PM, Ralph Castain wrote: Should be done on trunk with r21826 - would you please give it a try and let me know if that meets requirements? If so, I'll move it to 1.3.4. Thanks Ralph

On Aug 17, 2009, at 6:42 AM, Greg Watson wrote: Hi Ralph, Yes, you'd just need to issue the start tag prior to any other XML output, then the end tag when it's guaranteed all other XML output has been sent. Greg

On Aug 17, 2009, at 7:44 AM, Ralph Castain wrote: All things are possible - some just a tad more painful than others. It looks like you want the mpirun tags to flow around all output during the run - i.e., there is only one pair of mpirun tags that surround anything that might come out of the job. True? If so, that would be trivial.

On Aug 14, 2009, at 9:25 AM, Greg Watson wrote: Ralph, Would it be possible to get mpirun to issue start and end tags if the -xml option is used? Currently there is no way to determine when the output starts and finishes, which makes parsing the XML tricky, particularly if something else generates output (e.g. the shell). Something like this would be ideal: ... ... ... If we could get it in 1.3.4 even better. :-) Thanks, Greg
Re: [OMPI devel] XML request
Hi Ralph, Yes, you'd just need to issue the start tag prior to any other XML output, then the end tag when it's guaranteed all other XML output has been sent. Greg

On Aug 17, 2009, at 7:44 AM, Ralph Castain wrote: All things are possible - some just a tad more painful than others. It looks like you want the mpirun tags to flow around all output during the run - i.e., there is only one pair of mpirun tags that surround anything that might come out of the job. True? If so, that would be trivial.

On Aug 14, 2009, at 9:25 AM, Greg Watson wrote: Ralph, Would it be possible to get mpirun to issue start and end tags if the -xml option is used? Currently there is no way to determine when the output starts and finishes, which makes parsing the XML tricky, particularly if something else generates output (e.g. the shell). Something like this would be ideal: ... ... ... If we could get it in 1.3.4 even better. :-) Thanks, Greg
[OMPI devel] XML request
Ralph, Would it be possible to get mpirun to issue start and end tags if the -xml option is used? Currently there is no way to determine when the output starts and finishes, which makes parsing the XML tricky, particularly if something else generates output (e.g. the shell). Something like this would be ideal: ... ... ... If we could get it in 1.3.4 even better. :-) Thanks, Greg
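On the tool side, the start/end tags requested here make "when does the XML stream end?" trivial to answer. The sketch below assumes the tag names are <mpirun> and </mpirun>, which is an inference from this thread rather than a documented format, and assumes the tags appear on their own lines.

```python
# Sketch of consuming the requested start/end tags: yield only the
# lines between <mpirun> and </mpirun>, ignoring anything the shell
# or ssh emits before or after the run.
def xml_lines(stream):
    inside = False
    for line in stream:
        if line.strip() == "<mpirun>":
            inside = True
        if inside:
            yield line
        if line.strip() == "</mpirun>":
            return  # end tag seen: the XML output is complete
```

Seeing the closing tag, rather than a timeout or stream EOF, is what tells the wrapper the run's XML is complete even when extra shell output follows.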
Re: [OMPI devel] XML output
Ralph, This is working well in trunk. If it can go into 1.3.4 that would be great. Thanks, Greg On Jul 16, 2009, at 10:05 PM, Ralph Castain wrote: Okay, this is fixed in r21706 on the trunk. I will request it be pushed into 1.3.4. Thanks for catching the problem. Ralph On Jul 16, 2009, at 12:16 PM, Greg Watson wrote: Ralph, One of our users is seeing the following output with the XML option enabled (1.3.3): time_mix_freq = 17 Time mixing option: avgfit -- time averaging with timestep chosen to fit exactly into one day or coupling interval Averaging time steps are at step numbers 2, 17 each day It appears that the XML tags for the same task are being interleaved. Any idea if this is fixable? Thanks, Greg ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] XML output
Ralph, One of our users is seeing the following output with the XML option enabled (1.3.3): time_mix_freq = 17 Time mixing option: avgfit -- time averaging with timestep chosen to fit exactly into one day or coupling interval Averaging time steps are at step numbers 2, 17 each day It appears that the XML tags for the same task are being interleaved. Any idea if this is fixable? Thanks, Greg
Re: [OMPI devel] XML stdout/stderr
Close, but no banana! Can you add a semicolon to the end of each? So "&lt" should be replaced by "&lt;", etc. Thanks, Greg On May 26, 2009, at 8:45 PM, Ralph Castain wrote: Guess I had just never seen that format before - thanks for clarifying! I committed the revisions to the trunk in r21285 - see what you think... Ralph On Tue, May 26, 2009 at 1:55 PM, Greg Watson wrote: Ralph, Both my proposals are correct XML and should be parsable by any conforming XML parser. Just changing the tags will not work because any text that contains "&", "<", or ">" will still confuse an XML parser. Regards, Greg On May 26, 2009, at 8:25 AM, Ralph Castain wrote: Yo Greg I'm slow, but it did hit me that there may be a simpler solution after all. I gather that the problem is that the user's output could have tags in it that match our own, thus causing tag-confusion. True? My concern is that our proposed solution generates pidgin-xml which could only ever be translated by a specially written parser. Kinda makes xml a little moot in ways. What if we simply change the name of our tags to something ompi-specific? I could tag things with , for example. This would follow the natural naming convention for internal variables, and would avoid any conflicts unless the user were truly stupid - in which case, the onus would be on them. Would that resolve the problem? Ralph On Tue, May 26, 2009 at 5:42 AM, Ralph Castain wrote: On Mon, May 25, 2009 at 9:10 AM, Greg Watson wrote: Ralph, In life, nothing is ever easy... :-) While the XML output is working well, I've come across an issue with stdout/stderr. Unfortunately it's not just enough to wrap it in tags, because it's possible that the output will contain XML formatting information. There are two ways to get around this. The easiest is to wrap the output in "<![CDATA[ ... ]]>". This has the benefit of being relatively easy, but will fail if the output contains the string "]]>". 
The other way is to replace all instances of "&", "<", and ">" with "&amp;", "&lt;", and "&gt;" respectively. This is safer, but requires more processing. Thoughts? "Ick" immediately comes to mind, but is hardly helpful. :-) I am already doing some processing to deal with linefeeds in the middle of output streams, so adding these three special chars isn't -that- big a deal. I can have a test version for you in the next day or so (svn trunk) - I am on reduced hours while moving my son (driving across country). Let's give that a try and see if it resolves the problem... Greg ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] XML stdout/stderr
Ralph, Both my proposals are correct XML and should be parsable by any conforming XML parser. Just changing the tags will not work because any text that contains "&", "<", or ">" will still confuse an XML parser. Regards, Greg On May 26, 2009, at 8:25 AM, Ralph Castain wrote: Yo Greg I'm slow, but it did hit me that there may be a simpler solution after all. I gather that the problem is that the user's output could have tags in it that match our own, thus causing tag-confusion. True? My concern is that our proposed solution generates pidgin-xml which could only ever be translated by a specially written parser. Kinda makes xml a little moot in ways. What if we simply change the name of our tags to something ompi-specific? I could tag things with , for example. This would follow the natural naming convention for internal variables, and would avoid any conflicts unless the user were truly stupid - in which case, the onus would be on them. Would that resolve the problem? Ralph On Tue, May 26, 2009 at 5:42 AM, Ralph Castain wrote: On Mon, May 25, 2009 at 9:10 AM, Greg Watson wrote: Ralph, In life, nothing is ever easy... :-) While the XML output is working well, I've come across an issue with stdout/stderr. Unfortunately it's not just enough to wrap it in tags, because it's possible that the output will contain XML formatting information. There are two ways to get around this. The easiest is to wrap the output in "<![CDATA[ ... ]]>". This has the benefit of being relatively easy, but will fail if the output contains the string "]]>". The other way is to replace all instances of "&", "<", and ">" with "&amp;", "&lt;", and "&gt;" respectively. This is safer, but requires more processing. Thoughts? "Ick" immediately comes to mind, but is hardly helpful. :-) I am already doing some processing to deal with linefeeds in the middle of output streams, so adding these three special chars isn't -that- big a deal. 
I can have a test version for you in the next day or so (svn trunk) - I am on reduced hours while moving my son (driving across country). Let's give that a try and see if it resolves the problem... Greg ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] XML stdout/stderr
Ralph, In life, nothing is ever easy... While the XML output is working well, I've come across an issue with stdout/stderr. Unfortunately it's not just enough to wrap it in tags, because it's possible that the output will contain XML formatting information. There are two ways to get around this. The easiest is to wrap the output in "<![CDATA[ ... ]]>". This has the benefit of being relatively easy, but will fail if the output contains the string "]]>". The other way is to replace all instances of "&", "<", and ">" with "&amp;", "&lt;", and "&gt;" respectively. This is safer, but requires more processing. Thoughts? Greg
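The trade-off between the two approaches Greg describes can be sketched in a few lines. This is an illustration in Python, not OMPI's implementation (Python's stdlib `xml.sax.saxutils.escape` does the same entity replacement):

```python
import xml.etree.ElementTree as ET

def escape_entities(text):
    """Entity replacement: safe for any payload. '&' must be handled first,
    or the '&' introduced by '&lt;'/'&gt;' would itself get escaped."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def wrap_cdata(text):
    """CDATA wrapping: cheaper, but cannot represent a payload containing ']]>',
    since that sequence would terminate the section early."""
    if "]]>" in text:
        raise ValueError("payload contains ']]>'; CDATA cannot represent it")
    return "<![CDATA[" + text + "]]>"

# A conforming parser recovers the original text from the escaped form.
sample = 'if (a < b && c > d)'
roundtrip = ET.fromstring("<stdout>%s</stdout>" % escape_entities(sample)).text
```

Both forms are valid XML, which is Greg's point: any conforming parser handles them, whereas raw `&`, `<`, or `>` in element content does not parse at all.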
Re: [OMPI devel] -display-map behavior in 1.3
Ralph, I did find another issue in 1.3 though. It looks like with the -xml option you're sending output tagged with to stderr, whereas it would probably be better if everything was sent to stdout. Otherwise it's necessary to parse the stderr stream separately. Greg On May 1, 2009, at 10:47 AM, Greg Watson wrote: Arrgh! Sorry, my bad. I must have been linked against an old version or something. When I recompiled the output went away. Greg On May 1, 2009, at 10:09 AM, Ralph Castain wrote: Interesting - I'm not seeing this behavior: graywolf54:trunk rhc$ mpirun -n 3 --xml --display-map hostname graywolf54.lanl.gov graywolf54.lanl.gov graywolf54.lanl.gov graywolf54:trunk rhc$ Can you tell me more about when you see this? Note that the display-map output should always appear on stderr because that is our default output device. On Fri, May 1, 2009 at 7:39 AM, Ralph Castain wrote: Hmmm...no, that's a bug. I'll fix it. Thanks! On Fri, May 1, 2009 at 7:24 AM, Greg Watson wrote: Ralph, I've just noticed that if I use '-xml -display-map', I get the xml version of the map and then the non-xml version is sent to stderr (wrapped in xml tags). Was this by design? In my view it would be better to suppress the non-xml map altogether if the -xml option is given. 1.4 seems to do the same. Greg ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] -display-map behavior in 1.3
Arrgh! Sorry, my bad. I must have been linked against an old version or something. When I recompiled the output went away. Greg On May 1, 2009, at 10:09 AM, Ralph Castain wrote: Interesting - I'm not seeing this behavior: graywolf54:trunk rhc$ mpirun -n 3 --xml --display-map hostname graywolf54.lanl.gov graywolf54.lanl.gov graywolf54.lanl.gov graywolf54:trunk rhc$ Can you tell me more about when you see this? Note that the display-map output should always appear on stderr because that is our default output device. On Fri, May 1, 2009 at 7:39 AM, Ralph Castain wrote: Hmmm...no, that's a bug. I'll fix it. Thanks! On Fri, May 1, 2009 at 7:24 AM, Greg Watson wrote: Ralph, I've just noticed that if I use '-xml -display-map', I get the xml version of the map and then the non-xml version is sent to stderr (wrapped in xml tags). Was this by design? In my view it would be better to suppress the non-xml map altogether if the -xml option is given. 1.4 seems to do the same. Greg ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] -display-map behavior in 1.3
Ralph, I've just noticed that if I use '-xml -display-map', I get the xml version of the map and then the non-xml version is sent to stderr (wrapped in xml tags). Was this by design? In my view it would be better to suppress the non-xml map altogether if the -xml option is given. 1.4 seems to do the same. Greg
Re: [OMPI devel] -display-map
Looks good now. Thanks! Greg On Jan 20, 2009, at 12:00 PM, Ralph Castain wrote: I'm embarrassed to admit that I never actually implemented the xml option for tag-output...this has been rectified with r20302. Let me know if that works for you - sorry for confusion. Ralph On Jan 20, 2009, at 8:08 AM, Greg Watson wrote: Ralph, The encapsulation is not quite right yet. I'm seeing this: [1,0]n = 0 [1,1]n = 0 but it should be: n = 0 n = 0 Thanks, Greg On Jan 20, 2009, at 9:20 AM, Ralph Castain wrote: You need to add --tag-output - this is a separate option as it applies both to xml and non-xml situations. If you like, I can force tag-output "on" by default whenever -xml is specified. Ralph On Jan 16, 2009, at 12:52 PM, Greg Watson wrote: Ralph, Is there something I need to do to enable stdout/err encapsulation (apart from -xml)? Here's what I see: $ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np 5 /Users/greg/Documents/workspace1/testMPI/Debug/testMPI n = 0 n = 0 n = 0 n = 0 n = 0 On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote: Okay, it is in the trunk as of r20284 - I'll file the request to have it moved to 1.3.1. Let me know if you get a chance to test the stdout/err stuff in the trunk - we should try and iterate it so any changes can make 1.3.1 as well. Thanks! Ralph On Jan 15, 2009, at 11:03 AM, Greg Watson wrote: Ralph, I think the second form would be ideal and would simplify things greatly. Greg On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote: Here is what I was able to do - note that the resolve messages are associated with the specific hostname, not the overall map: Will that work for you? If you like, I can remove the name= field from the noderesolve element since the info is specific to the host element that contains it. In other words, I can make it look like this: if that would help. Ralph On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote: We -may- be able to do a more formal XML output at some point. 
The problem will be the natural interleaving of stdout/err from the various procs due to the async behavior of MPI. Mpirun receives fragmented output in the forwarding system, limited by the buffer sizes and the amount of data we can read at any one "bite" from the pipes connecting us to the procs. So even though the user -thinks- they output a single large line of stuff, it may show up at mpirun as a series of fragments. Hence, it gets tricky to know how to put appropriate XML brackets around it. Given this input about when you actually want resolved name info, I can at least do something about that area. Won't be in 1.3.0, but should make 1.3.1. As for XML-tagged stdout/err: the OMPI community asked me not to turn that feature "on" for 1.3.0 as they felt it hasn't been adequately tested yet. The code is present, but cannot be activated in 1.3.0. However, I believe it is activated on the trunk when you do --xml --tagged-output, so perhaps some testing will help us debug and validate it adequately for 1.3.1? Thanks Ralph On Jan 14, 2009, at 7:02 AM, Greg Watson wrote: Ralph, The only time we use the resolved names is when we get a map, so we consider them part of the map output. If quasi-XML is all that will ever be possible with 1.3, then you may as well leave as-is and we will attempt to clean it up in Eclipse. It would be nice if a future version of ompi could output correct XML (including stdout) as this would vastly simplify the parsing we need to do. Regards, Greg On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote: Hmmm...well, I can't do either for 1.3.0 as it is departing this afternoon. The first option would be very hard to do. I would have to expose the display-map option across the code base and check it prior to printing anything about resolving node names. I guess I should ask: do you only want noderesolve statements when we are displaying the map? Right now, I will output them regardless. The second option could be done. 
I could check if any "display" option has been specified, and output the root at that time (likewise for the end). Anything we output in-between would be encapsulated between the two, but that would include any user output to stdout and/or stderr - which for 1.3.0 is not in xml. Any thoughts? Ralph PS. Guess I should clarify that I was not striving for true XML interaction here, but rather a qua
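One way to cope with the fragmentation Ralph describes above is to buffer partial pipe reads per process and wrap only complete lines in tags. A sketch in Python, with an illustrative element format (not OMPI's actual output) and payload escaping omitted for brevity:

```python
class RankStreamTagger:
    """Buffer fragmented pipe reads per rank; emit a tagged element only for
    each complete line, so XML tags never split a line of user output.

    The <stdout rank="N"> element is illustrative, not OMPI's real format.
    """

    def __init__(self):
        self.buffers = {}  # rank -> pending partial line

    def feed(self, rank, fragment):
        """Absorb one read from a rank's pipe; return tagged complete lines."""
        data = self.buffers.get(rank, "") + fragment
        *complete, remainder = data.split("\n")
        self.buffers[rank] = remainder  # hold the trailing partial line
        return ['<stdout rank="%d">%s</stdout>' % (rank, line) for line in complete]
```

A fragment that ends mid-line simply stays buffered until the rest of the line arrives, so a "single large line" that shows up as several fragments is still emitted inside one pair of tags.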
Re: [OMPI devel] -display-map
Ralph, The encapsulation is not quite right yet. I'm seeing this: [1,0]n = 0 [1,1]n = 0 but it should be: n = 0 n = 0 Thanks, Greg On Jan 20, 2009, at 9:20 AM, Ralph Castain wrote: You need to add --tag-output - this is a separate option as it applies both to xml and non-xml situations. If you like, I can force tag-output "on" by default whenever -xml is specified. Ralph On Jan 16, 2009, at 12:52 PM, Greg Watson wrote: Ralph, Is there something I need to do to enable stdout/err encapsulation (apart from -xml)? Here's what I see: $ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np 5 /Users/greg/Documents/workspace1/testMPI/Debug/testMPI n = 0 n = 0 n = 0 n = 0 n = 0 On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote: Okay, it is in the trunk as of r20284 - I'll file the request to have it moved to 1.3.1. Let me know if you get a chance to test the stdout/err stuff in the trunk - we should try and iterate it so any changes can make 1.3.1 as well. Thanks! Ralph On Jan 15, 2009, at 11:03 AM, Greg Watson wrote: Ralph, I think the second form would be ideal and would simplify things greatly. Greg On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote: Here is what I was able to do - note that the resolve messages are associated with the specific hostname, not the overall map: Will that work for you? If you like, I can remove the name= field from the noderesolve element since the info is specific to the host element that contains it. In other words, I can make it look like this: if that would help. Ralph On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote: We -may- be able to do a more formal XML output at some point. The problem will be the natural interleaving of stdout/err from the various procs due to the async behavior of MPI. Mpirun receives fragmented output in the forwarding system, limited by the buffer sizes and the amount of data we can read at any one "bite" from the pipes connecting us to the procs. 
So even though the user -thinks- they output a single large line of stuff, it may show up at mpirun as a series of fragments. Hence, it gets tricky to know how to put appropriate XML brackets around it. Given this input about when you actually want resolved name info, I can at least do something about that area. Won't be in 1.3.0, but should make 1.3.1. As for XML-tagged stdout/err: the OMPI community asked me not to turn that feature "on" for 1.3.0 as they felt it hasn't been adequately tested yet. The code is present, but cannot be activated in 1.3.0. However, I believe it is activated on the trunk when you do --xml --tagged-output, so perhaps some testing will help us debug and validate it adequately for 1.3.1? Thanks Ralph On Jan 14, 2009, at 7:02 AM, Greg Watson wrote: Ralph, The only time we use the resolved names is when we get a map, so we consider them part of the map output. If quasi-XML is all that will ever be possible with 1.3, then you may as well leave as-is and we will attempt to clean it up in Eclipse. It would be nice if a future version of ompi could output correct XML (including stdout) as this would vastly simplify the parsing we need to do. Regards, Greg On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote: Hmmm...well, I can't do either for 1.3.0 as it is departing this afternoon. The first option would be very hard to do. I would have to expose the display-map option across the code base and check it prior to printing anything about resolving node names. I guess I should ask: do you only want noderesolve statements when we are displaying the map? Right now, I will output them regardless. The second option could be done. I could check if any "display" option has been specified, and output the root at that time (likewise for the end). Anything we output in-between would be encapsulated between the two, but that would include any user output to stdout and/or stderr - which for 1.3.0 is not in xml. Any thoughts? Ralph PS. 
Guess I should clarify that I was not striving for true XML interaction here, but rather a quasi-XML format that would help you to filter the output. I have no problem trying to get to something more formally correct, but it could be tricky in some places to achieve it due to the inherent async nature of the beast. On Jan 13, 2009, at 12:17 PM, Greg Watson wrote: Ralph, The XML is looking better now, but ther
Re: [OMPI devel] -display-map
I don't think there's any reason we'd want stdout/err not to be encapsulated, so forcing tag-output makes sense. Greg On Jan 20, 2009, at 9:20 AM, Ralph Castain wrote: You need to add --tag-output - this is a separate option as it applies both to xml and non-xml situations. If you like, I can force tag-output "on" by default whenever -xml is specified. Ralph On Jan 16, 2009, at 12:52 PM, Greg Watson wrote: Ralph, Is there something I need to do to enable stdout/err encapsulation (apart from -xml)? Here's what I see: $ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np 5 /Users/greg/Documents/workspace1/testMPI/Debug/testMPI n = 0 n = 0 n = 0 n = 0 n = 0 On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote: Okay, it is in the trunk as of r20284 - I'll file the request to have it moved to 1.3.1. Let me know if you get a chance to test the stdout/err stuff in the trunk - we should try and iterate it so any changes can make 1.3.1 as well. Thanks! Ralph On Jan 15, 2009, at 11:03 AM, Greg Watson wrote: Ralph, I think the second form would be ideal and would simplify things greatly. Greg On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote: Here is what I was able to do - note that the resolve messages are associated with the specific hostname, not the overall map: Will that work for you? If you like, I can remove the name= field from the noderesolve element since the info is specific to the host element that contains it. In other words, I can make it look like this: if that would help. Ralph On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote: We -may- be able to do a more formal XML output at some point. The problem will be the natural interleaving of stdout/err from the various procs due to the async behavior of MPI. Mpirun receives fragmented output in the forwarding system, limited by the buffer sizes and the amount of data we can read at any one "bite" from the pipes connecting us to the procs. 
So even though the user -thinks- they output a single large line of stuff, it may show up at mpirun as a series of fragments. Hence, it gets tricky to know how to put appropriate XML brackets around it. Given this input about when you actually want resolved name info, I can at least do something about that area. Won't be in 1.3.0, but should make 1.3.1. As for XML-tagged stdout/err: the OMPI community asked me not to turn that feature "on" for 1.3.0 as they felt it hasn't been adequately tested yet. The code is present, but cannot be activated in 1.3.0. However, I believe it is activated on the trunk when you do --xml --tagged-output, so perhaps some testing will help us debug and validate it adequately for 1.3.1? Thanks Ralph On Jan 14, 2009, at 7:02 AM, Greg Watson wrote: Ralph, The only time we use the resolved names is when we get a map, so we consider them part of the map output. If quasi-XML is all that will ever be possible with 1.3, then you may as well leave as-is and we will attempt to clean it up in Eclipse. It would be nice if a future version of ompi could output correct XML (including stdout) as this would vastly simplify the parsing we need to do. Regards, Greg On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote: Hmmm...well, I can't do either for 1.3.0 as it is departing this afternoon. The first option would be very hard to do. I would have to expose the display-map option across the code base and check it prior to printing anything about resolving node names. I guess I should ask: do you only want noderesolve statements when we are displaying the map? Right now, I will output them regardless. The second option could be done. I could check if any "display" option has been specified, and output the root at that time (likewise for the end). Anything we output in-between would be encapsulated between the two, but that would include any user output to stdout and/or stderr - which for 1.3.0 is not in xml. Any thoughts? Ralph PS. 
Guess I should clarify that I was not striving for true XML interaction here, but rather a quasi-XML format that would help you to filter the output. I have no problem trying to get to something more formally correct, but it could be tricky in some places to achieve it due to the inherent async nature of the beast. On Jan 13, 2009, at 12:17 PM, Greg Watson wrote: Ralph, The XML is looking better now, but there is
Re: [OMPI devel] -display-map
Ralph, Is there something I need to do to enable stdout/err encapsulation (apart from -xml)? Here's what I see: $ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np 5 /Users/greg/Documents/workspace1/testMPI/Debug/testMPI n = 0 n = 0 n = 0 n = 0 n = 0 On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote: Okay, it is in the trunk as of r20284 - I'll file the request to have it moved to 1.3.1. Let me know if you get a chance to test the stdout/err stuff in the trunk - we should try and iterate it so any changes can make 1.3.1 as well. Thanks! Ralph On Jan 15, 2009, at 11:03 AM, Greg Watson wrote: Ralph, I think the second form would be ideal and would simplify things greatly. Greg On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote: Here is what I was able to do - note that the resolve messages are associated with the specific hostname, not the overall map: Will that work for you? If you like, I can remove the name= field from the noderesolve element since the info is specific to the host element that contains it. In other words, I can make it look like this: if that would help. Ralph On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote: We -may- be able to do a more formal XML output at some point. The problem will be the natural interleaving of stdout/err from the various procs due to the async behavior of MPI. Mpirun receives fragmented output in the forwarding system, limited by the buffer sizes and the amount of data we can read at any one "bite" from the pipes connecting us to the procs. So even though the user -thinks- they output a single large line of stuff, it may show up at mpirun as a series of fragments. Hence, it gets tricky to know how to put appropriate XML brackets around it. Given this input about when you actually want resolved name info, I can at least do something about that area. Won't be in 1.3.0, but should make 1.3.1. 
As for XML-tagged stdout/err: the OMPI community asked me not to turn that feature "on" for 1.3.0 as they felt it hasn't been adequately tested yet. The code is present, but cannot be activated in 1.3.0. However, I believe it is activated on the trunk when you do --xml --tagged-output, so perhaps some testing will help us debug and validate it adequately for 1.3.1? Thanks Ralph On Jan 14, 2009, at 7:02 AM, Greg Watson wrote: Ralph, The only time we use the resolved names is when we get a map, so we consider them part of the map output. If quasi-XML is all that will ever be possible with 1.3, then you may as well leave as-is and we will attempt to clean it up in Eclipse. It would be nice if a future version of ompi could output correct XML (including stdout) as this would vastly simplify the parsing we need to do. Regards, Greg On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote: Hmmm...well, I can't do either for 1.3.0 as it is departing this afternoon. The first option would be very hard to do. I would have to expose the display-map option across the code base and check it prior to printing anything about resolving node names. I guess I should ask: do you only want noderesolve statements when we are displaying the map? Right now, I will output them regardless. The second option could be done. I could check if any "display" option has been specified, and output the root at that time (likewise for the end). Anything we output in-between would be encapsulated between the two, but that would include any user output to stdout and/or stderr - which for 1.3.0 is not in xml. Any thoughts? Ralph PS. Guess I should clarify that I was not striving for true XML interaction here, but rather a quasi-XML format that would help you to filter the output. I have no problem trying to get to something more formally correct, but it could be tricky in some places to achieve it due to the inherent async nature of the beast. 
On Jan 13, 2009, at 12:17 PM, Greg Watson wrote: Ralph, The XML is looking better now, but there is still one problem. To be valid, there needs to be only one root element, but currently you don't have any (or many). So rather than: the XML should be:
Re: [OMPI devel] -display-map
FYI, if I configure with --with-platform=contrib/platform/lanl/macosx-dynamic the build succeeds. Greg On Jan 16, 2009, at 1:08 PM, Jeff Squyres wrote: Er... whoops. This looks like my mistake (I just recently added MPI_REDUCE_LOCAL to the trunk -- not v1.3). I could have sworn that I tested this on a Mac, multiple times. I'll test again... 
On Jan 16, 2009, at 12:58 PM, Greg Watson wrote: When I try to build trunk, it fails with: i_f77.lax/libmpi_f77_pmpi.a/pwin_unlock_f.o .libs/libmpi_f77.lax/libmpi_f77_pmpi.a/pwin_wait_f.o .libs/libmpi_f77.lax/libmpi_f77_pmpi.a/pwtick_f.o .libs/libmpi_f77.lax/libmpi_f77_pmpi.a/pwtime_f.o ../../../ompi/.libs/libmpi.0.0.0.dylib /usr/local/openmpi-1.4-devel/lib/libopen-rte.0.0.0.dylib /usr/local/openmpi-1.4-devel/lib/libopen-pal.0.0.0.dylib -install_name /usr/local/openmpi-1.4-devel/lib/libmpi_f77.0.dylib -compatibility_version 1 -current_version 1.0 ld: duplicate symbol _mpi_reduce_local_f in .libs/libmpi_f77.lax/libmpi_f77_pmpi.a/preduce_local_f.o and .libs/reduce_local_f.o collect2: ld returned 1 exit status make[3]: *** [libmpi_f77.la] Error 1 make[2]: *** [all-recursive] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all-recursive] Error 1 I'm using the default configure command (./configure --prefix=xxx) on Mac OS X 10.5. This works fine on the 1.3 branch. Greg On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote: Okay, it is in the trunk as of r20284 - I'll file the request to have it moved to 1.3.1. Let me know if you get a chance to test the stdout/err stuff in the trunk - we should try and iterate it so any changes can make 1.3.1 as well. Thanks! Ralph On Jan 15, 2009, at 11:03 AM, Greg Watson wrote: Ralph, I think the second form would be ideal and would simplify things greatly. Greg On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote: Here is what I was able to do - note that the resolve messages are associated with the specific hostname, not the overall map: Will that work for you? If you like, I can remove the name= field from the noderesolve element since the info is specific to the host element that contains it. In other words, I can make it look like this: if that would help. Ralph On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote: We -may- be able to do a more formal XML output at some point. 
The problem will be the natural interleaving of stdout/err from the various procs due to the async behavior of MPI. Mpirun receives fragmented output in the forwarding system, limited by the buffer sizes and the amount of data we can read at any one "bite" from the pipes connecting us to the procs. So even though the user -thinks- they output a single large line of stuff, it may show up at mpirun as a series of fragments. Hence, it gets tricky to know how to put appropriate XML brackets around it. Given this input about when you actually want resolved name info, I can at least do something about that area. Won't be in 1.3.0, but should make 1.3.1. As for XML-tagged stdout/err: the OMPI community asked me not to turn that feature "on" for 1.3.0 as they felt it hasn't been adequately tested yet. The code is present, but cannot be activated in 1.3.0. However, I believe it is activated on the trunk when you do --xml --tagged-output, so perhaps some testing will help us debug and validate it adequately for 1.3.1? Thanks Ralph On Jan 14, 2009, at 7:02 AM, Greg Watson wrote: Ralph, The only time we use the resolved names is when we get a map, so we consider them part of the map output. If quasi-XML is all that will ever be possible with 1.3, then you may as well leave as-is and we will attempt to clean it up in Eclipse. It would be nice if a
Re: [OMPI devel] -display-map
When I try to build trunk, it fails with:

i_f77.lax/libmpi_f77_pmpi.a/pwin_unlock_f.o .libs/libmpi_f77.lax/libmpi_f77_pmpi.a/pwin_wait_f.o .libs/libmpi_f77.lax/libmpi_f77_pmpi.a/pwtick_f.o .libs/libmpi_f77.lax/libmpi_f77_pmpi.a/pwtime_f.o ../../../ompi/.libs/libmpi.0.0.0.dylib /usr/local/openmpi-1.4-devel/lib/libopen-rte.0.0.0.dylib /usr/local/openmpi-1.4-devel/lib/libopen-pal.0.0.0.dylib -install_name /usr/local/openmpi-1.4-devel/lib/libmpi_f77.0.dylib -compatibility_version 1 -current_version 1.0
ld: duplicate symbol _mpi_reduce_local_f in .libs/libmpi_f77.lax/libmpi_f77_pmpi.a/preduce_local_f.o and .libs/reduce_local_f.o
collect2: ld returned 1 exit status
make[3]: *** [libmpi_f77.la] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

I'm using the default configure command (./configure --prefix=xxx) on Mac OS X 10.5. This works fine on the 1.3 branch.

Greg

On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote:

Okay, it is in the trunk as of r20284 - I'll file the request to have it moved to 1.3.1. Let me know if you get a chance to test the stdout/err stuff in the trunk - we should try and iterate it so any changes can make 1.3.1 as well.

Thanks!
Ralph

On Jan 15, 2009, at 11:03 AM, Greg Watson wrote:

Ralph, I think the second form would be ideal and would simplify things greatly.

Greg

On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote:

Here is what I was able to do - note that the resolve messages are associated with the specific hostname, not the overall map: Will that work for you? If you like, I can remove the name= field from the noderesolve element since the info is specific to the host element that contains it. In other words, I can make it look like this: if that would help.

Ralph

On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote:

We -may- be able to do a more formal XML output at some point. The problem will be the natural interleaving of stdout/err from the various procs due to the async behavior of MPI. Mpirun receives fragmented output in the forwarding system, limited by the buffer sizes and the amount of data we can read at any one "bite" from the pipes connecting us to the procs. So even though the user -thinks- they output a single large line of stuff, it may show up at mpirun as a series of fragments. Hence, it gets tricky to know how to put appropriate XML brackets around it.

Given this input about when you actually want resolved name info, I can at least do something about that area. Won't be in 1.3.0, but should make 1.3.1.

As for XML-tagged stdout/err: the OMPI community asked me not to turn that feature "on" for 1.3.0 as they felt it hasn't been adequately tested yet. The code is present, but cannot be activated in 1.3.0. However, I believe it is activated on the trunk when you do --xml --tagged-output, so perhaps some testing will help us debug and validate it adequately for 1.3.1?

Thanks
Ralph

On Jan 14, 2009, at 7:02 AM, Greg Watson wrote:

Ralph, The only time we use the resolved names is when we get a map, so we consider them part of the map output. If quasi-XML is all that will ever be possible with 1.3, then you may as well leave as-is and we will attempt to clean it up in Eclipse. It would be nice if a future version of ompi could output correct XML (including stdout) as this would vastly simplify the parsing we need to do.

Regards,
Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:

Hmmm...well, I can't do either for 1.3.0 as it is departing this afternoon. The first option would be very hard to do. I would have to expose the display-map option across the code base and check it prior to printing anything about resolving node names. I guess I should ask: do you only want noderesolve statements when we are displaying the map? Right now, I will output them regardless. The second option could be done. I could check if any "display" option has been specified, and output the root at that time (likewise for the end). Anything we output in-between would be encapsulated between the two, but that would include any user output to stdout and/or stderr - which for 1.3.0 is not in xml. Any thoughts?

Ralph

PS. Guess I should clarify that I was not striving for true XML interaction here, but rather a quasi-XML format that would help you to filter the output. I have no problem trying to get to something more formally correct, but it could be tricky in some places to achieve it due to the inherent async nature of the beast.

On Jan 13, 2009, at 12:17 PM, Greg Watson wrote:

Ralph, The XML is looking better now, but there is still one
Re: [OMPI devel] -display-map
Ralph, I think the second form would be ideal and would simplify things greatly.

Greg

On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote:

Here is what I was able to do - note that the resolve messages are associated with the specific hostname, not the overall map: Will that work for you? If you like, I can remove the name= field from the noderesolve element since the info is specific to the host element that contains it. In other words, I can make it look like this: if that would help.

Ralph

On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote:

We -may- be able to do a more formal XML output at some point. The problem will be the natural interleaving of stdout/err from the various procs due to the async behavior of MPI. Mpirun receives fragmented output in the forwarding system, limited by the buffer sizes and the amount of data we can read at any one "bite" from the pipes connecting us to the procs. So even though the user -thinks- they output a single large line of stuff, it may show up at mpirun as a series of fragments. Hence, it gets tricky to know how to put appropriate XML brackets around it.

Given this input about when you actually want resolved name info, I can at least do something about that area. Won't be in 1.3.0, but should make 1.3.1.

As for XML-tagged stdout/err: the OMPI community asked me not to turn that feature "on" for 1.3.0 as they felt it hasn't been adequately tested yet. The code is present, but cannot be activated in 1.3.0. However, I believe it is activated on the trunk when you do --xml --tagged-output, so perhaps some testing will help us debug and validate it adequately for 1.3.1?

Thanks
Ralph

On Jan 14, 2009, at 7:02 AM, Greg Watson wrote:

Ralph, The only time we use the resolved names is when we get a map, so we consider them part of the map output. If quasi-XML is all that will ever be possible with 1.3, then you may as well leave as-is and we will attempt to clean it up in Eclipse. It would be nice if a future version of ompi could output correct XML (including stdout) as this would vastly simplify the parsing we need to do.

Regards,
Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:

Hmmm...well, I can't do either for 1.3.0 as it is departing this afternoon. The first option would be very hard to do. I would have to expose the display-map option across the code base and check it prior to printing anything about resolving node names. I guess I should ask: do you only want noderesolve statements when we are displaying the map? Right now, I will output them regardless. The second option could be done. I could check if any "display" option has been specified, and output the root at that time (likewise for the end). Anything we output in-between would be encapsulated between the two, but that would include any user output to stdout and/or stderr - which for 1.3.0 is not in xml. Any thoughts?

Ralph

PS. Guess I should clarify that I was not striving for true XML interaction here, but rather a quasi-XML format that would help you to filter the output. I have no problem trying to get to something more formally correct, but it could be tricky in some places to achieve it due to the inherent async nature of the beast.

On Jan 13, 2009, at 12:17 PM, Greg Watson wrote:

Ralph, The XML is looking better now, but there is still one problem. To be valid, there needs to be only one root element, but currently you don't have any (or many). So rather than: the XML should be: or: Would either of these be possible?

Thanks,
Greg

On Dec 8, 2008, at 2:18 PM, Greg Watson wrote:

Ok thanks. I'll test from trunk in future.

Greg

On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote:

Working its way around the CMR process now. Might be easier in the future if we could test/debug this in the trunk, though. Otherwise, the CMR procedure will fall behind and a fix might miss a release window. Anyway, hopefully this one will make the 1.3.0 release cutoff.

Thanks
Ralph

On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:

Hi Ralph, This is now in 1.3rc2, thanks. However there are a couple of problems. Here is what I see:

[Jarrah.watson.ibm.com:58957] resolved="Jarrah.watson.ibm.com">

For some reason each
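Greg's single-root requirement can be satisfied mechanically on the consumer side: collect mpirun's quasi-XML fragments and wrap them in one enclosing element before handing them to a standard parser. A minimal sketch, assuming the fragments are individually well-formed (the `mpirun` root tag name here is an invention for the example, not OMPI's actual choice):

```python
import xml.etree.ElementTree as ET

def wrap_in_root(fragments, root_tag="mpirun"):
    """Wrap a sequence of sibling XML fragments in a single root element.

    XML requires exactly one root element per document; mpirun's quasi-XML
    emits many top-level siblings, so a consumer can add the root itself.
    The root_tag name is hypothetical. Raises ParseError if any fragment
    is malformed, which doubles as a validity check.
    """
    body = "\n".join(fragments)
    doc = "<%s>\n%s\n</%s>" % (root_tag, body, root_tag)
    ET.fromstring(doc)  # parse to confirm the result is valid XML
    return doc

doc = wrap_in_root(['<noderesolve resolved="node0"/>',
                    '<map><host name="node0"/></map>'])
```

This is essentially the second option discussed in the thread (emit the root when any "display" option is active), done after the fact by the tool that reads the output.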
Re: [OMPI devel] -display-map
Ralph, The only time we use the resolved names is when we get a map, so we consider them part of the map output. If quasi-XML is all that will ever be possible with 1.3, then you may as well leave as-is and we will attempt to clean it up in Eclipse. It would be nice if a future version of ompi could output correct XML (including stdout) as this would vastly simplify the parsing we need to do.

Regards,
Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:

Hmmm...well, I can't do either for 1.3.0 as it is departing this afternoon. The first option would be very hard to do. I would have to expose the display-map option across the code base and check it prior to printing anything about resolving node names. I guess I should ask: do you only want noderesolve statements when we are displaying the map? Right now, I will output them regardless. The second option could be done. I could check if any "display" option has been specified, and output the root at that time (likewise for the end). Anything we output in-between would be encapsulated between the two, but that would include any user output to stdout and/or stderr - which for 1.3.0 is not in xml. Any thoughts?

Ralph

PS. Guess I should clarify that I was not striving for true XML interaction here, but rather a quasi-XML format that would help you to filter the output. I have no problem trying to get to something more formally correct, but it could be tricky in some places to achieve it due to the inherent async nature of the beast.

On Jan 13, 2009, at 12:17 PM, Greg Watson wrote:

Ralph, The XML is looking better now, but there is still one problem. To be valid, there needs to be only one root element, but currently you don't have any (or many). So rather than: the XML should be: or: Would either of these be possible?

Thanks,
Greg

On Dec 8, 2008, at 2:18 PM, Greg Watson wrote:

Ok thanks. I'll test from trunk in future.

Greg

On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote:

Working its way around the CMR process now. Might be easier in the future if we could test/debug this in the trunk, though. Otherwise, the CMR procedure will fall behind and a fix might miss a release window. Anyway, hopefully this one will make the 1.3.0 release cutoff.

Thanks
Ralph

On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:

Hi Ralph, This is now in 1.3rc2, thanks. However there are a couple of problems. Here is what I see:

[Jarrah.watson.ibm.com:58957] resolved="Jarrah.watson.ibm.com">

For some reason each line is prefixed with "[...]", any idea why this is? Also the end tag should be "/>" not ">".

Thanks,
Greg

On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:

Great, thanks. I'll take a look once it comes over to 1.3.

Cheers,
Greg

On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:

Yo Greg

This is in the trunk as of r20032. I'll bring it over to 1.3 in a few days. I implemented it as another MCA param "orte_show_resolved_nodenames" so you can actually get the info as you execute the job, if you want. The xml tag is "noderesolve" - let me know if you need any changes.

Ralph

On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:

Ralph, I guess the issue for us is that we will have to run two commands to get the information we need. One to get the configuration information, such as version and MCA parameters, and one to get the host information, whereas it would seem more logical that this should all be available via some kind of "configuration discovery" command. I understand the issue with supplying the hostfile though, so maybe this just points at the need for us to separate configuration information from the host information. In any case, we'll work with what you think is best.

Greg

On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:

Hmmm...just to be sure we are all clear on this. The reason we proposed to use mpirun is that "hostfile" has no meaning outside of mpirun. That's why ompi_info can't do anything in this regard. We have no idea what hostfile the user may specify until we actually get the mpirun cmd line. They may have specified a default-hostfile, but they could also specify hostfiles for the individual app_contexts. These may or may not include the node upon which mpirun is executing. So the only way to provide you with a separate command to get a hostfile<->nodename mapping would require you to provide us with the default-hostfile and/or hostfile cmd line options just as if you were issuing the mpirun cmd.
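The two output glitches Greg reports earlier in this thread - the "[host:pid]" prefix on each line and the noderesolve element ending in ">" where "/>" is needed - can be scrubbed by the consumer. A hedged sketch; the prefix pattern is taken from the sample output in the thread, and attributing it to OMPI's logging layer is an assumption:

```python
import re

# Matches a "[hostname:pid] " prefix like "[Jarrah.watson.ibm.com:58957] ".
# (Pattern inferred from the sample output quoted in this thread.)
PREFIX = re.compile(r"^\[[^\]\s]+:\d+\]\s*")

def scrub(line):
    """Strip the [host:pid] prefix and self-close a dangling noderesolve tag."""
    line = PREFIX.sub("", line)
    # Repair '...">' endings that should be self-closing '..."/>'.
    if line.startswith("<noderesolve") and line.endswith('">'):
        line = line[:-1] + "/>"
    return line

print(scrub('[Jarrah.watson.ibm.com:58957] <noderesolve resolved="Jarrah.watson.ibm.com">'))
# prints: <noderesolve resolved="Jarrah.watson.ibm.com"/>
```

Both glitches were of course fixed upstream eventually; a scrubber like this is only a workaround for output from releases that still exhibit them.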
Re: [OMPI devel] -display-map
Ralph, The XML is looking better now, but there is still one problem. To be valid, there needs to be only one root element, but currently you don't have any (or many). So rather than: the XML should be: or: Would either of these be possible?

Thanks,
Greg

On Dec 8, 2008, at 2:18 PM, Greg Watson wrote:

Ok thanks. I'll test from trunk in future.

Greg

On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote:

Working its way around the CMR process now. Might be easier in the future if we could test/debug this in the trunk, though. Otherwise, the CMR procedure will fall behind and a fix might miss a release window. Anyway, hopefully this one will make the 1.3.0 release cutoff.

Thanks
Ralph

On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:

Hi Ralph, This is now in 1.3rc2, thanks. However there are a couple of problems. Here is what I see:

[Jarrah.watson.ibm.com:58957] resolved="Jarrah.watson.ibm.com">

For some reason each line is prefixed with "[...]", any idea why this is? Also the end tag should be "/>" not ">".

Thanks,
Greg

On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:

Great, thanks. I'll take a look once it comes over to 1.3.

Cheers,
Greg

On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:

Yo Greg

This is in the trunk as of r20032. I'll bring it over to 1.3 in a few days. I implemented it as another MCA param "orte_show_resolved_nodenames" so you can actually get the info as you execute the job, if you want. The xml tag is "noderesolve" - let me know if you need any changes.

Ralph

On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:

Ralph, I guess the issue for us is that we will have to run two commands to get the information we need. One to get the configuration information, such as version and MCA parameters, and one to get the host information, whereas it would seem more logical that this should all be available via some kind of "configuration discovery" command. I understand the issue with supplying the hostfile though, so maybe this just points at the need for us to separate configuration information from the host information. In any case, we'll work with what you think is best.

Greg

On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:

Hmmm...just to be sure we are all clear on this. The reason we proposed to use mpirun is that "hostfile" has no meaning outside of mpirun. That's why ompi_info can't do anything in this regard. We have no idea what hostfile the user may specify until we actually get the mpirun cmd line. They may have specified a default-hostfile, but they could also specify hostfiles for the individual app_contexts. These may or may not include the node upon which mpirun is executing. So the only way to provide you with a separate command to get a hostfile<->nodename mapping would require you to provide us with the default-hostfile and/or hostfile cmd line options just as if you were issuing the mpirun cmd. We just wouldn't launch - but it would be the exact equivalent of doing "mpirun --do-not-launch". Am I missing something? If so, please do correct me - I would be happy to provide a tool if that would make it easier. Just not sure what that tool would do.

Thanks
Ralph

On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:

Ralph, It seems a little strange to be using mpirun for this, but barring providing a separate command, or using ompi_info, I think this would solve our problem.

Thanks,
Greg

On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:

Sorry for delay - had to ponder this one for awhile. Jeff and I agree that adding something to ompi_info would not be a good idea. Ompi_info has no knowledge or understanding of hostfiles, and adding that capability to it would be a major distortion of its intended use. However, we think we can offer an alternative that might better solve the problem. Remember, we now treat hostfiles in a very different manner than before - see the wiki page for a complete description, or "man orte_hosts". So the problem is that, to provide you with what you want, we need to "dump" the information from whatever default-hostfile was provided, and, if no default-hostfile was provided, then the information from each hostfile that was provided with an app_context. The best way we could think of to do this is to add another mpirun cmd line option --dump-hostfiles that would output the line-by-line name from the hostfile plus the name we resolved it to.
Re: [OMPI devel] orte_default_hostfile
Hi Ralph, I think mainly because it simplifies installation of ompi for PTP users. Since PTP uses the hostfile to display the system configuration, we're pretty much always going to have one (although PTP does work without it, feedback is more limited). It's much easier for people to add a list of hosts to a file than to have to go and add something to the param file as well (it's hard enough to get them to do the former correctly).

Greg

On Dec 15, 2008, at 1:51 PM, Ralph Castain wrote:

Can you help me understand something here? I'm not opposed to making the change - just puzzled as to why the value of the default hostfile name is of concern to Eclipse. There is one reason not to set a default name - it causes us to open and read that file, even though nobody ever put something in it. Remember, we distribute and install an empty default hostfile that contains instructions on how to build one, so it will always exist. Since the name of the default hostfile can be set in the default MCA param file, environment, or cmd line, there didn't seem to be any real reason to define some special name. It isn't a big deal, though, so I don't really care that much. But I would like to understand why Eclipse cares so we can factor that into any future thinking.

Ralph

On Dec 12, 2008, at 7:11 AM, Greg Watson wrote:

From our perspective, it would be good if it could default to the old behavior (in 1.3 if possible).

Thanks,
Greg

On Dec 8, 2008, at 11:42 AM, Ralph Castain wrote:

I don't think there was any overt thought given to it, at least not on my part. I suspect it came about because (a) the wiki defining hostfile behavior made no mention of the default value, (b) I may have overlooked the prior default when rewriting that code, and (c) since we now have default-hostfile as well as hostfile, it could be I didn't default the name since it isn't clear which one should get the default. I honestly don't remember - this has been in the code base for a really long time now.

I have no iron in this fire - as you know, all of our environs here are managed. So I guess I'll throw it out there to the community: do we want --default-hostfile to have a default value?

Pros: it could be considered a continuation of 1.2's hostfile behavior.

Cons: we treat hostfile in a totally different way in 1.3. We now have two hostfiles: a default that applies to all app_contexts, and a hostfile that applies to only one app_context. It would seem that the default-hostfile best aligns with the old "hostfile" behavior, but could lead to some confusion in its new usage.

Any preferences/thoughts?

Ralph

On Dec 5, 2008, at 9:15 AM, Greg Watson wrote:

Hi, In 1.2.x, the rds_hostfile_path parameter pointed to openmpi-default-hostfile by default. This parameter has been replaced with orte_default_hostfile in 1.3, but now it defaults to . Was there any particular reason for the default value to change?

Greg

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] orte_default_hostfile
From our perspective, it would be good if it could default to the old behavior (in 1.3 if possible).

Thanks,
Greg

On Dec 8, 2008, at 11:42 AM, Ralph Castain wrote:

I don't think there was any overt thought given to it, at least not on my part. I suspect it came about because (a) the wiki defining hostfile behavior made no mention of the default value, (b) I may have overlooked the prior default when rewriting that code, and (c) since we now have default-hostfile as well as hostfile, it could be I didn't default the name since it isn't clear which one should get the default. I honestly don't remember - this has been in the code base for a really long time now.

I have no iron in this fire - as you know, all of our environs here are managed. So I guess I'll throw it out there to the community: do we want --default-hostfile to have a default value?

Pros: it could be considered a continuation of 1.2's hostfile behavior.

Cons: we treat hostfile in a totally different way in 1.3. We now have two hostfiles: a default that applies to all app_contexts, and a hostfile that applies to only one app_context. It would seem that the default-hostfile best aligns with the old "hostfile" behavior, but could lead to some confusion in its new usage.

Any preferences/thoughts?

Ralph

On Dec 5, 2008, at 9:15 AM, Greg Watson wrote:

Hi, In 1.2.x, the rds_hostfile_path parameter pointed to openmpi-default-hostfile by default. This parameter has been replaced with orte_default_hostfile in 1.3, but now it defaults to . Was there any particular reason for the default value to change?

Greg
Re: [OMPI devel] -display-map
Ok thanks. I'll test from trunk in future.

Greg

On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote:

Working its way around the CMR process now. Might be easier in the future if we could test/debug this in the trunk, though. Otherwise, the CMR procedure will fall behind and a fix might miss a release window. Anyway, hopefully this one will make the 1.3.0 release cutoff.

Thanks
Ralph

On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:

Hi Ralph, This is now in 1.3rc2, thanks. However there are a couple of problems. Here is what I see:

[Jarrah.watson.ibm.com:58957] resolved="Jarrah.watson.ibm.com">

For some reason each line is prefixed with "[...]", any idea why this is? Also the end tag should be "/>" not ">".

Thanks,
Greg

On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:

Great, thanks. I'll take a look once it comes over to 1.3.

Cheers,
Greg

On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:

Yo Greg

This is in the trunk as of r20032. I'll bring it over to 1.3 in a few days. I implemented it as another MCA param "orte_show_resolved_nodenames" so you can actually get the info as you execute the job, if you want. The xml tag is "noderesolve" - let me know if you need any changes.

Ralph

On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:

Ralph, I guess the issue for us is that we will have to run two commands to get the information we need. One to get the configuration information, such as version and MCA parameters, and one to get the host information, whereas it would seem more logical that this should all be available via some kind of "configuration discovery" command. I understand the issue with supplying the hostfile though, so maybe this just points at the need for us to separate configuration information from the host information. In any case, we'll work with what you think is best.

Greg

On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:

Hmmm...just to be sure we are all clear on this. The reason we proposed to use mpirun is that "hostfile" has no meaning outside of mpirun. That's why ompi_info can't do anything in this regard. We have no idea what hostfile the user may specify until we actually get the mpirun cmd line. They may have specified a default-hostfile, but they could also specify hostfiles for the individual app_contexts. These may or may not include the node upon which mpirun is executing. So the only way to provide you with a separate command to get a hostfile<->nodename mapping would require you to provide us with the default-hostfile and/or hostfile cmd line options just as if you were issuing the mpirun cmd. We just wouldn't launch - but it would be the exact equivalent of doing "mpirun --do-not-launch". Am I missing something? If so, please do correct me - I would be happy to provide a tool if that would make it easier. Just not sure what that tool would do.

Thanks
Ralph

On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:

Ralph, It seems a little strange to be using mpirun for this, but barring providing a separate command, or using ompi_info, I think this would solve our problem.

Thanks,
Greg

On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:

Sorry for delay - had to ponder this one for awhile. Jeff and I agree that adding something to ompi_info would not be a good idea. Ompi_info has no knowledge or understanding of hostfiles, and adding that capability to it would be a major distortion of its intended use. However, we think we can offer an alternative that might better solve the problem. Remember, we now treat hostfiles in a very different manner than before - see the wiki page for a complete description, or "man orte_hosts". So the problem is that, to provide you with what you want, we need to "dump" the information from whatever default-hostfile was provided, and, if no default-hostfile was provided, then the information from each hostfile that was provided with an app_context. The best way we could think of to do this is to add another mpirun cmd line option --dump-hostfiles that would output the line-by-line name from the hostfile plus the name we resolved it to. Of course, --xml would cause it to be in xml format. Would that meet your needs?

Ralph

On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:

Hi Ralph, We've been discussing this back and forth a bit internally and don't really see an easy solution. Our problem is that Eclipse is not running on the head node, so gethostbyname will not necessarily resolve to the same address. For example, the hostfile might refer to the head node by an internal network address that is not visible to the outside world. Since gethostname also looks in /etc/hosts, it may resolve locally but not on a remote system. The only thing I can think of would be, rather than us reading the hostfile directly as we do now, to provide an option to ompi_info that would dump the hostfile using the same rules that you apply when you're using the hostfile.
Re: [OMPI devel] -display-map
Hi Ralph, This is now in 1.3rc2, thanks. However there are a couple of problems. Here is what I see:

[Jarrah.watson.ibm.com:58957] resolved="Jarrah.watson.ibm.com">

For some reason each line is prefixed with "[...]", any idea why this is? Also the end tag should be "/>" not ">".

Thanks,
Greg

On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:

Great, thanks. I'll take a look once it comes over to 1.3.

Cheers,
Greg

On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:

Yo Greg

This is in the trunk as of r20032. I'll bring it over to 1.3 in a few days. I implemented it as another MCA param "orte_show_resolved_nodenames" so you can actually get the info as you execute the job, if you want. The xml tag is "noderesolve" - let me know if you need any changes.

Ralph

On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:

Ralph, I guess the issue for us is that we will have to run two commands to get the information we need. One to get the configuration information, such as version and MCA parameters, and one to get the host information, whereas it would seem more logical that this should all be available via some kind of "configuration discovery" command. I understand the issue with supplying the hostfile though, so maybe this just points at the need for us to separate configuration information from the host information. In any case, we'll work with what you think is best.

Greg

On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:

Hmmm...just to be sure we are all clear on this. The reason we proposed to use mpirun is that "hostfile" has no meaning outside of mpirun. That's why ompi_info can't do anything in this regard. We have no idea what hostfile the user may specify until we actually get the mpirun cmd line. They may have specified a default-hostfile, but they could also specify hostfiles for the individual app_contexts. These may or may not include the node upon which mpirun is executing. So the only way to provide you with a separate command to get a hostfile<->nodename mapping would require you to provide us with the default-hostfile and/or hostfile cmd line options just as if you were issuing the mpirun cmd. We just wouldn't launch - but it would be the exact equivalent of doing "mpirun --do-not-launch". Am I missing something? If so, please do correct me - I would be happy to provide a tool if that would make it easier. Just not sure what that tool would do.

Thanks
Ralph

On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:

Ralph, It seems a little strange to be using mpirun for this, but barring providing a separate command, or using ompi_info, I think this would solve our problem.

Thanks,
Greg

On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:

Sorry for delay - had to ponder this one for awhile. Jeff and I agree that adding something to ompi_info would not be a good idea. Ompi_info has no knowledge or understanding of hostfiles, and adding that capability to it would be a major distortion of its intended use. However, we think we can offer an alternative that might better solve the problem. Remember, we now treat hostfiles in a very different manner than before - see the wiki page for a complete description, or "man orte_hosts". So the problem is that, to provide you with what you want, we need to "dump" the information from whatever default-hostfile was provided, and, if no default-hostfile was provided, then the information from each hostfile that was provided with an app_context. The best way we could think of to do this is to add another mpirun cmd line option --dump-hostfiles that would output the line-by-line name from the hostfile plus the name we resolved it to. Of course, --xml would cause it to be in xml format. Would that meet your needs?

Ralph

On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:

Hi Ralph, We've been discussing this back and forth a bit internally and don't really see an easy solution. Our problem is that Eclipse is not running on the head node, so gethostbyname will not necessarily resolve to the same address. For example, the hostfile might refer to the head node by an internal network address that is not visible to the outside world. Since gethostname also looks in /etc/hosts, it may resolve locally but not on a remote system. The only thing I can think of would be, rather than us reading the hostfile directly as we do now, to provide an option to ompi_info that would dump the hostfile using the same rules that you apply when you're using the hostfile. Would that be feasible?

Greg

On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:

Sorry for delay - was on vacation and am now trying to work my way back to the surface. I'm not sure I can fix this one for t
[OMPI devel] orte_default_hostfile
Hi, In 1.2.x, the rds_hostfile_path parameter pointed to openmpi-default-hostfile by default. This parameter has been replaced with orte_default_hostfile in 1.3, but now it defaults to . Was there any particular reason for the default value to change?

Greg
Re: [OMPI devel] -display-map
Ralph, will this be in 1.3rc1? Thanks, Greg On Nov 24, 2008, at 3:06 PM, Greg Watson wrote: Great, thanks. I'll take a look once it comes over to 1.3. Cheers, Greg On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote: Yo Greg This is in the trunk as of r20032. I'll bring it over to 1.3 in a few days. I implemented it as another MCA param "orte_show_resolved_nodenames" so you can actually get the info as you execute the job, if you want. The xml tag is "noderesolve" - let me know if you need any changes. Ralph On Oct 22, 2008, at 11:55 AM, Greg Watson wrote: Ralph, I guess the issue for us is that we will have to run two commands to get the information we need. One to get the configuration information, such as version and MCA parameters, and one to get the host information, whereas it would seem more logical that this should all be available via some kind of "configuration discovery" command. I understand the issue with supplying the hostfile though, so maybe this just points at the need for us to separate configuration information from the host information. In any case, we'll work with what you think is best. Greg On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote: Hmmm...just to be sure we are all clear on this. The reason we proposed to use mpirun is that "hostfile" has no meaning outside of mpirun. That's why ompi_info can't do anything in this regard. We have no idea what hostfile the user may specify until we actually get the mpirun cmd line. They may have specified a default-hostfile, but they could also specify hostfiles for the individual app_contexts. These may or may not include the node upon which mpirun is executing. So the only way to provide you with a separate command to get a hostfile<->nodename mapping would require you to provide us with the default-hostfile and/or hostfile cmd line options just as if you were issuing the mpirun cmd. We just wouldn't launch - but it would be the exact equivalent of doing "mpirun --do-not-launch". Am I missing something?
If so, please do correct me - I would be happy to provide a tool if that would make it easier. Just not sure what that tool would do. Thanks Ralph On Oct 19, 2008, at 1:59 PM, Greg Watson wrote: Ralph, It seems a little strange to be using mpirun for this, but barring providing a separate command, or using ompi_info, I think this would solve our problem. Thanks, Greg On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote: Sorry for delay - had to ponder this one for awhile. Jeff and I agree that adding something to ompi_info would not be a good idea. Ompi_info has no knowledge or understanding of hostfiles, and adding that capability to it would be a major distortion of its intended use. However, we think we can offer an alternative that might better solve the problem. Remember, we now treat hostfiles in a very different manner than before - see the wiki page for a complete description, or "man orte_hosts". So the problem is that, to provide you with what you want, we need to "dump" the information from whatever default-hostfile was provided, and, if no default-hostfile was provided, then the information from each hostfile that was provided with an app_context. The best way we could think of to do this is to add another mpirun cmd line option --dump-hostfiles that would output the line-by-line name from the hostfile plus the name we resolved it to. Of course, --xml would cause it to be in xml format. Would that meet your needs? Ralph On Oct 15, 2008, at 3:12 PM, Greg Watson wrote: Hi Ralph, We've been discussing this back and forth a bit internally and don't really see an easy solution. Our problem is that Eclipse is not running on the head node, so gethostbyname will not necessarily resolve to the same address. For example, the hostfile might refer to the head node by an internal network address that is not visible to the outside world. Since gethostbyname also looks in /etc/hosts, it may resolve locally but not on a remote system.
The only thing I can think of would be, rather than us reading the hostfile directly as we do now, to provide an option to ompi_info that would dump the hostfile using the same rules that you apply when you're using the hostfile. Would that be feasible? Greg On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote: Sorry for delay - was on vacation and am now trying to work my way back to the surface. I'm not sure I can fix this one for two reasons: 1. In general, OMPI doesn't really care what name is used for the node. However, the problem is that it needs to be consistent. In this case, ORTE has already used the name returned by gethostname to create its session directory structure long before mpirun reads a hostfile. This
Re: [OMPI devel] -display-map
Great, thanks. I'll take a look once it comes over to 1.3. Cheers, Greg On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote: Yo Greg This is in the trunk as of r20032. I'll bring it over to 1.3 in a few days. I implemented it as another MCA param "orte_show_resolved_nodenames" so you can actually get the info as you execute the job, if you want. The xml tag is "noderesolve" - let me know if you need any changes. Ralph On Oct 22, 2008, at 11:55 AM, Greg Watson wrote: Ralph, I guess the issue for us is that we will have to run two commands to get the information we need. One to get the configuration information, such as version and MCA parameters, and one to get the host information, whereas it would seem more logical that this should all be available via some kind of "configuration discovery" command. I understand the issue with supplying the hostfile though, so maybe this just points at the need for us to separate configuration information from the host information. In any case, we'll work with what you think is best. Greg On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote: Hmmm...just to be sure we are all clear on this. The reason we proposed to use mpirun is that "hostfile" has no meaning outside of mpirun. That's why ompi_info can't do anything in this regard. We have no idea what hostfile the user may specify until we actually get the mpirun cmd line. They may have specified a default-hostfile, but they could also specify hostfiles for the individual app_contexts. These may or may not include the node upon which mpirun is executing. So the only way to provide you with a separate command to get a hostfile<->nodename mapping would require you to provide us with the default-hostfile and/or hostfile cmd line options just as if you were issuing the mpirun cmd. We just wouldn't launch - but it would be the exact equivalent of doing "mpirun --do-not-launch". Am I missing something?
If so, please do correct me - I would be happy to provide a tool if that would make it easier. Just not sure what that tool would do. Thanks Ralph On Oct 19, 2008, at 1:59 PM, Greg Watson wrote: Ralph, It seems a little strange to be using mpirun for this, but barring providing a separate command, or using ompi_info, I think this would solve our problem. Thanks, Greg On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote: Sorry for delay - had to ponder this one for awhile. Jeff and I agree that adding something to ompi_info would not be a good idea. Ompi_info has no knowledge or understanding of hostfiles, and adding that capability to it would be a major distortion of its intended use. However, we think we can offer an alternative that might better solve the problem. Remember, we now treat hostfiles in a very different manner than before - see the wiki page for a complete description, or "man orte_hosts". So the problem is that, to provide you with what you want, we need to "dump" the information from whatever default-hostfile was provided, and, if no default-hostfile was provided, then the information from each hostfile that was provided with an app_context. The best way we could think of to do this is to add another mpirun cmd line option --dump-hostfiles that would output the line-by-line name from the hostfile plus the name we resolved it to. Of course, --xml would cause it to be in xml format. Would that meet your needs? Ralph On Oct 15, 2008, at 3:12 PM, Greg Watson wrote: Hi Ralph, We've been discussing this back and forth a bit internally and don't really see an easy solution. Our problem is that Eclipse is not running on the head node, so gethostbyname will not necessarily resolve to the same address. For example, the hostfile might refer to the head node by an internal network address that is not visible to the outside world. Since gethostbyname also looks in /etc/hosts, it may resolve locally but not on a remote system.
The only thing I can think of would be, rather than us reading the hostfile directly as we do now, to provide an option to ompi_info that would dump the hostfile using the same rules that you apply when you're using the hostfile. Would that be feasible? Greg On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote: Sorry for delay - was on vacation and am now trying to work my way back to the surface. I'm not sure I can fix this one for two reasons: 1. In general, OMPI doesn't really care what name is used for the node. However, the problem is that it needs to be consistent. In this case, ORTE has already used the name returned by gethostname to create its session directory structure long before mpirun reads a hostfile. This is why we retain the value from gethostname instead of allowing it to be overwritten by the name
Re: [OMPI devel] 1.3 release date?
Brad, Many thanks for the update. Greg On Oct 22, 2008, at 8:43 PM, Brad Benton wrote: Greg, Here is the latest schedule that we have for getting 1.3 out the door: https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3 Basically, this schedule sets Nov. 10 as the release date with a backup date of Nov. 17. Here is a bit more detail as to the release to beta and then to release candidate 1, prior to the general release (lifted from the wiki): 1.3 beta: Target: October 27, 2008 1.3 rc1: Target: November 3, 2008 1.3 release: Target: November 10, 2008 --Brad On Fri, Oct 17, 2008 at 5:38 AM, Jeff Squyres wrote: Greg -- I defer to George/Brad for plans of the specific release date. We hope to be feature complete by early next week. This clears the way for a "beta" release. Specifically, there's two things we're waiting for: 1. Some FT stuff that Tim/Josh think can be done by this weekend 2. A critical code review for a big openib BTL change that will be done when Pasha and I are at the Chicago Forum meeting on Monday On Oct 15, 2008, at 4:48 PM, Greg Watson wrote: Hi all, Has a release date been set for 1.3? Thanks, Greg ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems
Re: [OMPI devel] -display-map
Ralph, I guess the issue for us is that we will have to run two commands to get the information we need. One to get the configuration information, such as version and MCA parameters, and one to get the host information, whereas it would seem more logical that this should all be available via some kind of "configuration discovery" command. I understand the issue with supplying the hostfile though, so maybe this just points at the need for us to separate configuration information from the host information. In any case, we'll work with what you think is best. Greg On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote: Hmmm...just to be sure we are all clear on this. The reason we proposed to use mpirun is that "hostfile" has no meaning outside of mpirun. That's why ompi_info can't do anything in this regard. We have no idea what hostfile the user may specify until we actually get the mpirun cmd line. They may have specified a default-hostfile, but they could also specify hostfiles for the individual app_contexts. These may or may not include the node upon which mpirun is executing. So the only way to provide you with a separate command to get a hostfile<->nodename mapping would require you to provide us with the default-hostfile and/or hostfile cmd line options just as if you were issuing the mpirun cmd. We just wouldn't launch - but it would be the exact equivalent of doing "mpirun --do-not-launch". Am I missing something? If so, please do correct me - I would be happy to provide a tool if that would make it easier. Just not sure what that tool would do. Thanks Ralph On Oct 19, 2008, at 1:59 PM, Greg Watson wrote: Ralph, It seems a little strange to be using mpirun for this, but barring providing a separate command, or using ompi_info, I think this would solve our problem. Thanks, Greg On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote: Sorry for delay - had to ponder this one for awhile. Jeff and I agree that adding something to ompi_info would not be a good idea.
Ompi_info has no knowledge or understanding of hostfiles, and adding that capability to it would be a major distortion of its intended use. However, we think we can offer an alternative that might better solve the problem. Remember, we now treat hostfiles in a very different manner than before - see the wiki page for a complete description, or "man orte_hosts". So the problem is that, to provide you with what you want, we need to "dump" the information from whatever default-hostfile was provided, and, if no default-hostfile was provided, then the information from each hostfile that was provided with an app_context. The best way we could think of to do this is to add another mpirun cmd line option --dump-hostfiles that would output the line-by-line name from the hostfile plus the name we resolved it to. Of course, --xml would cause it to be in xml format. Would that meet your needs? Ralph On Oct 15, 2008, at 3:12 PM, Greg Watson wrote: Hi Ralph, We've been discussing this back and forth a bit internally and don't really see an easy solution. Our problem is that Eclipse is not running on the head node, so gethostbyname will not necessarily resolve to the same address. For example, the hostfile might refer to the head node by an internal network address that is not visible to the outside world. Since gethostbyname also looks in /etc/hosts, it may resolve locally but not on a remote system. The only thing I can think of would be, rather than us reading the hostfile directly as we do now, to provide an option to ompi_info that would dump the hostfile using the same rules that you apply when you're using the hostfile. Would that be feasible? Greg On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote: Sorry for delay - was on vacation and am now trying to work my way back to the surface. I'm not sure I can fix this one for two reasons: 1. In general, OMPI doesn't really care what name is used for the node. However, the problem is that it needs to be consistent.
In this case, ORTE has already used the name returned by gethostname to create its session directory structure long before mpirun reads a hostfile. This is why we retain the value from gethostname instead of allowing it to be overwritten by the name in whatever allocation we are given. Using the name in hostfile would require that I either find some way to remember any prior name, or that I tear down and rebuild the session directory tree - neither seems attractive nor simple (e.g., what happens when the user provides multiple entries in the hostfile for the node, each with a different IP address based on another interface in that node? Sounds crazy, but we have already seen it done - which one do I use?). 2. We don't actually store th
Re: [OMPI devel] -display-map
Ralph, It seems a little strange to be using mpirun for this, but barring providing a separate command, or using ompi_info, I think this would solve our problem. Thanks, Greg On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote: Sorry for delay - had to ponder this one for awhile. Jeff and I agree that adding something to ompi_info would not be a good idea. Ompi_info has no knowledge or understanding of hostfiles, and adding that capability to it would be a major distortion of its intended use. However, we think we can offer an alternative that might better solve the problem. Remember, we now treat hostfiles in a very different manner than before - see the wiki page for a complete description, or "man orte_hosts". So the problem is that, to provide you with what you want, we need to "dump" the information from whatever default-hostfile was provided, and, if no default-hostfile was provided, then the information from each hostfile that was provided with an app_context. The best way we could think of to do this is to add another mpirun cmd line option --dump-hostfiles that would output the line-by-line name from the hostfile plus the name we resolved it to. Of course, --xml would cause it to be in xml format. Would that meet your needs? Ralph On Oct 15, 2008, at 3:12 PM, Greg Watson wrote: Hi Ralph, We've been discussing this back and forth a bit internally and don't really see an easy solution. Our problem is that Eclipse is not running on the head node, so gethostbyname will not necessarily resolve to the same address. For example, the hostfile might refer to the head node by an internal network address that is not visible to the outside world. Since gethostbyname also looks in /etc/hosts, it may resolve locally but not on a remote system. The only thing I can think of would be, rather than us reading the hostfile directly as we do now, to provide an option to ompi_info that would dump the hostfile using the same rules that you apply when you're using the hostfile.
Would that be feasible? Greg On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote: Sorry for delay - was on vacation and am now trying to work my way back to the surface. I'm not sure I can fix this one for two reasons: 1. In general, OMPI doesn't really care what name is used for the node. However, the problem is that it needs to be consistent. In this case, ORTE has already used the name returned by gethostname to create its session directory structure long before mpirun reads a hostfile. This is why we retain the value from gethostname instead of allowing it to be overwritten by the name in whatever allocation we are given. Using the name in hostfile would require that I either find some way to remember any prior name, or that I tear down and rebuild the session directory tree - neither seems attractive nor simple (e.g., what happens when the user provides multiple entries in the hostfile for the node, each with a different IP address based on another interface in that node? Sounds crazy, but we have already seen it done - which one do I use?). 2. We don't actually store the hostfile info anywhere - we just use it and forget it. For us to add an XML attribute containing any hostfile-related info would therefore require us to re-read the hostfile. I could have it do that -only- in the case of "XML output required", but it seems rather ugly. An alternative might be for you to simply do a "gethostbyname" lookup of the IP address or hostname to see if it matches instead of just doing a strcmp. This is what we have to do internally as we frequently have problems with FQDN vs. non-FQDN vs. IP addresses etc. If the local OS hasn't cached the IP address for the node in question it can take a little time to DNS resolve it, but otherwise works fine. I can point you to the code in OPAL that we use - I would think something similar would be easy to implement in your code and would readily solve the problem. 
Ralph On Sep 19, 2008, at 7:18 AM, Greg Watson wrote: Ralph, The problem we're seeing is just with the head node. If I specify a particular IP address for the head node in the hostfile, it gets changed to the FQDN when displayed in the map. This is a problem for us as we need to be able to match the two, and since we're not necessarily running on the head node, we can't always do the same resolution you're doing. Would it be possible to use the same address that is specified in the hostfile, or alternatively provide an XML attribute that contains this information? Thanks, Greg On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote: Not in that regard, depending upon what you mean by "recently". The only changes I am aware of wrt nodes consisted of some changes to the order in which we use the n
Re: [OMPI devel] -display-map
Hi Ralph, We've been discussing this back and forth a bit internally and don't really see an easy solution. Our problem is that Eclipse is not running on the head node, so gethostbyname will not necessarily resolve to the same address. For example, the hostfile might refer to the head node by an internal network address that is not visible to the outside world. Since gethostbyname also looks in /etc/hosts, it may resolve locally but not on a remote system. The only thing I can think of would be, rather than us reading the hostfile directly as we do now, to provide an option to ompi_info that would dump the hostfile using the same rules that you apply when you're using the hostfile. Would that be feasible? Greg On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote: Sorry for delay - was on vacation and am now trying to work my way back to the surface. I'm not sure I can fix this one for two reasons: 1. In general, OMPI doesn't really care what name is used for the node. However, the problem is that it needs to be consistent. In this case, ORTE has already used the name returned by gethostname to create its session directory structure long before mpirun reads a hostfile. This is why we retain the value from gethostname instead of allowing it to be overwritten by the name in whatever allocation we are given. Using the name in hostfile would require that I either find some way to remember any prior name, or that I tear down and rebuild the session directory tree - neither seems attractive nor simple (e.g., what happens when the user provides multiple entries in the hostfile for the node, each with a different IP address based on another interface in that node? Sounds crazy, but we have already seen it done - which one do I use?). 2. We don't actually store the hostfile info anywhere - we just use it and forget it. For us to add an XML attribute containing any hostfile-related info would therefore require us to re-read the hostfile.
I could have it do that -only- in the case of "XML output required", but it seems rather ugly. An alternative might be for you to simply do a "gethostbyname" lookup of the IP address or hostname to see if it matches instead of just doing a strcmp. This is what we have to do internally as we frequently have problems with FQDN vs. non-FQDN vs. IP addresses etc. If the local OS hasn't cached the IP address for the node in question it can take a little time to DNS resolve it, but otherwise works fine. I can point you to the code in OPAL that we use - I would think something similar would be easy to implement in your code and would readily solve the problem. Ralph On Sep 19, 2008, at 7:18 AM, Greg Watson wrote: Ralph, The problem we're seeing is just with the head node. If I specify a particular IP address for the head node in the hostfile, it gets changed to the FQDN when displayed in the map. This is a problem for us as we need to be able to match the two, and since we're not necessarily running on the head node, we can't always do the same resolution you're doing. Would it be possible to use the same address that is specified in the hostfile, or alternatively provide an XML attribute that contains this information? Thanks, Greg On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote: Not in that regard, depending upon what you mean by "recently". The only changes I am aware of wrt nodes consisted of some changes to the order in which we use the nodes when specified by hostfile or -host, and a little #if protectionism needed by Brian for the Cray port. Are you seeing this for every node? Reason I ask: I can't offhand think of anything in the code base that would replace a host name with the FQDN because we don't get that info for remote nodes. The only exception is the head node (where mpirun sits) - in that lone case, we default to the name returned to us by gethostname(). 
We do that because the head node is frequently accessible on a more global basis than the compute nodes - thus, the FQDN is required to ensure that there is no address confusion on the network. If the user refers to compute nodes in a hostfile or -host (or in an allocation from a resource manager) by non-FQDN, we just assume they know what they are doing and the name will correctly resolve to a unique address. On Sep 10, 2008, at 9:45 AM, Greg Watson wrote: Hi, Has the behavior of the -display-map option changed recently in the 1.3 branch? We're now seeing the host name as a fully resolved DN rather than the entry that was specified in the hostfile. Is there any particular reason for this? If so, would it be possible to add the hostfile entry to the output since we need to be able to match the two? Thanks, Greg
[OMPI devel] 1.3 release date?
Hi all, Has a release date been set for 1.3? Thanks, Greg
Re: [OMPI devel] -display-map
Ralph, The problem we're seeing is just with the head node. If I specify a particular IP address for the head node in the hostfile, it gets changed to the FQDN when displayed in the map. This is a problem for us as we need to be able to match the two, and since we're not necessarily running on the head node, we can't always do the same resolution you're doing. Would it be possible to use the same address that is specified in the hostfile, or alternatively provide an XML attribute that contains this information? Thanks, Greg On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote: Not in that regard, depending upon what you mean by "recently". The only changes I am aware of wrt nodes consisted of some changes to the order in which we use the nodes when specified by hostfile or -host, and a little #if protectionism needed by Brian for the Cray port. Are you seeing this for every node? Reason I ask: I can't offhand think of anything in the code base that would replace a host name with the FQDN because we don't get that info for remote nodes. The only exception is the head node (where mpirun sits) - in that lone case, we default to the name returned to us by gethostname(). We do that because the head node is frequently accessible on a more global basis than the compute nodes - thus, the FQDN is required to ensure that there is no address confusion on the network. If the user refers to compute nodes in a hostfile or -host (or in an allocation from a resource manager) by non-FQDN, we just assume they know what they are doing and the name will correctly resolve to a unique address. On Sep 10, 2008, at 9:45 AM, Greg Watson wrote: Hi, Has the behavior of the -display-map option changed recently in the 1.3 branch? We're now seeing the host name as a fully resolved DN rather than the entry that was specified in the hostfile. Is there any particular reason for this?
If so, would it be possible to add the hostfile entry to the output since we need to be able to match the two? Thanks, Greg
Re: [OMPI devel] -display-map and mpi_spawn
Hi Ralph, No I'm happy to get a map at the beginning and at every spawn. Do you send the whole map again, or only an update? Regards, Greg On Sep 11, 2008, at 9:09 AM, Ralph Castain wrote: It already somewhat does. If you use --display-map at mpirun, you automatically get display-map whenever MPI_Spawn is called. We didn't provide a mechanism by which you could only display-map for MPI_Spawn (and not for the original mpirun), but it would be trivial to do so - just have to define an info-key for that purpose. Is that what you need? On Sep 11, 2008, at 5:35 AM, Greg Watson wrote: Ralph, At the moment -display-map shows the process mapping when mpirun first starts, but I'm wondering about processes created dynamically. Would it be possible to trigger a map update when MPI_Spawn is called? Regards, Greg
[OMPI devel] -display-map and mpi_spawn
Ralph, At the moment -display-map shows the process mapping when mpirun first starts, but I'm wondering about processes created dynamically. Would it be possible to trigger a map update when MPI_Spawn is called? Regards, Greg
[OMPI devel] -display-map
Hi, Has the behavior of the -display-map option changed recently in the 1.3 branch? We're now seeing the host name as a fully resolved DN rather than the entry that was specified in the hostfile. Is there any particular reason for this? If so, would it be possible to add the hostfile entry to the output since we need to be able to match the two? Thanks, Greg
Re: [OMPI devel] IOF redesign: cmd line options
Can we also have an option to wrap stdout/err in XML tags, or were you already planning that? Greg On Aug 28, 2008, at 8:51 AM, Ralph Castain wrote: The revised IOF design calls for several new cmd line options: 1. specify which procs are to receive stdin. The options that were to be supported are: all procs, a specific proc, or no procs. The default will be rank=0 only. All procs not included will have their stdin tied to /dev/null - which means a debugger could not attach to the stdin at a later time. 2. specify which stdxxx file descriptors you want left open on your procs. Our defaults are to leave stdout/stderr/stddiag open on all procs. This option would allow the user to specify that we tie any or all of these to /dev/null 3. tag output with [job,rank] on every line. I have currently defined this option to be --tag-output. It is "off" by default, though at least one user has questioned that it should be "on" by default. Does anyone have suggestions as to the naming of these cmd line options, their behavior, and/or their default settings? Any additional requests? Thanks Ralph
Re: [OMPI devel] OMPI 1.3 problems
That fixed it. Looks like something in the latest trunk has triggered this problem. Greg On Aug 4, 2008, at 7:58 PM, Ralph Castain wrote: I see one difference, and it probably does lead to Terry's cited ticket. I always run -mca btl ^sm since I'm only testing functionality, not performance. Give that a try and see if it completes. If so, then the problem probably is related to the ticket cited by Terry. Otherwise, we'll have to consider other options. Ralph On Aug 4, 2008, at 5:50 PM, Greg Watson wrote: Configuring with ./configure --prefix=/usr/local/openmpi-1.3-devel --with-platform=contrib/platform/lanl/macosx-dynamic --disable-io-romio Recompiling the app, then running with mpirun -np 5 ./shallow All processes show R+ as their status. If I attach gdb to a worker I get the following stack trace:

(gdb) where
#0 0x9045e58a in swtch_pri ()
#1 0x904ccbc1 in sched_yield ()
#2 0x000f6480 in opal_progress () at runtime/opal_progress.c:220
#3 0x004bb0bc in opal_condition_wait ()
#4 0x004bca5c in ompi_request_wait_completion ()
#5 0x004bc92a in mca_pml_ob1_send ()
#6 0x003cdcab in MPI_Send ()
#7 0x453f in send_updated_ds (res_type=0x5040, jstart=8, jend=11, ds=0xbfff85b0, indx=57, master_id=0) at worker.c:214
#8 0x444d in worker () at worker.c:185
#9 0x2e0b in main (argc=1, argv=0xb0b8) at main.c:90

The master process shows:

(gdb) where
#0 0x9045e58a in swtch_pri ()
#1 0x904ccbc1 in sched_yield ()
#2 0x000f6480 in opal_progress () at runtime/opal_progress.c:220
#3 0x004ba8bb in opal_condition_wait ()
#4 0x004ba6e4 in ompi_request_wait_completion ()
#5 0x004ba589 in mca_pml_ob1_recv ()
#6 0x003c80aa in MPI_Recv ()
#7 0x354c in update_global_ds (res_type=0x5040, indx=57, ds=0xbfffd068) at main.c:257
#8 0x3334 in main (argc=1, argv=0xb0b8) at main.c:195

Seems to be stuck in communication. Greg On Aug 4, 2008, at 6:12 PM, Ralph Castain wrote: Can you tell us how you are configuring and your command line?
As I said, I'm having no problem running your code on my Mac w/10.5, both PowerPC and Intel. Ralph On Aug 4, 2008, at 3:10 PM, Greg Watson wrote: Yes the application does sends/receives. No, it doesn't seem to be getting past MPI_Init. I've reinstalled from a completely new 1.3 branch. Still hangs. Greg On Aug 4, 2008, at 4:45 PM, Terry Dontje wrote: Are you doing any communications? Have you gotten past MPI_Init? Could your issue be related to the following ticket? https://svn.open-mpi.org/trac/ompi/ticket/1378 --td Greg Watson wrote: I'm seeing the same behavior on trunk as 1.3. The program just hangs. Greg On Aug 4, 2008, at 2:25 PM, Ralph Castain wrote: Well, I unfortunately cannot test this right now Greg - the 1.3 branch won't build due to a problem with the man page installation script. The fix is in the trunk, but hasn't migrated across yet. :-// My guess is that you are caught on some stage where the hanging bugs hadn't been fixed, but you cannot update to the current head of the 1.3 branch as it won't compile. All I can suggest is shifting to the trunk (which definitely works) for now as the man page fix should migrate soon. Ralph On Aug 4, 2008, at 12:12 PM, Ralph Castain wrote: Depending upon the r-level, there was a problem for awhile with the system hanging that was caused by a couple of completely unrelated issues. I believe these have been fixed now - at least, it is fixed on the trunk for me under that same system. I'll check 1.3 now - it could be that some commits are missing over there. On Aug 4, 2008, at 12:06 PM, Greg Watson wrote: I have a fairly simple test program that runs fine under 1.2 on MacOS X 10.5 . When I recompile and run it under 1.3 (head of 1.3 branch) it just hangs. They are both built using --with-platform=contrib/platform/lanl/macosx-dynamic. For 1.3, I've added --disable-io-romio. Any suggestions? 
Greg
Re: [OMPI devel] OMPI 1.3 problems
Configuring with ./configure --prefix=/usr/local/openmpi-1.3-devel --with-platform=contrib/platform/lanl/macosx-dynamic --disable-io-romio Recompiling the app, then running with mpirun -np 5 ./shallow All processes show R+ as their status. If I attach gdb to a worker I get the following stack trace: (gdb) where #0 0x9045e58a in swtch_pri () #1 0x904ccbc1 in sched_yield () #2 0x000f6480 in opal_progress () at runtime/opal_progress.c:220 #3 0x004bb0bc in opal_condition_wait () #4 0x004bca5c in ompi_request_wait_completion () #5 0x004bc92a in mca_pml_ob1_send () #6 0x003cdcab in MPI_Send () #7 0x453f in send_updated_ds (res_type=0x5040, jstart=8, jend=11, ds=0xbfff85b0, indx=57, master_id=0) at worker.c:214 #8 0x444d in worker () at worker.c:185 #9 0x2e0b in main (argc=1, argv=0xb0b8) at main.c:90 The master process shows: (gdb) where #0 0x9045e58a in swtch_pri () #1 0x904ccbc1 in sched_yield () #2 0x000f6480 in opal_progress () at runtime/opal_progress.c:220 #3 0x004ba8bb in opal_condition_wait () #4 0x004ba6e4 in ompi_request_wait_completion () #5 0x004ba589 in mca_pml_ob1_recv () #6 0x003c80aa in MPI_Recv () #7 0x354c in update_global_ds (res_type=0x5040, indx=57, ds=0xbfffd068) at main.c:257 #8 0x3334 in main (argc=1, argv=0xb0b8) at main.c:195 Seems to be stuck in communication. Greg On Aug 4, 2008, at 6:12 PM, Ralph Castain wrote: Can you tell us how you are configuring and your command line? As I said, I'm having no problem running your code on my Mac w/10.5, both PowerPC and Intel. Ralph On Aug 4, 2008, at 3:10 PM, Greg Watson wrote: Yes the application does sends/receives. No, it doesn't seem to be getting past MPI_Init. I've reinstalled from a completely new 1.3 branch. Still hangs. Greg On Aug 4, 2008, at 4:45 PM, Terry Dontje wrote: Are you doing any communications? Have you gotten past MPI_Init? Could your issue be related to the following ticket? 
https://svn.open-mpi.org/trac/ompi/ticket/1378 --td Greg Watson wrote: I'm seeing the same behavior on trunk as 1.3. The program just hangs. Greg On Aug 4, 2008, at 2:25 PM, Ralph Castain wrote: Well, I unfortunately cannot test this right now Greg - the 1.3 branch won't build due to a problem with the man page installation script. The fix is in the trunk, but hasn't migrated across yet. :-// My guess is that you are caught on some stage where the hanging bugs hadn't been fixed, but you cannot update to the current head of the 1.3 branch as it won't compile. All I can suggest is shifting to the trunk (which definitely works) for now as the man page fix should migrate soon. Ralph On Aug 4, 2008, at 12:12 PM, Ralph Castain wrote: Depending upon the r-level, there was a problem for awhile with the system hanging that was caused by a couple of completely unrelated issues. I believe these have been fixed now - at least, it is fixed on the trunk for me under that same system. I'll check 1.3 now - it could be that some commits are missing over there. On Aug 4, 2008, at 12:06 PM, Greg Watson wrote: I have a fairly simple test program that runs fine under 1.2 on MacOS X 10.5 . When I recompile and run it under 1.3 (head of 1.3 branch) it just hangs. They are both built using --with-platform=contrib/platform/lanl/macosx-dynamic. For 1.3, I've added --disable-io-romio. Any suggestions? 
Greg
Re: [OMPI devel] OMPI 1.3 problems
Yes the application does sends/receives. No, it doesn't seem to be getting past MPI_Init. I've reinstalled from a completely new 1.3 branch. Still hangs. Greg On Aug 4, 2008, at 4:45 PM, Terry Dontje wrote: Are you doing any communications? Have you gotten past MPI_Init? Could your issue be related to the following ticket? https://svn.open-mpi.org/trac/ompi/ticket/1378 --td Greg Watson wrote: I'm seeing the same behavior on trunk as 1.3. The program just hangs. Greg On Aug 4, 2008, at 2:25 PM, Ralph Castain wrote: Well, I unfortunately cannot test this right now Greg - the 1.3 branch won't build due to a problem with the man page installation script. The fix is in the trunk, but hasn't migrated across yet. :-// My guess is that you are caught on some stage where the hanging bugs hadn't been fixed, but you cannot update to the current head of the 1.3 branch as it won't compile. All I can suggest is shifting to the trunk (which definitely works) for now as the man page fix should migrate soon. Ralph On Aug 4, 2008, at 12:12 PM, Ralph Castain wrote: Depending upon the r-level, there was a problem for awhile with the system hanging that was caused by a couple of completely unrelated issues. I believe these have been fixed now - at least, it is fixed on the trunk for me under that same system. I'll check 1.3 now - it could be that some commits are missing over there. On Aug 4, 2008, at 12:06 PM, Greg Watson wrote: I have a fairly simple test program that runs fine under 1.2 on MacOS X 10.5 . When I recompile and run it under 1.3 (head of 1.3 branch) it just hangs. They are both built using --with-platform=contrib/platform/lanl/macosx-dynamic. For 1.3, I've added --disable-io-romio. Any suggestions? 
Greg
Re: [OMPI devel] OMPI 1.3 problems
I'm seeing the same behavior on trunk as 1.3. The program just hangs. Greg On Aug 4, 2008, at 2:25 PM, Ralph Castain wrote: Well, I unfortunately cannot test this right now Greg - the 1.3 branch won't build due to a problem with the man page installation script. The fix is in the trunk, but hasn't migrated across yet. :-// My guess is that you are caught on some stage where the hanging bugs hadn't been fixed, but you cannot update to the current head of the 1.3 branch as it won't compile. All I can suggest is shifting to the trunk (which definitely works) for now as the man page fix should migrate soon. Ralph On Aug 4, 2008, at 12:12 PM, Ralph Castain wrote: Depending upon the r-level, there was a problem for awhile with the system hanging that was caused by a couple of completely unrelated issues. I believe these have been fixed now - at least, it is fixed on the trunk for me under that same system. I'll check 1.3 now - it could be that some commits are missing over there. On Aug 4, 2008, at 12:06 PM, Greg Watson wrote: I have a fairly simple test program that runs fine under 1.2 on MacOS X 10.5. When I recompile and run it under 1.3 (head of 1.3 branch) it just hangs. They are both built using --with-platform=contrib/platform/lanl/macosx-dynamic. For 1.3, I've added --disable-io-romio. Any suggestions? Greg
Re: [OMPI devel] OMPI 1.3 problems
Ok, I'll try that. Thanks. Greg On Aug 4, 2008, at 2:25 PM, Ralph Castain wrote: Well, I unfortunately cannot test this right now Greg - the 1.3 branch won't build due to a problem with the man page installation script. The fix is in the trunk, but hasn't migrated across yet. :-// My guess is that you are caught on some stage where the hanging bugs hadn't been fixed, but you cannot update to the current head of the 1.3 branch as it won't compile. All I can suggest is shifting to the trunk (which definitely works) for now as the man page fix should migrate soon. Ralph On Aug 4, 2008, at 12:12 PM, Ralph Castain wrote: Depending upon the r-level, there was a problem for awhile with the system hanging that was caused by a couple of completely unrelated issues. I believe these have been fixed now - at least, it is fixed on the trunk for me under that same system. I'll check 1.3 now - it could be that some commits are missing over there. On Aug 4, 2008, at 12:06 PM, Greg Watson wrote: I have a fairly simple test program that runs fine under 1.2 on MacOS X 10.5. When I recompile and run it under 1.3 (head of 1.3 branch) it just hangs. They are both built using --with-platform=contrib/platform/lanl/macosx-dynamic. For 1.3, I've added --disable-io-romio. Any suggestions? Greg
Re: [OMPI devel] Problem with SVN access.
Anton, I'm using Subversive and it seems to work fine. I wonder if there's something wrong with your setup? Greg On Aug 4, 2008, at 2:01 PM, Anton Soppelsa wrote: Dear OpenMPI repository maintainers, I tried to download the source code of OpenMPI with a couple of SVN graphical clients, namely the SVN plug-in for Eclipse and Tortoise SVN. It simply does not work. I receive the following errors: Tortoise: Command: Checkout from http://svn.open-mpi.org/svn/ompi, revision HEAD, Fully recursive, Externals included Error: OPTIONS of 'http://svn.open-mpi.org/svn/ompi': could not connect to server (http://svn.open-mpi.org) Finished! Subclipse: RA layer request failed svn: OPTIONS of 'http://svn.open-mpi.org/svn/ompi': could not connect to server (http://svn.open-mpi.org) This is the first project that does not work with the above plugins. Any ideas how to solve this problem? Cheers, Anton
[OMPI devel] OMPI 1.3 problems
I have a fairly simple test program that runs fine under 1.2 on MacOS X 10.5. When I recompile and run it under 1.3 (head of 1.3 branch) it just hangs. They are both built using --with-platform=contrib/platform/lanl/macosx-dynamic. For 1.3, I've added --disable-io-romio. Any suggestions? Greg
[OMPI devel] 1.3 build failing on MacOSX
I'm getting the following when I try and build 1.3 from SVN: gcc -DHAVE_CONFIG_H -I. -I../../adio/include -DOMPI_BUILDING=1 -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/../../../../.. -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/../../../../../opal/include -I../../../../../../../opal/include -I../../../../../../../ompi/include -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/include -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/adio/include -D_REENTRANT -g -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Wno-long-double -Werror-implicit-function-declaration -finline-functions -fno-strict-aliasing -DHAVE_ROMIOCONF_H -DHAVE_ROMIOCONF_H -I../../include -MT ad_write_nolock.lo -MD -MP -MF .deps/ad_write_nolock.Tpo -c ad_write_nolock.c -fno-common -DPIC -o .libs/ad_write_nolock.o ad_write_nolock.c: In function ‘ADIOI_NOLOCK_WriteStrided’: ad_write_nolock.c:92: error: implicit declaration of function ‘lseek64’ make[5]: *** [ad_write_nolock.lo] Error 1 make[4]: *** [all-recursive] Error 1 make[3]: *** [all-recursive] Error 1 make[2]: *** [all-recursive] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all-recursive] Error 1 Configured with: ./configure --with-platform=contrib/platform/lanl/macosx-dynamic Any ideas? Greg
Re: [OMPI devel] mpirun hangs
That fixed it, thanks. I wonder if this is the same problem I'm seeing for 1.2.x? Greg On May 27, 2008, at 10:34 PM, Ralph Castain wrote: Aha! This is a problem that continues to bite us - it relates to the pty problem in Mac OSX. Been a ton of chatter about this, but Mac doesn't seem inclined to fix it. Try configuring --disable-pty-support and see if that helps. FWIW, you will find a platform file for Mac OSX in the trunk - I always build with it, and have spent considerable time fine-tuning it. You configure with: ./configure --prefix=whatever --with-platform=contrib/platform/lanl/macosx-dynamic In that directory, you will also find platform files for static builds under both Tiger and Leopard (slight differences). ralph On 5/27/08 8:01 PM, "Greg Watson" wrote: Ralph, I tried rolling back to 18513 but no luck. Steps: $ ./autogen.sh $ ./configure --prefix=/usr/local/openmpi-1.3-devel $ make $ make install $ mpicc -g -o xxx xxx.c $ mpirun -np 2 ./xxx $ ps x 44832 s001 R+ 0:50.00 mpirun -np 2 ./xxx 44833 s001 S+ 0:00.03 ./xxx $ gdb /usr/local/openmpi-1.3-devel/bin/mpirun ... (gdb) attach 44832 Attaching to program: `/usr/local/openmpi-1.3-devel/bin/mpirun', process 44832. Reading symbols for shared libraries +.. 
done 0x9371b3dd in ioctl () (gdb) where #0 0x9371b3dd in ioctl () #1 0x93754812 in grantpt () #2 0x9375470b in openpty () #3 0x001446d9 in opal_openpty () #4 0x000bf3bf in orte_iof_base_setup_prefork () #5 0x003da62f in odls_default_fork_local_proc (context=0x216a60, child=0x216dd0, environ_copy=0x217930) at odls_default_module.c:191 #6 0x000c3e76 in orte_odls_base_default_launch_local () #7 0x003daace in orte_odls_default_launch_local_procs (data=0x216780) at odls_default_module.c:360 #8 0x000ad2f6 in process_commands (sender=0x216768, buffer=0x216780, tag=1) at orted/orted_comm.c:441 #9 0x000acd52 in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x216750) at orted/orted_comm.c:346 #10 0x0012bd21 in event_process_active () at opal_object.h:498 #11 0x0012c3c5 in opal_event_base_loop () at opal_object.h:498 #12 0x0012bf8c in opal_event_loop () at opal_object.h:498 #13 0x0011b334 in opal_progress () at runtime/opal_progress.c:169 #14 0x000cd9b4 in orte_plm_base_report_launched () at opal_object.h: 498 #15 0x000cc2b7 in orte_plm_base_launch_apps () at opal_object.h:498 #16 0x0003d626 in orte_plm_rsh_launch (jdata=0x200ae0) at plm_rsh_module.c:1126 #17 0x2604 in orterun (argc=4, argv=0xb880) at orterun.c:549 #18 0x1bd6 in main (argc=4, argv=0xb880) at main.c:13 On May 27, 2008, at 9:11 PM, Ralph Castain wrote: Yo Greg I'm not seeing any problem on my Mac OSX - I'm running Leopard. Can you tell me how you configured, and the precise command you executed? Thanks Ralph On 5/27/08 5:15 PM, "Ralph Castain" wrote: Hmmm...well, it was working about 3 hours ago! I'll try to take a look tonight, but it may be tomorrow. Try rolling it back just a little to r18513 - that's the last rev I tested on my Mac. On 5/27/08 5:00 PM, "Greg Watson" wrote: Something seems to be broken in the trunk for MacOS X. I can run a 1 process job, but a >1 process job hangs. It was working a few days ago. 
Greg
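For context on the trace above: the hang sits inside the pseudo-terminal allocation (grantpt/openpty) that ORTE performs before forking each child, which is why configuring with --disable-pty-support avoids it. A minimal Python sketch of the same mechanism follows; this is an illustration of what openpty() provides, not ORTE code, and the `allocate_io_channel` helper is an invented name:

```python
import os
import pty

# Illustration only: openpty() allocates a master/slave pseudo-terminal
# pair. In ORTE, opal_openpty() wraps this so a child's stdio can be tied
# to the slave end and forwarded via the master. The Mac OS X hang in the
# backtrace is inside grantpt()/openpty() itself, so --disable-pty-support
# (falling back to plain pipes) sidesteps the whole call.
def allocate_io_channel():
    master, slave = pty.openpty()
    return master, slave

master, slave = allocate_io_channel()
os.write(master, b"hello\n")   # appears as "terminal input" on the slave end
data = os.read(slave, 64)      # a forwarded child would read this as stdin
os.close(master)
os.close(slave)
print(data)
```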
Re: [OMPI devel] mpirun hangs
BTW, this is Leopard. Greg On May 27, 2008, at 9:11 PM, Ralph Castain wrote: Yo Greg I'm not seeing any problem on my Mac OSX - I'm running Leopard. Can you tell me how you configured, and the precise command you executed? Thanks Ralph On 5/27/08 5:15 PM, "Ralph Castain" wrote: Hmmm...well, it was working about 3 hours ago! I'll try to take a look tonight, but it may be tomorrow. Try rolling it back just a little to r18513 - that's the last rev I tested on my Mac. On 5/27/08 5:00 PM, "Greg Watson" wrote: Something seems to be broken in the trunk for MacOS X. I can run a 1 process job, but a >1 process job hangs. It was working a few days ago. Greg
Re: [OMPI devel] mpirun hangs
Ralph, I tried rolling back to 18513 but no luck. Steps: $ ./autogen.sh $ ./configure --prefix=/usr/local/openmpi-1.3-devel $ make $ make install $ mpicc -g -o xxx xxx.c $ mpirun -np 2 ./xxx $ ps x 44832 s001 R+ 0:50.00 mpirun -np 2 ./xxx 44833 s001 S+ 0:00.03 ./xxx $ gdb /usr/local/openmpi-1.3-devel/bin/mpirun ... (gdb) attach 44832 Attaching to program: `/usr/local/openmpi-1.3-devel/bin/mpirun', process 44832. Reading symbols for shared libraries +.. done 0x9371b3dd in ioctl () (gdb) where #0 0x9371b3dd in ioctl () #1 0x93754812 in grantpt () #2 0x9375470b in openpty () #3 0x001446d9 in opal_openpty () #4 0x000bf3bf in orte_iof_base_setup_prefork () #5 0x003da62f in odls_default_fork_local_proc (context=0x216a60, child=0x216dd0, environ_copy=0x217930) at odls_default_module.c:191 #6 0x000c3e76 in orte_odls_base_default_launch_local () #7 0x003daace in orte_odls_default_launch_local_procs (data=0x216780) at odls_default_module.c:360 #8 0x000ad2f6 in process_commands (sender=0x216768, buffer=0x216780, tag=1) at orted/orted_comm.c:441 #9 0x000acd52 in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x216750) at orted/orted_comm.c:346 #10 0x0012bd21 in event_process_active () at opal_object.h:498 #11 0x0012c3c5 in opal_event_base_loop () at opal_object.h:498 #12 0x0012bf8c in opal_event_loop () at opal_object.h:498 #13 0x0011b334 in opal_progress () at runtime/opal_progress.c:169 #14 0x000cd9b4 in orte_plm_base_report_launched () at opal_object.h:498 #15 0x000cc2b7 in orte_plm_base_launch_apps () at opal_object.h:498 #16 0x0003d626 in orte_plm_rsh_launch (jdata=0x200ae0) at plm_rsh_module.c:1126 #17 0x2604 in orterun (argc=4, argv=0xb880) at orterun.c:549 #18 0x1bd6 in main (argc=4, argv=0xb880) at main.c:13 On May 27, 2008, at 9:11 PM, Ralph Castain wrote: Yo Greg I'm not seeing any problem on my Mac OSX - I'm running Leopard. Can you tell me how you configured, and the precise command you executed? 
Thanks Ralph On 5/27/08 5:15 PM, "Ralph Castain" wrote: Hmmm...well, it was working about 3 hours ago! I'll try to take a look tonight, but it may be tomorrow. Try rolling it back just a little to r18513 - that's the last rev I tested on my Mac. On 5/27/08 5:00 PM, "Greg Watson" wrote: Something seems to be broken in the trunk for MacOS X. I can run a 1 process job, but a >1 process job hangs. It was working a few days ago. Greg
[OMPI devel] mpirun hangs
Something seems to be broken in the trunk for MacOS X. I can run a 1 process job, but a >1 process job hangs. It was working a few days ago. Greg
Re: [OMPI devel] Fwd: OpenMPI changes
On Mar 5, 2008, at 7:38 AM, Jeff Squyres wrote: On Mar 4, 2008, at 5:44 PM, Greg Watson wrote: I certainly don't (nor anyone in PTP as far as I know) have the resources to re-add functionality to OMPI, so unfortunately it appears that 1.2 will be the end of the line for PTP supported versions. As I mentioned to Ralph, I don't follow your developer discussions closely enough to understand the details of every change that is proposed. I can understand that. But please also appreciate our point of view: we put out public notices saying that interface changes were going to occur and specifically, deliberately asked if anyone cared. IBM did not say "stop!!". If no one says anything, how are we to know that what we're doing is going to adversely affect anyone? Looking back through the mailing list, I can only see two references that seem relevant to this. One was titled "Major reduction in ORTE" and does allude to the event model changes. The other "OMPI/ORTE and tools" talks about "alternative methods of interaction". Neither mentions changes to the spawning and I/O forwarding functionality (that I can see), or that this would be the exclusive mechanism for interaction. In the future (assuming there are more changes), it would be helpful if there was at least some information about what specific API's are being removed. BTW, IBM is simply a stakeholder in PTP. It's not up to IBM to make this decision. Cheers, Greg
Re: [OMPI devel] Fwd: OpenMPI changes
Ralph, Looking at PTP, the only thing we need is to query the process information (PID, rank, node) when the job is created. Perhaps if only queries are allowed from callbacks then recursion would be eliminated? If you can get this functionality into your new interface and back in the trunk, I'll take a look at porting PTP to use it. Thanks, Greg On Mar 4, 2008, at 6:14 PM, Ralph Castain wrote: Yeah, the problem we had in the past was: 1. something would trigger in the system - e.g., a particular job state was reached. This would cause us to execute a callback function via the GPR 2. the callback function would take some action. Typically, this involved sending out a message or calling another function. Either way, the eventual result of that action would be to cause another GPR trigger to fire - either the job or a process changing state. This loop would continue ad infinitum. Sometimes, I would see stack traces hundreds of calls deep. Debugging and maintaining something that intertwined was impossible. People tried to impose order by establishing rules about what could and could not be called from various situations, but that also proved intractable. Problem was that we could get it to work for a "normal" code path, but all the variety of failure modes, combined with all the flexibility built into the code base, created so many code paths that you inevitably wound up deadlocked under some corner case conditions. Which we generally agreed was unacceptable. It -is- possible to have callback functions that avoid this situation. However, it is very easy to make a mistake and "hang" the whole system. Just seemed easier to avoid the entire problem. (I don't get that option!) The ability to get an allocation without launching is easy to add. I/O forwarding is currently an issue. Our IOF doesn't seem to like it when I try to create an "alternate" tap (the default always goes back through the persistent orted, so the tool looks like a second "tap" on the flow). 
This is noted as a "bug" on our tracker, and I expect it will be addressed prior to releasing 1.3. I will ask that it be raised in priority. I'll review what I had done and see about bringing it into the trunk by the end of the week. Ralph On 3/4/08 4:00 PM, "Greg Watson" wrote: I don't have a problem using a different interface, assuming it's adequately supported and provides the functionality we need. I presume the recursive behavior you're referring to is calling OMPI interfaces from the callback functions. Any event-based system has this issue, and it is usually solved by clearly specifying the allowable interfaces that can be called (possibly none). Since PTP doesn't call OMPI functions from callbacks, it's not a problem for us if no interfaces can be called. The major missing features appear to be: - Ability to request a process allocation without launching the job - I/O forwarding callbacks Without these, PTP support will be so limited that I'd be reluctant to say we support OMPI. Greg On Mar 4, 2008, at 4:50 PM, Ralph H Castain wrote: It is buried deep-down in the thread, but I'll just reiterate it here. I have "restored" the ability to "subscribe" to changes in job, proc, and node state via OMPI's tool interface library. I have -not- checked this into the trunk yet, though, until the community has a chance to consider whether or not it wants it. Restoring the ability to have such changes "callback" to user functions raises the concern again about recursive behavior. We worked hard to remove recursion from the code base, and it would be a concern to see it potentially re-enter. I realize there is some difference between ORTE calling back into itself vs calling back into a user-specified function. However, unless that user truly understands ORTE/OMPI and takes considerable precautions, it is very easy to recreate the recursive behavior without intending to do so. The tool interface library was built to accomplish two things: 1. 
help reduce the impact on external tools of changes to ORTE/OMPI interfaces, and 2. provide a degree of separation to prevent the tool from inadvertently causing OMPI to "behave badly" I think we accomplished that - I would encourage you to at least consider using the library. If there is something missing, we can always add it. Ralph On 3/4/08 2:37 PM, "Jeff Squyres" wrote: Greg -- I admit to being a bit puzzled here. Ralph sent around RFCs about these changes many months ago. Everyone said they didn't want this functionality -- it was seen as excess functionality that Open MPI didn't want or need -- so it was all removed. As such, I have to agree with Ralph that it is an "enhancement" to re-add the functionality. That being said, patches
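The recursive-trigger loop Ralph describes (a state-change callback whose action fires another state change, nesting without bound) can be sketched in a few lines. This is a generic illustration with invented names, not ORTE's GPR code; a deferral queue is one common way to keep dispatch flat, alongside the "queries only from callbacks" restriction Greg suggests:

```python
# Sketch of the hazard: a subscriber reacts to a state change by setting
# another state, which would re-enter the dispatch loop. Deferring
# nested triggers keeps the call stack flat instead of hundreds deep.
class Registry:
    def __init__(self):
        self.callbacks = []
        self._dispatching = False
        self._deferred = []

    def subscribe(self, cb):
        self.callbacks.append(cb)

    def set_state(self, state):
        if self._dispatching:
            # Raised from inside a callback: queue it rather than recurse.
            self._deferred.append(state)
            return
        self._dispatching = True
        pending = [state]
        while pending:
            s = pending.pop(0)
            for cb in self.callbacks:
                cb(s)
            pending.extend(self._deferred)
            self._deferred.clear()
        self._dispatching = False

seen = []
reg = Registry()

def on_change(state):
    seen.append(state)
    if state == "LAUNCHING":      # callback reacts by advancing the job,
        reg.set_state("RUNNING")  # which would recurse without the guard

reg.subscribe(on_change)
reg.set_state("LAUNCHING")
print(seen)  # ['LAUNCHING', 'RUNNING'] - flat dispatch, no nesting
```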
Re: [OMPI devel] Fwd: OpenMPI changes
I don't have a problem using a different interface, assuming it's adequately supported and provides the functionality we need. I presume the recursive behavior you're referring to is calling OMPI interfaces from the callback functions. Any event-based system has this issue, and it is usually solved by clearly specifying the allowable interfaces that can be called (possibly none). Since PTP doesn't call OMPI functions from callbacks, it's not a problem for us if no interfaces can be called. The major missing features appear to be: - Ability to request a process allocation without launching the job - I/O forwarding callbacks Without these, PTP support will be so limited that I'd be reluctant to say we support OMPI. Greg On Mar 4, 2008, at 4:50 PM, Ralph H Castain wrote: It is buried deep-down in the thread, but I'll just reiterate it here. I have "restored" the ability to "subscribe" to changes in job, proc, and node state via OMPI's tool interface library. I have -not- checked this into the trunk yet, though, until the community has a chance to consider whether or not it wants it. Restoring the ability to have such changes "callback" to user functions raises the concern again about recursive behavior. We worked hard to remove recursion from the code base, and it would be a concern to see it potentially re-enter. I realize there is some difference between ORTE calling back into itself vs calling back into a user-specified function. However, unless that user truly understands ORTE/OMPI and takes considerable precautions, it is very easy to recreate the recursive behavior without intending to do so. The tool interface library was built to accomplish two things: 1. help reduce the impact on external tools of changes to ORTE/OMPI interfaces, and 2. provide a degree of separation to prevent the tool from inadvertently causing OMPI to "behave badly" I think we accomplished that - I would encourage you to at least consider using the library. 
If there is something missing, we can always add it. Ralph On 3/4/08 2:37 PM, "Jeff Squyres" wrote: Greg -- I admit to being a bit puzzled here. Ralph sent around RFCs about these changes many months ago. Everyone said they didn't want this functionality -- it was seen as excess functionality that Open MPI didn't want or need -- so it was all removed. As such, I have to agree with Ralph that it is an "enhancement" to re-add the functionality. That being said, patches are always welcome! IBM has signed the OMPI 3rd party contribution agreement, so it could be contributed directly. Sidenote: I was also under the impression that PTP was being re-geared towards STCI and moving away from ORTE anyway. Is this incorrect? On Mar 4, 2008, at 3:24 PM, Greg Watson wrote: Hi all, Ralph informs me that significant functionality has been removed from ORTE in 1.3. Unfortunately this functionality was being used by PTP to provide support for OMPI, and without it, it seems unlikely that PTP will be able to work with 1.3. Apparently restoring this lost functionality is an "enhancement" of 1.3, and so is something that will not necessarily be done. Having worked with OMPI from a very early stage to ensure that we were able to provide robust support, I must say it is a bit disappointing that this approach is being taken. I hope that the community will view this "enhancement" as worthwhile. Regards, Greg Begin forwarded message: On 2/29/08 7:13 AM, "Gregory R Watson" wrote: Ralph Castain wrote on 02/29/2008 12:18:39 AM: Ralph Castain 02/29/08 12:18 AM To Gregory R Watson/Watson/IBM@IBMUS cc Subject Re: OpenMPI changes Hi Greg All of the prior options (and some new ones) for spawning a job are fully supported in the new interface. Instead of setting them with "attributes", you create an orte_job_t object and just fill them in. This is precisely how mpirun does it - you can look at that code if you want an example, though it is somewhat complex. 
Alternatively, you can look at the way it is done for comm_spawn, which may be more analogous to your situation - that code is in ompi/mca/dpm/orte. All the tools library does is communicate the job object to the target persistent daemon so it can do the work. This way, you don't have to open all the frameworks, deal directly with the plm interface, etc. Alternatively, you are welcome to do a full orte_init and use the frameworks yourself - there is no requirement to use the library. I only offer it as an alternative. As far as I can tell, neither API provides the same functionality as that available in 1.2. While this might be beneficial for OMPI-specific activities, the changes appear to severely limit the interaction of tools with the runtime. At this po
Re: [OMPI devel] Fwd: OpenMPI changes
I certainly don't (nor anyone in PTP as far as I know) have the resources to re-add functionality to OMPI, so unfortunately it appears that 1.2 will be the end of the line for PTP-supported versions. As I mentioned to Ralph, I don't follow your developer discussions closely enough to understand the details of every change that is proposed. Since PTP has provided requirements and been supported since 1.0, I was under the (seemingly incorrect) impression that this support would continue in future versions. PTP will very likely support STCI when it becomes available. However, the intention was to continue to support OMPI also. Maybe this will be possible without ORTE, but it seems uncertain at this stage.

Greg

On Mar 4, 2008, at 4:37 PM, Jeff Squyres wrote:

Greg -- I admit to being a bit puzzled here. Ralph sent around RFCs about these changes many months ago. Everyone said they didn't want this functionality -- it was seen as excess functionality that Open MPI didn't want or need -- so it was all removed. As such, I have to agree with Ralph that it is an "enhancement" to re-add the functionality. That being said, patches are always welcome! IBM has signed the OMPI 3rd party contribution agreement, so it could be contributed directly.

Sidenote: I was also under the impression that PTP was being re-geared towards STCI and moving away from ORTE anyway. Is this incorrect?

On Mar 4, 2008, at 3:24 PM, Greg Watson wrote:

Hi all, Ralph informs me that significant functionality has been removed from ORTE in 1.3. Unfortunately this functionality was being used by PTP to provide support for OMPI, and without it, it seems unlikely that PTP will be able to work with 1.3. Apparently restoring this lost functionality is an "enhancement" of 1.3, and so is something that will not necessarily be done. Having worked with OMPI from a very early stage to ensure that we were able to provide robust support, I must say it is a bit disappointing that this approach is being taken.
I hope that the community will view this "enhancement" as worthwhile. Regards, Greg Begin forwarded message: On 2/29/08 7:13 AM, "Gregory R Watson" wrote: Ralph Castain wrote on 02/29/2008 12:18:39 AM: Ralph Castain 02/29/08 12:18 AM To Gregory R Watson/Watson/IBM@IBMUS cc Subject Re: OpenMPI changes Hi Greg All of the prior options (and some new ones) for spawning a job are fully supported in the new interface. Instead of setting them with "attributes", you create an orte_job_t object and just fill them in. This is precisely how mpirun does it - you can look at that code if you want an example, though it is somewhat complex. Alternatively, you can look at the way it is done for comm_spawn, which may be more analogous to your situation - that code is in ompi/mca/dpm/orte. All the tools library does is communicate the job object to the target persistent daemon so it can do the work. This way, you don't have to open all the frameworks, deal directly with the plm interface, etc. Alternatively, you are welcome to do a full orte_init and use the frameworks yourself - there is no requirement to use the library. I only offer it as an alternative. As far as I can tell, neither API provides the same functionality as that available in 1.2. While this might be beneficial for OMPI-specific activities, the changes appear to severely limit the interaction of tools with the runtime. At this point, I can't see either interface supporting PTP. I went ahead and added a notification capability to the system - took about 30 minutes. I can provide notice of job and process state changes since I see those. Node state changes, however, are different - I can notify on them, but we have no way of seeing them. None of the environments we support tell us when a node fails. I know that the tool library works because it uses the identical APIs as comm_spawn and mpirun. I have also tested them by building my own tools. 
There's a big difference between being on a code path that *must* work because it is used by core components and one that is provided as an add-on for external tools. I may be worrying needlessly if this new interface becomes an "officially supported" API. Is that planned? At a minimum, it seems like it's going to complicate your testing process, since you're going to need to provide a separate set of tests that exercise this interface independent of the rest of OMPI.

It is an officially supported API. Testing is not as big a problem as you might expect since the library exercises the same code paths as mpirun and comm_spawn. Like I said, I have written my own tools that exercise the library - no problem using them as tests.

We do not launch an orted for any tool-library query. All we do is communicate the query to the target persistent daemon or mpiru
[OMPI devel] Fwd: OpenMPI changes
Hi all, Ralph informs me that significant functionality has been removed from ORTE in 1.3. Unfortunately this functionality was being used by PTP to provide support for OMPI, and without it, it seems unlikely that PTP will be able to work with 1.3. Apparently restoring this lost functionality is an "enhancement" of 1.3, and so is something that will not necessarily be done. Having worked with OMPI from a very early stage to ensure that we were able to provide robust support, I must say it is a bit disappointing that this approach is being taken. I hope that the community will view this "enhancement" as worthwhile. Regards, Greg Begin forwarded message: On 2/29/08 7:13 AM, "Gregory R Watson" wrote: > > > Ralph Castain wrote on 02/29/2008 12:18:39 AM: > >> Ralph Castain >> 02/29/08 12:18 AM >> >> To >> >> Gregory R Watson/Watson/IBM@IBMUS >> >> cc >> >> Subject >> >> Re: OpenMPI changes >> >> Hi Greg >> >> All of the prior options (and some new ones) for spawning a job are fully >> supported in the new interface. Instead of setting them with "attributes", >> you create an orte_job_t object and just fill them in. This is precisely how >> mpirun does it - you can look at that code if you want an example, though it >> is somewhat complex. Alternatively, you can look at the way it is done for >> comm_spawn, which may be more analogous to your situation - that code is in >> ompi/mca/dpm/orte. >> >> All the tools library does is communicate the job object to the target >> persistent daemon so it can do the work. This way, you don't have to open >> all the frameworks, deal directly with the plm interface, etc. >> >> Alternatively, you are welcome to do a full orte_init and use the frameworks >> yourself - there is no requirement to use the library. I only offer it as an >> alternative. > > As far as I can tell, neither API provides the same functionality as that > available in 1.2. 
While this might be beneficial for OMPI-specific activities, > the changes appear to severely limit the interaction of tools with the > runtime. At this point, I can't see either interface supporting PTP. I went ahead and added a notification capability to the system - took about 30 minutes. I can provide notice of job and process state changes since I see those. Node state changes, however, are different - I can notify on them, but we have no way of seeing them. None of the environments we support tell us when a node fails. > >> >> I know that the tool library works because it uses the identical APIs as >> comm_spawn and mpirun. I have also tested them by building my own tools. > > There's a big difference being on a code path that *must* work because it is > used by core components, to one that is provided as an add-on for external > tools. I may be worrying needlessly if this new interface becomes an > "officially supported" API. Is that planned? At a minimum, it seems like it's > going to complicate your testing process, since you're going to need to > provide a separate set of tests that exercise this interface independent of > the rest of OMPI. It is an officially supported API. Testing is not as big a problem as you might expect since the library exercises the same code paths as mpirun and comm_spawn. Like I said, I have written my own tools that exercise the library - no problem using them as tests. > >> >> We do not launch an orted for any tool-library query. All we do is >> communicate the query to the target persistent daemon or mpirun. Those >> entities have recv's posted to catch any incoming messages and execute the >> request. >> >> You are correct that we no longer have event driven notification in the >> system. I repeatedly asked the community (on both devel and core lists) for >> input on that question, and received no indications that anyone wanted it >> supported. 
It can be added back into the system, but would require the >> approval of the OMPI community. I don't know how problematic that would be - >> there is a lot of concern over the amount of memory, overhead, and potential >> reliability issues that surround event notification. If you want that >> capability, I suggest we discuss it, come up with a plan that deals with >> those issues, and then take a proposal to the devel list for discussion. >> >> As for reliability, the objectives of the last year's effort were precisely >> scalability and reliability. We did a lot of work to eliminate recursive >> deadlocks and improve the reliability of the code. Our current testing >> indicates we had considerable success in that regard, particularly with the >> recursion elimination commit earlier today. >> >> I would be happy to work with you to meet the PTP's needs - we'll just need >> to work with the OMPI community to ensure everyone buys into the plan. If it >> would help, I could come and review the new arch with the team (I already
[OMPI devel] Leopard problems
Hi,

Since I upgraded to MacOS X 10.5.1, I've been having problems running MPI programs (using both 1.2.4 and 1.2.5). The symptoms are intermittent (i.e. sometimes the application runs fine), and appear as follows:

1. One or more of the application processes die (I've seen both one and two processes die).
2. It appears that the orteds associated with these application processes then spin continually.

Here is what I see when I run "mpirun -np 4 ./mpitest":

12467 ?? Rs 1:26.52 orted --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_start 0 --nodename node0 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12468 ?? Rs 1:26.63 orted --bootproxy 1 --name 0.0.2 --num_procs 5 --vpid_start 0 --nodename node1 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12469 ?? Ss 0:00.04 orted --bootproxy 1 --name 0.0.3 --num_procs 5 --vpid_start 0 --nodename node2 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12470 ?? Ss 0:00.04 orted --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0 --nodename node3 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12471 ?? S 0:00.05 ./mpitest
12472 ?? S 0:00.05 ./mpitest

Killing the mpirun results in:

$ mpirun -np 4 ./mpitest
^Cmpirun: killing job...
^C
-- WARNING: mpirun is in the process of killing a job, but has detected an interruption (probably control-C). It is dangerous to interrupt mpirun while it is killing a job (proper termination may not be guaranteed). Hit control-C again within 1 second if you really want to kill mpirun immediately. --
^Cmpirun: forcibly killing job...
-- WARNING: mpirun has exited before it received notification that all started processes had terminated. You should double check and ensure that there are no runaway processes still executing. --

At this point, the two spinning orteds are left running, and the only way to kill them is with -9. Is anyone else seeing this problem?

Greg
Re: [OMPI devel] thread model
On Aug 27, 2007, at 10:04 PM, Jeff Squyres wrote:

On Aug 27, 2007, at 2:50 PM, Greg Watson wrote:

Until now I haven't had to worry about the opal/orte thread model. However, there are now people who would like to use ompi that has been configured with --with-threads=posix and --with-enable-mpi-threads. Can someone give me some pointers as to what I need to do in order to make sure I don't violate any threading model?

Note that this is *NOT* well tested. There is work going on right now to make the OMPI layer be able to support MPI_THREAD_MULTIPLE (support was designed in from the beginning, but we haven't ever done any kind of comprehensive testing/stressing of multi-thread support such that it is pretty much guaranteed not to work), but it is occurring on the trunk (i.e., what will eventually become v1.3) -- not the v1.2 branch.

The interfaces I'm calling are:

opal_event_loop()

Brian or George will have to answer about that one...

opal_path_findv()

This guy should be multi-thread safe (disclaimer: haven't tested it myself); it doesn't rely on any global state.

orte_init() orte_ns.create_process_name() orte_iof.iof_subscribe() orte_iof.iof_unsubscribe() orte_schema.get_job_segment_name() orte_gpr.get() orte_dss.get() orte_rml.send_buffer() orte_rmgr.spawn_job() orte_pls.terminate_job() orte_rds.query() orte_smr.job_stage_gate_subscribe() orte_rmgr.get_vpid_range()

Note that all of ORTE is *NOT* thread safe, nor is it planned to be (it just seemed way more trouble than it was worth). You need to serialize access to it.

Does that mean just calling OPAL_THREAD_LOCK() and OPAL_THREAD_UNLOCK() around each?

Greg
[OMPI devel] thread model
Hi,

Until now I haven't had to worry about the opal/orte thread model. However, there are now people who would like to use ompi that has been configured with --with-threads=posix and --with-enable-mpi-threads. Can someone give me some pointers as to what I need to do in order to make sure I don't violate any threading model?

The interfaces I'm calling are:

opal_event_loop() opal_path_findv() orte_init() orte_ns.create_process_name() orte_iof.iof_subscribe() orte_iof.iof_unsubscribe() orte_schema.get_job_segment_name() orte_gpr.get() orte_dss.get() orte_rml.send_buffer() orte_rmgr.spawn_job() orte_pls.terminate_job() orte_rds.query() orte_smr.job_stage_gate_subscribe() orte_rmgr.get_vpid_range()

Thanks, Greg
Re: [OMPI devel] RH Enterprise Linux issue
Scratch that. The problem was an installation over an old copy of ompi. Obviously picking up some old stuff. Sorry for the disturbance. Back to the bat cave...

Greg

On Mar 22, 2007, at 12:46 PM, Jeff Squyres wrote:

Yes, if you could recompile with debugging, that would be great. What launcher are you trying to use?

On Mar 22, 2007, at 2:35 PM, Greg Watson wrote:

gdb says this:

#0 0x2e342e33 in ?? ()
#1 0xb7fe1d31 in orte_pls_base_select () from /usr/local/lib/libopen-rte.so.0
#2 0xb7fc50cb in orte_init_stage1 () from /usr/local/lib/libopen-rte.so.0
#3 0xb7fc84be in orte_system_init () from /usr/local/lib/libopen-rte.so.0
#4 0xb7fc4cee in orte_init () from /usr/local/lib/libopen-rte.so.0
#5 0x08049ecb in orterun (argc=4, argv=0xb9f4) at orterun.c:369
#6 0x08049d7a in main (argc=4, argv=0xb9f4) at main.c:13
(gdb) The program is running. Exit anyway? (y or n) y

I can recompile with debugging if that would be useful. Let me know if there's anything else I can do. Here's ompi_info in case it helps:

Open MPI: 1.2 Open MPI SVN revision: r14027 Open RTE: 1.2 Open RTE SVN revision: r14027 OPAL: 1.2 OPAL SVN revision: r14027 Prefix: /usr/local Configured architecture: i686-pc-linux-gnu Configured on: Thu Mar 22 13:39:30 EDT 2007 Built on: Thu Mar 22 13:55:38 EDT 2007 C bindings: yes C++ bindings: yes Fortran77 bindings: yes (all) Fortran90 bindings: no Fortran90 bindings size: na C compiler: gcc C compiler absolute: /usr/bin/gcc C++ compiler: g++ C++ compiler absolute: /usr/bin/g++ Fortran77 compiler: g77 Fortran77 compiler abs: /usr/bin/g77 Fortran90 compiler: none Fortran90 compiler abs: none C profiling: yes C++ profiling: yes Fortran77 profiling: yes Fortran90 profiling: no C++ exceptions: no Thread support: no Internal debug support: no MPI parameter check: runtime Memory profiling support: no Memory debugging support: no libltdl support: yes Heterogeneous support: yes mpirun default --prefix: no mca: base: component_find: unable to open pml teg: file not found
(ignored) MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2) MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2) MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2) MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2) MCA timer: linux (MCA v1.0, API v1.0, Component v1.2) MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0) MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0) MCA coll: basic (MCA v1.0, API v1.0, Component v1.2) MCA coll: self (MCA v1.0, API v1.0, Component v1.2) MCA coll: sm (MCA v1.0, API v1.0, Component v1.2) MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2) MCA io: romio (MCA v1.0, API v1.0, Component v1.2) MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2) MCA pml: cm (MCA v1.0, API v1.0, Component v1.2) MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2) MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2) MCA rcache: rb (MCA v1.0, API v1.0, Component v1.2) MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2) MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2) MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2) MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0) MCA topo: unity (MCA v1.0, API v1.0, Component v1.2) MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2) MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2) MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2) MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2) MCA gpr: null (MCA v1.0, API v1.0, Component v1.2) MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2) MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2) MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2) MCA iof: svc (MCA v1.0, API v1.0, Component v1.2) MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2) MCA ns: replica (MCA v1.0, API v2.0, Component v1.2) MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0) MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2) MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2) MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0.2) MCA ras: 
localhost (MCA v1.0
Re: [OMPI devel] RH Enterprise Linux issue
, API v2.0, Component v1.2) MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2) MCA rml: oob (MCA v1.0, API v1.0, Component v1.2) MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0.2) MCA pls: fork (MCA v1.0, API v1.0, Component v1.0.2) MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2) MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2) MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2) MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2) MCA sds: env (MCA v1.0, API v1.0, Component v1.2) MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2) MCA sds: seed (MCA v1.0, API v1.0, Component v1.2) MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2) MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2) Greg On Mar 22, 2007, at 12:29 PM, Jeff Squyres wrote: No, not a known problem -- my cluster is RHEL4U4 -- I use it for many thousands of runs of the OMPI v1.2 branch every day... Can you see where it's dying in orte_init_stage1? On Mar 22, 2007, at 2:17 PM, Greg Watson wrote: Is this a known problem? 
Building ompi 1.2 on RHEL4: ./configure --with-devel-headers --without-threads (actually tried without '--without-threads' too, but no change) $ mpirun -np 2 test [beth:06029] *** Process received signal *** [beth:06029] Signal: Segmentation fault (11) [beth:06029] Signal code: Address not mapped (1) [beth:06029] Failing at address: 0x2e342e33 [beth:06029] [ 0] /lib/tls/libc.so.6 [0x21b890] [beth:06029] [ 1] /usr/local/lib/libopen-rte.so.0(orte_init_stage1 +0x293) [0xb7fc50cb] [beth:06029] [ 2] /usr/local/lib/libopen-rte.so.0(orte_system_init +0x1e) [0xb7fc84be] [beth:06029] [ 3] /usr/local/lib/libopen-rte.so.0(orte_init+0x6a) [0xb7fc4cee] [beth:06029] [ 4] mpirun(orterun+0x14b) [0x8049ecb] [beth:06029] [ 5] mpirun(main+0x2a) [0x8049d7a] [beth:06029] [ 6] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0x208de3] [beth:06029] [ 7] mpirun [0x8049cc9] [beth:06029] *** End of error message *** Segmentation fault Thanks, Greg ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems
Re: [OMPI devel] RH Enterprise Linux issue
Oh, and this is a single x86 machine. Just trying to launch locally. $uname -a Linux 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux Greg On Mar 22, 2007, at 12:17 PM, Greg Watson wrote: Is this a known problem? Building ompi 1.2 on RHEL4: ./configure --with-devel-headers --without-threads (actually tried without '--without-threads' too, but no change) $ mpirun -np 2 test [beth:06029] *** Process received signal *** [beth:06029] Signal: Segmentation fault (11) [beth:06029] Signal code: Address not mapped (1) [beth:06029] Failing at address: 0x2e342e33 [beth:06029] [ 0] /lib/tls/libc.so.6 [0x21b890] [beth:06029] [ 1] /usr/local/lib/libopen-rte.so.0(orte_init_stage1 +0x293) [0xb7fc50cb] [beth:06029] [ 2] /usr/local/lib/libopen-rte.so.0(orte_system_init +0x1e) [0xb7fc84be] [beth:06029] [ 3] /usr/local/lib/libopen-rte.so.0(orte_init+0x6a) [0xb7fc4cee] [beth:06029] [ 4] mpirun(orterun+0x14b) [0x8049ecb] [beth:06029] [ 5] mpirun(main+0x2a) [0x8049d7a] [beth:06029] [ 6] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0x208de3] [beth:06029] [ 7] mpirun [0x8049cc9] [beth:06029] *** End of error message *** Segmentation fault Thanks, Greg
[OMPI devel] RH Enterprise Linux issue
Is this a known problem? Building ompi 1.2 on RHEL4: ./configure --with-devel-headers --without-threads (actually tried without '--without-threads' too, but no change) $ mpirun -np 2 test [beth:06029] *** Process received signal *** [beth:06029] Signal: Segmentation fault (11) [beth:06029] Signal code: Address not mapped (1) [beth:06029] Failing at address: 0x2e342e33 [beth:06029] [ 0] /lib/tls/libc.so.6 [0x21b890] [beth:06029] [ 1] /usr/local/lib/libopen-rte.so.0(orte_init_stage1 +0x293) [0xb7fc50cb] [beth:06029] [ 2] /usr/local/lib/libopen-rte.so.0(orte_system_init +0x1e) [0xb7fc84be] [beth:06029] [ 3] /usr/local/lib/libopen-rte.so.0(orte_init+0x6a) [0xb7fc4cee] [beth:06029] [ 4] mpirun(orterun+0x14b) [0x8049ecb] [beth:06029] [ 5] mpirun(main+0x2a) [0x8049d7a] [beth:06029] [ 6] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0x208de3] [beth:06029] [ 7] mpirun [0x8049cc9] [beth:06029] *** End of error message *** Segmentation fault Thanks, Greg
Re: [OMPI devel] Open MPI v1.2rc4 has been posted
Looks good.

Greg

On Mar 13, 2007, at 5:37 PM, Tim Mattox wrote:

Hi All, The fourth release candidate of v1.2 is now up on the website: http://www.open-mpi.org/software/ompi/v1.2/ Please run it through its paces as best you can. This fixes an incorrect result in MPI_Allgather when using MPI_IN_PLACE with only two processes. Also, one more compiler warning is fixed.

-- Tim Mattox - http://homepage.mac.com/tmattox/ tmat...@gmail.com || timat...@open-mpi.org I'm a bright... http://www.the-brights.net/
Re: [OMPI devel] Open MPI v1.2rc3 has been posted
Termination seems to be working again with this version. Don't know if it was something I was doing or not, but I'm happy now... :-)

Greg

On Mar 12, 2007, at 5:11 PM, Tim Mattox wrote:

Hi All, The third release candidate of v1.2 is now up on the website: http://www.open-mpi.org/software/ompi/v1.2/ Please run it through its paces as best you can. This fixes two compiler warnings introduced in rc2.

-- Tim Mattox - http://homepage.mac.com/tmattox/ tmat...@gmail.com || timat...@open-mpi.org I'm a bright... http://www.the-brights.net/
Re: [OMPI devel] Open MPI v1.2rc2 has been posted
Not sure. The message was: OOB: Connection to HNP lost

I have a bigger problem now though. As of rc1, terminating a job no longer works. I'll try rc2 and let you know if the problem persists. Since the API for terminate changed recently, I updated the code to replicate what happens in orterun. However, this doesn't seem to work correctly (at least in our case).

Greg

On Mar 10, 2007, at 4:11 AM, Jeff Squyres wrote:

Hopefully. Was it IOF-related? The error was that some in-flight IOF fragments (meaning that they had been read from the local source and were in the process of being sent across OOB) could be incorrectly removed from the list, later causing either a segv in production builds or, more reliably, an assertion failure in debugging builds.

On Mar 9, 2007, at 10:49 PM, Greg Watson wrote:

Thanks. I was seeing an error when I shut down orted. Sounds like it's now fixed... Greg

On Mar 9, 2007, at 5:25 PM, Jeff Squyres wrote:

- An IOF race condition in the shutdown of the orted
- Some sm btl fixes
- Patch to change Libtool 2.0 libltdl's behavior with regards to lt_dlopen'ing DSOs

On Mar 9, 2007, at 7:54 PM, Greg Watson wrote:

What changed between rc1 and rc2? Greg

On Mar 9, 2007, at 1:50 PM, Tim Mattox wrote:

Hi All, The second release candidate of v1.2 is now up on the website: http://www.open-mpi.org/software/ompi/v1.2/ Please run it through its paces as best you can.

-- Tim Mattox - http://homepage.mac.com/tmattox/ tmat...@gmail.com || timat...@open-mpi.org I'm a bright...
http://www.the-brights.net/ -- Jeff Squyres Cisco Systems
Re: [OMPI devel] Open MPI v1.2rc2 has been posted
Thanks. I was seeing an error when I shut down orted. Sounds like it's now fixed...

Greg

On Mar 9, 2007, at 5:25 PM, Jeff Squyres wrote:

- An IOF race condition in the shutdown of the orted
- Some sm btl fixes
- Patch to change Libtool 2.0 libltdl's behavior with regards to lt_dlopen'ing DSOs

On Mar 9, 2007, at 7:54 PM, Greg Watson wrote:

What changed between rc1 and rc2? Greg

On Mar 9, 2007, at 1:50 PM, Tim Mattox wrote:

Hi All, The second release candidate of v1.2 is now up on the website: http://www.open-mpi.org/software/ompi/v1.2/ Please run it through its paces as best you can.

-- Tim Mattox - http://homepage.mac.com/tmattox/ tmat...@gmail.com || timat...@open-mpi.org I'm a bright... http://www.the-brights.net/

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] Open MPI v1.2rc2 has been posted
What changed between rc1 and rc2?

Greg

On Mar 9, 2007, at 1:50 PM, Tim Mattox wrote:

Hi All, The second release candidate of v1.2 is now up on the website: http://www.open-mpi.org/software/ompi/v1.2/ Please run it through its paces as best you can.

-- Tim Mattox - http://homepage.mac.com/tmattox/ tmat...@gmail.com || timat...@open-mpi.org I'm a bright... http://www.the-brights.net/
Re: [OMPI devel] orted --seed and orte_init()
On Feb 3, 2007, at 10:35 AM, Ralph Castain wrote:

Something did occur to me that *might* help with the problem of detecting when the seed is running. There is an option to orted "--report-uri pipe" that will cause the orted to write its uri to the specified pipe. This comes after the orted has completed orte_init, and so it *should* be ready at that time for you to connect to it. So you might try using that option when you kick off the seed, and then reading from the pipe until you get the uri back. Or you can just wait to see when the pipe closes since the orted closes the pipe immediately after writing to it. There is still some stuff that the orted does before it accepts commands sent directly to it etc., but that shouldn't impact your ability to connect.

Let me know how that goes. If we need to do so, we can shift the timing of that report-uri output so it comes a little later in the orted's setup.

Ralph

On 2/3/07 6:51 AM, "Ralph Castain" wrote:

I'll give it a try this week. Greg
Re: [OMPI devel] orted --seed and orte_init()
On Feb 3, 2007, at 6:51 AM, Ralph Castain wrote: On 2/2/07 8:44 AM, "Greg Watson" wrote: We're launching a seed daemon so that we can get registry persistence across multiple job launches. However, there is a race condition between launching the daemon and the first call to orte_init() that can result in a bus error. We set the OMPI_MCA_universe and OMPI_MCA_orte_univ_exist environment variables prior to calling orte_init() so that orte knows how to connect to the daemon, but if the daemon hasn't started this causes a bus error in orte_rds_base_close(). Stack trace below. Exception: EXC_BAD_ACCESS (0x0001) Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x001c Thread 0 Crashed: 0 libopen-rte.0.dylib 0x000c6d59 orte_rds_base_close + 66 1 libopen-rte.0.dylib 0x000a3ba7 orte_system_finalize + 121 2 libopen-rte.0.dylib 0x000d41f9 orte_sds_base_basic_contact_universe + 648 3 libopen-rte.0.dylib 0x000a06ce orte_init_stage1 + 898 4 libopen-rte.0.dylib 0x000a3c0b orte_system_init + 25 5 libopen-rte.0.dylib 0x000a0190 orte_init + 81 Hmmm...can you tell me which version you are working with? Obviously, that shouldn't happen. My best initial guess is that rds is being opened, but hasn't selected components yet when we try to contact the universe. When that fails and we call finalize, rds tries to "close" a component list that is NULL. I can look into that. 1.2b3 Greg
[OMPI devel] orted --seed and orte_init()
We're launching a seed daemon so that we can get registry persistence across multiple job launches. However, there is a race condition between launching the daemon and the first call to orte_init() that can result in a bus error. We set the OMPI_MCA_universe and OMPI_MCA_orte_univ_exist environment variables prior to calling orte_init() so that orte knows how to connect to the daemon, but if the daemon hasn't started this causes a bus error in orte_rds_base_close(). Stack trace below. Exception: EXC_BAD_ACCESS (0x0001) Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x001c Thread 0 Crashed: 0 libopen-rte.0.dylib 0x000c6d59 orte_rds_base_close + 66 1 libopen-rte.0.dylib 0x000a3ba7 orte_system_finalize + 121 2 libopen-rte.0.dylib 0x000d41f9 orte_sds_base_basic_contact_universe + 648 3 libopen-rte.0.dylib 0x000a06ce orte_init_stage1 + 898 4 libopen-rte.0.dylib 0x000a3c0b orte_system_init + 25 5 libopen-rte.0.dylib 0x000a0190 orte_init + 81 A related question, is there any way to check for the daemon other than calling orte_init()? At the moment we just sleep for a few seconds after launching the daemon, but this is obviously not a very satisfactory solution. I can't see any places where this is done in the source. Thanks, Greg
Re: [OMPI devel] Urgent: ORTE_RML_NAME_SEED removed from 1.2b3!
On Jan 30, 2007, at 9:39 AM, Ralph H Castain wrote:

On 1/30/07 9:24 AM, "Greg Watson" wrote:

Yes, we need the hostfile information before job execution. We call setup_job() before a debug job to request a process allocation for the application being debugged. We use spawn() to launch a non-debug application. It sounds like I should just leave things the way they currently are.

I think we've had the discussion about bproc before, but the reason we still support 1.0.2 is that the registry *is* populated with node information prior to launch. This was an agreed-on feature that OpenMPI was to provide for PTP. I haven't been able to test 1.2 on a bproc machine (since I can't get it to work), but it sounds like the changes removed this functionality. Frankly, this makes OpenMPI less attractive to us, since we now have to go and get this information ourselves. My thinking now is that in the future we probably won't use OpenMPI for anything other than building and launching the application.

Decisions such as that, of course, are up to you. Meantime, take a gander at the data in ORTE_BPROC_NODE_SEGMENT within the registry. I tried to preserve some of what was being done, even though the method used to populate the bproc data was problematic and not really correct. You may find that the info stuck in there meets your needs for the GUI.

My point, though, is that only bproc and hostfile would be supported under the best of current conditions, and we only get that by circumscribing the functional intent of several key frameworks. The general ability (across all systems) to obtain the node info prior to launch isn't built into the code at the moment, but is planned for ORTE 2.0 (and was built in a separate prototype branch). For reasons totally beyond my control, the prototype ORTE 2.0 code has *not* been incorporated into Open MPI yet. Sorry...I like that no more than you...
:-/ I suppose it is our decision in the sense that we could decide not to provide the functionality and hope that it is implemented in OpenMPI sometime in the future. If we have to implement something ourselves to provide this functionality, then we may as well do it in a generic way that will work with any runtime system.

Greg
Re: [OMPI devel] Urgent: ORTE_RML_NAME_SEED removed from 1.2b3!
Yes, we need the hostfile information before job execution. We call setup_job() before a debug job to request a process allocation for the application being debugged. We use spawn() to launch a non-debug application. It sounds like I should just leave things the way they currently are.

I think we've had the discussion about bproc before, but the reason we still support 1.0.2 is that the registry *is* populated with node information prior to launch. This was an agreed-on feature that OpenMPI was to provide for PTP. I haven't been able to test 1.2 on a bproc machine (since I can't get it to work), but it sounds like the changes removed this functionality. Frankly, this makes OpenMPI less attractive to us, since we now have to go and get this information ourselves. My thinking now is that in the future we probably won't use OpenMPI for anything other than building and launching the application.

Greg

On Jan 29, 2007, at 6:57 PM, Ralph Castain wrote:

On further thought, perhaps I should be clearer. If you are saying that you need to read the hostfile to display the cluster *before* the user actually submits a job for execution, then fine - go ahead and call rds.query. What I'm trying to communicate to you is that you need to call setup_job when you are launching the resulting application. If you want, you could do the following:

1. call orte_rds.query(ORTE_JOBID_INVALID) to get your host info. Note that only a hostfile will be read here - so if you are in (for example) a bproc environment, you won't get any node info at this point.

2. when you are ready to launch the app, call orte_rmgr.spawn with an attribute list that contains ORTE_RMGR_SPAWN_FLOW with a value of ORTE_RMGR_SETUP | ORTE_RMGR_ALLOC | ORTE_RMGR_MAP | ORTE_RMGR_SETUP_TRIGS | ORTE_RMGR_LAUNCH. This will tell spawn to do everything *except* rds.query so you avoid re-entering the hostfile info.
Unfortunately, if you want to see node info prior to launch on anything other than a hostfile, we really don't have a way to do that right now. The ORTE 2.0 design allows for it, but we haven't implemented that yet - probably a few months away.

Hope that helps
Ralph

On 1/29/07 6:45 PM, "Ralph Castain" wrote:

On 1/29/07 5:57 PM, "Greg Watson" wrote:

Ralph,

On Jan 29, 2007, at 11:10 AM, Ralph H Castain wrote:

On 1/29/07 10:20 AM, "Greg Watson" wrote:

No, we have always called query() first, just after orte_init(). Since query() has never required a job id before, this used to work. I think the call was required to kick the SOH into action, but I'm not sure if it was needed for any other purpose.

Query has nothing to do with the SOH - the only time you would "need" it would be if you are reading a hostfile. Otherwise, it doesn't do anything at the moment. Not calling setup_job would be risky, in my opinion...

We've had this discussion before. We *need* to read the hostfile before calling setup_job() because we have to populate the registry with node information. If you're saying that this is now no longer possible, then I'd respectfully ask that this functionality be restored before you release 1.2. If there is some other way to achieve this, then please let me know. We've been doing this ever since 1.0 and in the alpha and beta versions of 1.2.

I think you don't understand what setup_job does. Setup_job has four arguments:

(a) an array of app_context objects that contain the application to be launched
(b) the number of elements in that array
(c) a pointer to a location where the jobid for this job is to be returned; and
(d) a list of attributes that allows the caller to "fine-tune" behavior

With that info, setup_job will:

(a) create a new jobid for your application; and
(b) store the app_context info in appropriate places in the registry

And that is *all* setup_job does - it simply gets a jobid and initializes some important info in the registry.
It never looks at node information, nor does it in any way impact node info.

Calling rds.query after rmgr.setup_job is how we always do it. In truth, the precise ordering of those two operations is immaterial as they have absolutely nothing in common. However, we always do it in the described order so that rds.query can have a valid jobid. As I said, at the moment rds.query doesn't actually use the jobid, though that will change at some point in the future.

Although it isn't *absolutely* necessary, I would still suggest that you call rmgr.setup_job before calling rds.query to ensure that any subsequent operations have all the info they require to function correctly. You can see the progression we use in orte/mca/rmgr/urm/rmgr_urm.c - I believe you will find it helpful to follow that logic. Alternatively, if you want, you can simply repeatedly call
Re: [OMPI devel] Urgent: ORTE_RML_NAME_SEED removed from 1.2b3!
Ralph,

On Jan 29, 2007, at 11:10 AM, Ralph H Castain wrote:

On 1/29/07 10:20 AM, "Greg Watson" wrote:

No, we have always called query() first, just after orte_init(). Since query() has never required a job id before, this used to work. I think the call was required to kick the SOH into action, but I'm not sure if it was needed for any other purpose.

Query has nothing to do with the SOH - the only time you would "need" it would be if you are reading a hostfile. Otherwise, it doesn't do anything at the moment. Not calling setup_job would be risky, in my opinion...

We've had this discussion before. We *need* to read the hostfile before calling setup_job() because we have to populate the registry with node information. If you're saying that this is now no longer possible, then I'd respectfully ask that this functionality be restored before you release 1.2. If there is some other way to achieve this, then please let me know. We've been doing this ever since 1.0 and in the alpha and beta versions of 1.2.

Are there likely to be further API changes before the release version? We are trying to release PTP, but I think this is impossible until your APIs stabilize.

None planned, other than what I mentioned above. If you want to support Open MPI 1.2, you may need a slight phase shift, though, so you can see the final release.

Please explain "phase shift".

Greg
Re: [OMPI devel] Urgent: ORTE_RML_NAME_SEED removed from 1.2b3!
On Jan 29, 2007, at 6:47 AM, Ralph H Castain wrote:

On 1/27/07 9:37 AM, "Greg Watson" wrote:

There are two more interfaces that have changed:

1. orte_rds.query() now takes a job id, whereas in 1.2b1 it didn't take any arguments. I seem to remember that I call this to kick orted into action, but I'm not sure of the implications of not calling it. In any case, I don't have a job id when I call it, so what do I pass to get the old behavior?

For now, you can just use ORTE_JOBID_INVALID (defined in orte/mca/ns/ns_types.h). However, your question raises a flag. You should be calling orte_rmgr.setup_job before you call the RDS, and that function returns the jobid for your job. Failing to call setup_job first may cause other parts of the code base to fail as they are expecting certain data to be set up in the registry by setup_job. If you do call setup_job first, then just pass the returned jobid along to rds.query.

No, we have always called query() first, just after orte_init(). Since query() has never required a job id before, this used to work. I think the call was required to kick the SOH into action, but I'm not sure if it was needed for any other purpose.

2. orte_pls.terminate_job() now takes a list of attributes in addition to a job id. What are the attributes for, and what happens if I pass a NULL here? Do I need to create an empty attribute list?

You can always pass a NULL to any function looking for attributes - the system knows how to handle that situation. What you should pass here depends upon what you are trying to do. If you just want to terminate a specific job, then you can just pass a NULL.
However, if you want to terminate the specified job AND any "children" that were dynamically spawned by that job, then you need to pass the ORTE_NS_INCLUDE_DESCENDANTS attribute - something like the following code snippet (pulled from orterun) would work:

#include "opal/class/opal_list.h"
#include "orte/mca/pls/pls.h"
#include "orte/mca/rmgr/rmgr.h"
#include "orte/mca/ns/ns_types.h"
#include "orte/runtime/params.h"

opal_list_t attrs;
opal_list_item_t *item;

OBJ_CONSTRUCT(&attrs, opal_list_t);
orte_rmgr.add_attribute(&attrs, ORTE_NS_INCLUDE_DESCENDANTS, ORTE_UNDEF,
                        NULL, ORTE_RMGR_ATTR_OVERRIDE);

ret = orte_pls.terminate_job(jobid, &orte_abort_timeout, &attrs);

while (NULL != (item = opal_list_remove_first(&attrs))) {
    OBJ_RELEASE(item);
}
OBJ_DESTRUCT(&attrs);

Please note that the orte_pls.terminate_job API in 1.2 will undergo a change in the next few days (it is already changed in the trunk). The change, included in the code snippet above, adds a timeout capability to have the function "give up" if the job doesn't terminate within the specified time. The parameter given above references the orte-wide default value (adjustable via MCA param), but you can give it anything you like - a NULL for the timeout param means don't time out, so we'll try until you order us to quit.

Is this going to be in "1.2b4", or some other version? The previous API changes mean that PTP will no longer work with pre-1.2b3 versions. It sounds like this is going to cause a similar issue. Are there likely to be further API changes before the release version? We are trying to release PTP, but I think this is impossible until your APIs stabilize.

What about orte_ns.free_name()?

Thanks,
Greg
Re: [OMPI devel] Urgent: ORTE_RML_NAME_SEED removed from 1.2b3!
There are two more interfaces that have changed:

1. orte_rds.query() now takes a job id, whereas in 1.2b1 it didn't take any arguments. I seem to remember that I call this to kick orted into action, but I'm not sure of the implications of not calling it. In any case, I don't have a job id when I call it, so what do I pass to get the old behavior?

2. orte_pls.terminate_job() now takes a list of attributes in addition to a job id. What are the attributes for, and what happens if I pass a NULL here? Do I need to create an empty attribute list?

Greg

On Jan 27, 2007, at 6:51 AM, Ralph Castain wrote:

On 1/26/07 11:36 PM, "Greg Watson" wrote:

I have been using this define to implement the orte_stage_gate_init() functionality in PTP using OpenMPI 1.2b1 for some months now. As of 1.2b3 it appears that this define has been removed. New users attempting to build PTP against the latest 1.2b3 build are complaining that they are getting build errors. Please let me know what has replaced this define in 1.2b3, and how we can obtain the same functionality that we had in 1.2b1.

You need to use ORTE_PROC_MY_HNP - no API change is involved, it is just a #define. You may need to add #include "orte/mca/ns/ns_types.h" to pick it up. You will also find that ORTE_RML_NAME_ANY is likewise gone - you need to use ORTE_NAME_WILDCARD instead, for the same reasons as described below. Similarly, ORTE_RML_NAME_SELF has been replaced by ORTE_PROC_MY_NAME.

We discovered during the testing/debugging of 1.2 that people had unintentionally created multiple definitions for several critical names in the system. Hence, we had an ORTE_RML_NAME_SEED, an ORTE_OOB_SEED, and several others. In the event that definition had to change, we found the code "cracking" all over the place - it was literally impossible to maintain. So we bit the bullet and cleaned it up. No API changes were involved, but we did remove duplicative defines (and their associated storage locations).
Hopefully, people will take the time to look up and use these system-level defines instead of re-creating the problem!

Also, I would like to know what the policy of changing interfaces is, and when in the release cycle you freeze API changes. It's going to be extremely difficult to release a version of PTP built against OpenMPI if you change interfaces between beta versions.

In my opinion, that is what "beta" is for - it isn't a "lock-down" release, but rather a time to find your cracks and fix them. That said, we don't change APIs for no reason, but only because we either (a) needed to do so to add some requested functionality (e.g., the recent request for "pernode" launch capabilities), or (b) found a bug in the system that required some change or added functionality to fix (e.g., the recent changes in the PLS behavior and API to support ctrl-c interrupt capabilities).

I generally try to send emails out alerting people to these changes when they occur (in fact, I'm pretty certain I sent one out on this issue). However, looking back, I find that I sent them to the OMPI "core developers" list - not the "developers" one. I note that the OMPI layer developers tend to do the same thing. I'll try to rectify that in the future and suggest my OMPI compatriots do so too.

Once an actual release is made, we only make an API change if a major bug is found and an API change simply must be done to fix it. I don't recall such an instance, though I think it may have happened once between minor release numbers in the 1.1 family (not sure).

Greg

___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] Urgent: ORTE_RML_NAME_SEED removed from 1.2b3!
I have been using this define to implement the orte_stage_gate_init() functionality in PTP using OpenMPI 1.2b1 for some months now. As of 1.2b3 it appears that this define has been removed. New users attempting to build PTP against the latest 1.2b3 build are complaining that they are getting build errors. Please let me know what has replaced this define in 1.2b3, and how we can obtain the same functionality that we had in 1.2b1.

Also, I would like to know what the policy of changing interfaces is, and when in the release cycle you freeze API changes. It's going to be extremely difficult to release a version of PTP built against OpenMPI if you change interfaces between beta versions.

Greg
Re: [OMPI devel] 1.2b3 fails on bluesteel
On Jan 22, 2007, at 9:48 AM, Ralph H Castain wrote:

On 1/22/07 9:39 AM, "Greg Watson" wrote:

I tried adding '-mca btl ^sm -mca mpi_preconnect_all 1' to the mpirun command line but it still fails with identical error messages.

I don't understand the issue with allocating nodes under bproc. Older versions of OMPI have always just queried bproc for the nodes that have permissions set so I can execute on them. I've never had to allocate any nodes using a hostfile or any other mechanism. Are you saying that this no longer works?

Turned out that mode of operation was a "bug" that caused all kinds of problems in production environments - that's been fixed for quite some time. So, yes - you do have to get an official "allocation" of some kind. Even the changes I mentioned wouldn't remove that requirement in the way you describe.

BTW, there's no requirement for a bproc system to employ a job scheduler. So in my view OMPI is "broken" for bproc systems if it imposes such a requirement.

Greg
Re: [OMPI devel] 1.2b3 fails on bluesteel
On Jan 22, 2007, at 9:48 AM, Ralph H Castain wrote:

On 1/22/07 9:39 AM, "Greg Watson" wrote:

I tried adding '-mca btl ^sm -mca mpi_preconnect_all 1' to the mpirun command line but it still fails with identical error messages.

I don't understand the issue with allocating nodes under bproc. Older versions of OMPI have always just queried bproc for the nodes that have permissions set so I can execute on them. I've never had to allocate any nodes using a hostfile or any other mechanism. Are you saying that this no longer works?

Turned out that mode of operation was a "bug" that caused all kinds of problems in production environments - that's been fixed for quite some time. So, yes - you do have to get an official "allocation" of some kind. Even the changes I mentioned wouldn't remove that requirement in the way you describe.

If you must do this kind of allocation, can't you just use a NODES environment variable, or something just as simple?

Greg