Hi,

I was just reviewing my files in order to send them to Jeff, and I fixed the problem! In the 'open' function of the component file I should have written:

    mca_base_param_string_name("rds_hostfile", "path" . . . );

instead of:

    mca_base_param_string("rds_hostfile", "path" . . .);

But I don't understand how it compiled. There is no function mca_base_param_string that takes a string as its first parameter (I know it doesn't compile in the module file). I compile using 'make all install' in the openmpi directory.

Thanks
--David

On Mon, 29 Oct 2007, Jeff Squyres wrote:

> Sorry guys, I did miss this earlier.
>
> I don't see a patch anywhere in the e-mail thread below -- can
> someone send me the problematic code in question?
>
> FWIW: The MCA param space is global, so there's no reason that a
> new/different RDS shouldn't be able to read the hostfile MCA parameter.
>
> On Oct 28, 2007, at 2:09 PM, Ralph Castain wrote:
>
> > Yo Jeff
> >
> > This may have slipped through your inbox (had OMPI devel in subject, so may
> > have been caught in some filter) - could you please provide any thoughts on
> > why the hostfile isn't getting picked up correctly? As I indicated on the
> > prior note, I verified that it is working for the default hostfile component
> > - I can't see anything wrong in David's call to cause the problem. Please
> > refer to the prior note for that code.
> >
> > Thanks
> > Ralph
> >
> > On 10/28/07 10:31 AM, "David Erukhimovich" <davider...@cs.huji.ac.il> wrote:
> >
> >> Thank you very much for the patch, it helped me a lot (It works!) and
> >> I really appreciate this.
> >>
> >> p.s. Any idea about the rds thing?
> >>
> >> Regards
> >> --David
> >>
> >> Ralph H Castain wrote:
> >>> Hi David
> >>>
> >>> Here is the promised patch - it passes params just fine, but I cannot vouch
> >>> for any unintended consequences. I -think- it will be fine, but it lacks all
> >>> the usual testing for a patch to an official release.
> >>>
> >>> Hope it helps
> >>> Ralph
> >>>
> >>> On 10/20/07 10:10 AM, "David Erukhimovich" <davider...@cs.huji.ac.il> wrote:
> >>>
> >>>> Hi Ralph,
> >>>>
> >>>> 2. I do want the user to be able to switch between my way of process
> >>>> launching and the default way. I can do it using an mca flag, but I would
> >>>> prefer a new component. If it is not too difficult for you, please make the
> >>>> patch; if it is, I'll just use an mca flag.
> >>>>
> >>>> 1. Just remembered another difficulty I had: I've created a new rds
> >>>> component identical to the hostfile one. Let's call it mosix. Now, orterun
> >>>> is saving the hostfile path in the mca parameter - rds_hostfile_path or
> >>>> something like that. When I try to retrieve rds_hostfile_path or
> >>>> rds_mosix_path in the rds_mosix component I always get the default
> >>>> hostfile path (doesn't matter if I gave a hostfile or not). And I tried
> >>>> everything - changing names in rds_mosix_component, declaring a new
> >>>> parameter rds_mosix_path in various places etc. So now I'm just altering
> >>>> the existing hostfile component.
> >>>> Do you have any suggestions how to make it work?
> >>>>
> >>>> Sorry for all the questions and thank you very much for the quick answers
> >>>>
> >>>> Regards
> >>>> --David
> >>>>
> >>>> ---------- Forwarded message ----------
> >>>> From: Ralph Castain <r...@lanl.gov>
> >>>> Date: Oct 20, 2007 5:12 PM
> >>>> Subject: Re: [OMPI devel] Trying to get total procs num in odls framework
> >>>> To: David Erukhimovich <davider...@cs.huji.ac.il>
> >>>>
> >>>> Hi David
> >>>>
> >>>> Thanks for the info - see comments below.
> >>>>
> >>>> Ralph
> >>>>
> >>>> On 10/20/07 6:58 AM, "David Erukhimovich" <davider...@cs.huji.ac.il> wrote:
> >>>>
> >>>>> Hi
> >>>>> Thank you for your answer.
> >>>>>
> >>>>> First of all, my two questions weren't connected and they belong to
> >>>>> different parts of my project, and the subject of the mail should have
> >>>>> been: Trying to get total procs num in rds framework (sorry, my mistake).
> >>>>>
> >>>>> Here are the parts in the order of the last email
> >>>>>
> >>>>> 1. I've solved the problem about getting total num of procs in rds (just
> >>>>> called some function incorrectly), so sorry for disturbing you about that.
> >>>>> Now a bit more about what I'm trying to do, maybe there is a better way
> >>>>> than mine:
> >>>>> I have a tool (external application) that, given a list of machines and a
> >>>>> number n, chooses the n best ones from the list (least loaded ones), and
> >>>>> if the list of machines isn't given, it just returns the n best machines
> >>>>> from the cluster. I wish to include this in ompi. Hence - given a
> >>>>> machinefile, it'll run the process only on the best nodes. If a
> >>>>> machinefile isn't given, it'll take the best node that my application
> >>>>> returns.
> >>>>> I think the best place to implement it is in rds - after building the list
> >>>>> of newly discovered nodes: if it is empty, fill it using my tool, otherwise
> >>>>> filter it using my tool. It seems to me the most logical way to do it. Am I
> >>>>> right? I am asking you because I guess you have a better knowledge of the
> >>>>> ompi architecture.
> >>>> It sounds like the correct place to me. At some point in the future, you
> >>>> could migrate that logic to the RAS instead, but I would just continue as
> >>>> you are doing for now.
> >>>>
> >>>>> 2. The other thing I am trying to do is to make ompi run every process,
> >>>>> not directly, but through an external program. E.g.: if I want to launch
> >>>>> the program "hostname", I want the following to be launched:
> >>>>> "<my-program> <my-program's-flags> hostname".
> >>>>> I figured that the best way to do it is in the odls framework because
> >>>>> there I have the exact executing point.
> >>>> I guess I wouldn't do it that way if I were doing a project of my own. I
> >>>> would just go into the default odls module and hardcode the revised launch.
> >>>> I can't see this coming back into the production system, so unless you have
> >>>> some reason to want to run both with and without your revision, why go
> >>>> through the pain?
> >>>>
> >>>>> I am currently working on the checkpoint 1.2.3. I don't work on the trunk
> >>>>> because I need the patches to be added on some stable release. Is there a
> >>>>> 1.2.* release where the bug is fixed? And if not - when can such a fixed
> >>>>> version be stable?
> >>>> I don't think there are any plans to backport that fix, though I imagine it
> >>>> could be done. If not, I could try and create a patch for you next week,
> >>>> though I would again suggest you just hardcode your change into the existing
> >>>> odls default component to make your life easier.
> >>>>
> >>>> Ralph
> >>>>
> >>>>> Thank you
> >>>>> --David
> >>>>>
> >>>>> ---------- Forwarded message ----------
> >>>>> From: Ralph Castain <r...@lanl.gov>
> >>>>> Date: Oct 17, 2007 11:22 PM
> >>>>> Subject: Re: [OMPI devel] Trying to get total procs num in odls framework
> >>>>> To: davider...@cs.huji.ac.il
> >>>>> Cc: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> >>>>>
> >>>>> Hi David
> >>>>>
> >>>>> I could probably answer your questions better if I had a better
> >>>>> understanding of what you are trying to do. For example, looking in the
> >>>>> hostfile rds for the number of procs to be launched seems strange as the
> >>>>> functional role of the framework is to simply learn what nodes are
> >>>>> available.
> >>>>>
> >>>>> It would also help to have some idea of what environment you are working
> >>>>> in, and how you configured the beast.
> >>>>>
> >>>>> Please see comments below.
> >>>>> Ralph
> >>>>>
> >>>>> On 10/17/07 2:47 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
> >>>>>
> >>>>>> Yo Ralph --
> >>>>>>
> >>>>>> Can you answer these questions?
> >>>>>>
> >>>>>> Begin forwarded message:
> >>>>>>
> >>>>>>> From: David Erukhimovich <davider...@cs.huji.ac.il>
> >>>>>>> Date: October 14, 2007 5:08:45 PM EDT
> >>>>>>> To: de...@open-mpi.org
> >>>>>>> Subject: [OMPI devel] Trying to get total procs num in odls framework
> >>>>>>> Reply-To: Open MPI Developers <de...@open-mpi.org>
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>> I have 2 questions:
> >>>>>>> 1. I am trying to get the total number of requested processes for
> >>>>>>> the job in the 'hostfile' component in rds. I took the job object that
> >>>>>>> was given as a parameter, extracted the application objects and checked
> >>>>>>> how many procs each application has. The result in every run was 0. As I
> >>>>>>> understand, this variable is updated before the rds part. So what am I
> >>>>>>> doing wrong?
> >>>>> Do you mean you took the jobid given to the hostfile RDS (which isn't an
> >>>>> object, but just a number) and did an orte_rmgr.get_app_context to get the
> >>>>> array of app_contexts? Is there some reason why you would want to do that
> >>>>> there?
> >>>>>
> >>>>> Depending upon what the command line looks like, it is possible for the
> >>>>> number of procs to be zero - we allow that option and then fill in the
> >>>>> number later. If it was specified, though, we do insert the number in the
> >>>>> app_context object.
> >>>>>
> >>>>> Maybe you could tell me what the command line looks like, the function call
> >>>>> you used to get the "application objects", and what field you were looking
> >>>>> at when you found zero?
> >>>>>
> >>>>>>> 2. I've discovered an undocumented framework - odls.
> >>>>> It wasn't exactly hidden...we haven't documented it because we are lazy and
> >>>>> the existing components cover every known environment (or so we thought).
> >>>>> ;-)
> >>>>>
> >>>>> Is there some special reason to want to create another one?
> >>>>>
> >>>>>>> I've created a new component for it. The problem is that there is no way
> >>>>>>> to switch between the default component and mine (--mca odls <my
> >>>>>>> component> doesn't work).
> >>>>>>> Is there a way to switch between odls components (I saw bprocs there and
> >>>>>>> I guess it is used)?
> >>>>> Are you working on the trunk? What r level?
> >>>>>
> >>>>> Reason I ask: I recently fixed a problem where the command line mca params
> >>>>> were not getting passed to the orteds. Your description looks like you
> >>>>> haven't picked up that change. If you have updated recently, and you still
> >>>>> can't get it to work, then we likely have a lingering problem.
> >>>>>
> >>>>> If I read your subject line correctly, then I am somewhat puzzled. You can
> >>>>> look at the orte/mca/odls/base/odls_base_default_fns.c file, the
> >>>>> orte_odls_base_default_get_add_procs_data function and see where we get the
> >>>>> total number of procs in a job and how that is passed to the orteds. If you
> >>>>> have some new environment that the existing odls components can't handle,
> >>>>> then I would strongly suggest you at least use the default functions in the
> >>>>> base to provide as much support as possible as this will help you to keep
> >>>>> pace with changes in the system.
> >>>>>
> >>>>> I would also welcome feedback on what you encountered that required a new
> >>>>> odls component - perhaps we can modify the base support functions to make it
> >>>>> fit within one of the existing components.
> >>>>>
> >>>>> Thanks
> >>>>> Ralph
> >>>>>
> >>>>>>> Thank you,
> >>>>>>> --David
> >>>>>>> _______________________________________________
> >>>>>>> devel mailing list
> >>>>>>> de...@open-mpi.org
> >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Jeff Squyres
> Cisco Systems
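
For readers following the thread: Jeff's remark that the MCA param space is global, and the by-name registration David ended up using, can be pictured with a toy table keyed by the full parameter name. The sketch below is not Open MPI code. Every identifier in it (toy_set, toy_reg_string_name, toy_lookup_string, and the example paths) is made up for illustration, and the real mca_base_param functions should be checked in the Open MPI source before relying on any of this.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for a global parameter table. All names are hypothetical;
 * this is NOT the Open MPI MCA API. */
#define TOY_MAX_PARAMS 16
#define TOY_MAX_LEN    128

struct toy_param {
    char name[TOY_MAX_LEN];   /* full name, e.g. "rds_hostfile_path" */
    char value[TOY_MAX_LEN];  /* current string value */
};

static struct toy_param toy_table[TOY_MAX_PARAMS];
static int toy_count = 0;

/* Return the index of full_name in the table, or -1 if it is absent. */
int toy_find(const char *full_name)
{
    assert(full_name != NULL);
    for (int i = 0; i < toy_count; i++) {
        if (strcmp(toy_table[i].name, full_name) == 0) {
            return i;
        }
    }
    return -1;
}

/* Store a value under a full name, creating the entry if needed (roughly
 * what a launcher might do after parsing its command line).
 * Returns the entry's index, or -1 if the table is full. */
int toy_set(const char *full_name, const char *value)
{
    assert(value != NULL);
    int idx = toy_find(full_name);
    if (idx < 0) {
        if (toy_count == TOY_MAX_PARAMS) {
            return -1;
        }
        idx = toy_count++;
        snprintf(toy_table[idx].name, TOY_MAX_LEN, "%s", full_name);
    }
    snprintf(toy_table[idx].value, TOY_MAX_LEN, "%s", value);
    return idx;
}

/* Register "<type>_<param>" by name. If some other part of the program
 * already stored a value under that full name, it wins over the default. */
int toy_reg_string_name(const char *type, const char *param,
                        const char *default_value)
{
    char full_name[TOY_MAX_LEN];
    snprintf(full_name, sizeof full_name, "%s_%s", type, param);
    int idx = toy_find(full_name);
    return (idx >= 0) ? idx : toy_set(full_name, default_value);
}

/* Return the stored string for a valid index, or NULL otherwise. */
const char *toy_lookup_string(int idx)
{
    return (idx >= 0 && idx < toy_count) ? toy_table[idx].value : NULL;
}
```

In this toy, a launcher-style caller would first run toy_set("rds_hostfile_path", "/tmp/myhosts"). Any component that then calls toy_reg_string_name("rds_hostfile", "path", "/etc/default-hosts") gets back an index whose value is the launcher's path, because the table is shared and keyed only by the full name. Registering ("rds_mosix", "path", ...) instead yields the default, since nothing ever stored a value under that name.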