I just noticed that the first execution of my program failed due to a missing library. This is a bit unexpected because LD_LIBRARY_PATH is set in my .bashrc. Does OpenMOLE/GridScale pass my local LD_LIBRARY_PATH on to the compute node, or do I have to do something for that to happen?
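If it is not forwarded, I assume the workaround is to set it explicitly when the task launches the binary. A minimal sketch, assuming the program is started from a ScalaTask via scala.sys.process; the library and binary paths below are placeholders, not my actual setup:

    // Sketch only: run the external program with LD_LIBRARY_PATH set
    // explicitly instead of relying on .bashrc being sourced on the node.
    // Both paths below are placeholders.
    import scala.sys.process._

    val libDir   = "/path/to/required/libs"           // placeholder
    val command  = Seq("/path/to/myprogram", "--arg")  // placeholder
    val exitCode = Process(command, None, "LD_LIBRARY_PATH" -> libDir).!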
Regardless of this program execution failure, I am not sure why the following error occurred on the second and third runs. It may be related to the "workDirectory" setting, though, because once I set it I no longer get this error:

Caused by: java.io.FileNotFoundException: /homes/as12312/.openmole/.tmp/ssh/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430578796223/95c2c56e-8503-48d5-a3a2-72dce3715e09/fbcd4ea6-15b8-4085-90cb-b796f31f39e0/.tmp/a637ef86-c722-4e9f-b23b-a1450c611c5d/filec91d4680-b643-466b-928d-4a0cdd0a9ca1.bin (No such file or directory)

After setting the workDirectory to one inside my "Workspace" directory, I see now that OpenMOLE is indeed using symbolic links for the replicas:

lrwxrwxrwx 1 as12312 vip 76 May 2 16:32 1430580762764_4776f49e-61ae-45b3-8513-b6e03aa69956.rep -> /vol/medic01/users/as12312/Code/REPEAT/target/scala-2.11/repeat_2.11-0.1.jar
lrwxrwxrwx 1 as12312 vip 82 May 2 16:32 1430580762790_d941fe37-c611-497e-807f-d970c86df795.rep -> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1000_3.nii.gz
lrwxrwxrwx 1 as12312 vip 82 May 2 16:32 1430580762814_e3eab25f-9707-4552-8afa-5e0da8d8eb16.rep -> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1001_3.nii.gz
lrwxrwxrwx 1 as12312 vip 82 May 2 16:32 1430580762839_87b59538-1889-49ed-925e-b64147a0813e.rep -> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1002_3.nii.gz
lrwxrwxrwx 1 as12312 vip 68 May 2 16:32 1430580762864_3bd2a9dd-e91a-482f-bff6-58ed713dc924.rep -> /vol/medic01/users/as12312/Data/Registrations/Template/mni305.nii.gz
lrwxrwxrwx 1 as12312 vip 134 May 2 16:32 1430580762890_ce37b658-81f0-4a9d-ba2e-069dbbf8ddf7.rep -> /homes/as12312/.openmole/merapi.doc.ic.ac.uk/.tmp/40ba4e65-2df2-4c14-879f-d57d99ff7b0e/archive4fc50dbc-acbe-40cb-aa7d-281f8162d947.tar

What I don't understand, however, is why the canonical paths printed within my ScalaTask for the input files and the resource directory ("rootfs") in the task workDir look as follows:

-----------------Output on remote host-----------------
total 9
lrwxrwxrwx 1 as12312 vip 280 May 2 16:40 1000_3.nii.gz -> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/file1ee53920-fe2e-415d-a9d5-ff839290265a.bin
lrwxrwxrwx 1 as12312 vip 280 May 2 16:40 mni305.nii.gz -> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/file23a8dc3e-88a5-4a29-80de-f561c19a3d07.bin
lrwxrwxrwx 1 as12312 vip 282 May 2 16:40 rootfs -> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/dirReplicae1964a36-ea8e-49a4-b301-d0caf6b439b4
Canonical path of refIm: /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/mni305.nii.gz
Canonical path of srcIm: /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/1000_3.nii.gz
Canonical path of rootfs: /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/rootfs

I would have expected these paths to refer to my input files, and the canonical path of "rootfs" to be "/vol/medic01/users/as12312/Data/Registrations/Workspace/rootfs".
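As far as I understand, getCanonicalPath follows the whole chain of links, so if the file in the workDir were merely a link to a link to my original input, the canonical path should still end up at the original. Roughly the kind of check I added to the ScalaTask, as a sketch assuming plain JDK calls; the file name is simply the one from the listing above:

    // Sketch only: compare the immediate symlink target with the fully
    // resolved canonical path of an input file in the task workDir.
    import java.nio.file.{Files, Paths}

    val p = Paths.get("1000_3.nii.gz")
    if (Files.isSymbolicLink(p))
      println("link target:    " + Files.readSymbolicLink(p))  // immediate hop only
    println("canonical path: " + p.toFile.getCanonicalPath)    // resolves the full chain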
> On 2 May 2015, at 16:12, Andreas Schuh <[email protected]> wrote:
>
> Hi Romain,
>
> thanks very much for realising this so quickly.
>
> Find attached the log output of 3 runs. In the task output found in the first log file you can see the output of "ls -l $workDir". It seems there are still replicas created in the OpenMOLE tmp directory. Are these symbolic links? Because what I need, for my workflow not to require a file copy, is that the links in the task workDir ultimately point to my input files/directories.
>
> The following two task executions failed without any task output. I had added code to print the canonical path of the files in the workDir to see for myself whether these are my actual input files in "/homes/as12312/Data/Registrations/Workspace/rootfs". Not sure why these failed.
>
> Andreas
>
> <openmole-storageSharedLocally-1.log><openmole-storageSharedLocally-2.log><openmole-storageSharedLocally-3.log>
>
>> On 2 May 2015, at 15:09, Romain Reuillon <[email protected]> wrote:
>>
>> Hi Andreas,
>>
>> I just pushed a first implementation of the optimisation for cluster environments in the case of shared storage with the submission node. To enable it you should add storageSharedLocally = true in your environment constructor. You should kill the dbServer when you update to this version (so it can reinitialize the db), since some files were compressed and are not anymore.
>>
>> There is still room for optimisation, especially concerning the output files and the directories (in input and output), which are still subject to several transformations that might be bypassed in the case of a shared storage.
>>
>> I tried it on my local machine with the SshEnvironment and it's functional. Could you test it on your environments?
>>
>> cheers,
>> Romain
>>
>> On 01/05/2015 19:09, Andreas Schuh wrote:
>>>> On 1 May 2015, at 18:00, Romain Reuillon <[email protected]> wrote:
>>>>
>>>> The default is in home, but you can configure where the jobs should work as an option of the environment. In the present implementation it has to be a shared storage, but I guess that $WORK is one.
>>> Yes, it is. Only the TMPDIR is local to each compute node and not shared.
>>>
>>>> On 01/05/2015 18:55, Andreas Schuh wrote:
>>>>> FYI I just refreshed my memory of our college HPC cluster (it's actually using PBS, not SGE as mentioned before).
>>>>>
>>>>> From their intro document, the following information may be useful while revising the OpenMOLE storage handling:
>>>>>
>>>>> On the HPC system, there are two file stores available to the user: HOME and WORK.
>>>>>
>>>>> HOME has a relatively small quota of 10GB and is intended for storing binaries, source and modest amounts of data. It should not be written to directly by jobs.
>>>>>
>>>>> WORK is a larger area which is intended for staging files between jobs and for long-term data.
>>>>>
>>>>> These areas should be referred to using the environment variables $HOME and $WORK, as their absolute locations are subject to change.
>>>>>
>>>>> Additionally, there is $TMPDIR: jobs requiring scratch space at run time should write to $TMPDIR.
>>>>>
>>>>>> On 1 May 2015, at 11:57, Andreas Schuh <[email protected]> wrote:
>>>>>>
>>>>>>> On 1 May 2015, at 11:49, Romain Reuillon <[email protected]> wrote:
>>>>>>>
>>>>>>>> That would be great, as I was hoping to finally be able to run my tasks and get actual results… it's been 1 month now developing the OpenMOLE workflow :(
>>>>>>>>
>>>>>>>> I'll be happy to test it in our environment. I have access to our lab's dedicated SLURM cluster and the department's HTCondor setup. I could also try it on our college HPC, which uses SGE and shared storage.
>>>>>>>>
>>>>>>>> I also agree that these options should be part of the environment specification.
>>>>>>>>
>>>>>>> Great!
>>>>>>>>> I basically agree with you for the files in ~/.openmole: files are transferred to the node through the shared FS, so they have to be copied there. What could be optimised is the temporary execution directory for tasks. It is also created in this folder, and therefore on the shared FS, which is not actually required. This workdir could optionally be relocated somewhere using an environment parameter.
>>>>>>>>>
>>>>>>>> Not sure if I follow this solution outline, but I'm sure you have a better idea of how things are working right now and need to be modified. Why do files have to be copied to ~/.openmole when the original input files to the workflow (exploration SelectFileDomain) are already located on a shared FS?
>>>>>>>>
>>>>>>>> That the location of the local and remote temporary directories can be configured via an environment variable would solve the second issue of where temporary files such as wrapper scripts and remote resources are located.
>>>>>>>>
>>>>>>>> The first issue is how to deal with input and output files of tasks which are already located on a shared FS and thus should not require a copy to the temporary directories.
>>>>>>> OpenMOLE environments work by copying files to storages. In the general case the storage is not shared between the submission machine and the execution machines. In the case of a cluster, OpenMOLE copies everything onto the shared FS using an ssh transfer to the master node (the entry point of the cluster), so it is accessible to all the computing nodes. In the particular case where the submission machine shares its FS with the computing nodes, I intend to substitute copy operations with symlink creations, so that this particular case is handled by the generic submission code of OpenMOLE.
>>>>>> Ok, got it, and sounds like a good solution.
>>>>>>
>>>>>> So the optional symbolic links (the "link" option of "addInputFile" and "addResource") from the temporary directory/workingDir of each individual task point to the storage on the master node of the execution machines. That is why I currently encounter an unexpected copy of my files. When the storage used by the execution machines themselves, however, uses symbolic links to the storage of the submission machine (as all machines share the same FS), no files are actually copied.
>>>>>>
>>>>>> What would have happened if I had executed the OpenMOLE console on the master node of the environment? Would OpenMOLE then already know that the submission machine and the execution machine are actually identical and thus inherently share the same storage?
>>>>>>
>>>>
>>
>
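PS: If I understand the planned substitution correctly, for the shared-FS case the stage-in of an input file boils down to creating a link instead of a copy. A minimal sketch of my understanding, using plain JDK calls; this is not actual OpenMOLE/GridScale code, and the helper name stageIn and its arguments are my own invention, only the storageSharedLocally option name comes from this thread:

    // Sketch only: with a shared FS, the "upload" of an input file to the
    // storage can become a symbolic link to the original instead of a copy.
    import java.nio.file.{Files, Path, StandardCopyOption}

    def stageIn(source: Path, replica: Path, storageSharedLocally: Boolean): Path =
      if (storageSharedLocally) Files.createSymbolicLink(replica, source)
      else Files.copy(source, replica, StandardCopyOption.REPLACE_EXISTING)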
_______________________________________________
OpenMOLE-users mailing list
[email protected]
http://fedex.iscpif.fr/mailman/listinfo/openmole-users
