I just noticed that the first execution of my program failed due to a missing library. This is a bit unexpected because LD_LIBRARY_PATH is set in my .bashrc. Does OpenMOLE/GridScale pass my local LD_LIBRARY_PATH on to the compute node, or do I have to do something for that to happen?
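If it is not forwarded, I assume the workaround is to set it explicitly when the task launches the binary. A minimal sketch, assuming the program is started from a ScalaTask via scala.sys.process; the library and binary paths below are placeholders, not my actual setup:

    // Sketch only: run the external program with LD_LIBRARY_PATH set
    // explicitly instead of relying on .bashrc being sourced on the node.
    // Both paths below are placeholders.
    import scala.sys.process._

    val libDir   = "/path/to/required/libs"           // placeholder
    val command  = Seq("/path/to/myprogram", "--arg")  // placeholder
    val exitCode = Process(command, None, "LD_LIBRARY_PATH" -> libDir).!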
Regardless of this program execution failure, I am not sure why the following error occurred on the second and third runs. It may be related to the "workDirectory" setting, though, because once I set it I no longer get this error:

Caused by: java.io.FileNotFoundException: /homes/as12312/.openmole/.tmp/ssh/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430578796223/95c2c56e-8503-48d5-a3a2-72dce3715e09/fbcd4ea6-15b8-4085-90cb-b796f31f39e0/.tmp/a637ef86-c722-4e9f-b23b-a1450c611c5d/filec91d4680-b643-466b-928d-4a0cdd0a9ca1.bin (No such file or directory)

After setting the workDirectory to one inside my "Workspace" directory, I see now that OpenMOLE is indeed using symbolic links for the replicas:

lrwxrwxrwx 1 as12312 vip 76 May 2 16:32 1430580762764_4776f49e-61ae-45b3-8513-b6e03aa69956.rep -> /vol/medic01/users/as12312/Code/REPEAT/target/scala-2.11/repeat_2.11-0.1.jar
lrwxrwxrwx 1 as12312 vip 82 May 2 16:32 1430580762790_d941fe37-c611-497e-807f-d970c86df795.rep -> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1000_3.nii.gz
lrwxrwxrwx 1 as12312 vip 82 May 2 16:32 1430580762814_e3eab25f-9707-4552-8afa-5e0da8d8eb16.rep -> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1001_3.nii.gz
lrwxrwxrwx 1 as12312 vip 82 May 2 16:32 1430580762839_87b59538-1889-49ed-925e-b64147a0813e.rep -> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1002_3.nii.gz
lrwxrwxrwx 1 as12312 vip 68 May 2 16:32 1430580762864_3bd2a9dd-e91a-482f-bff6-58ed713dc924.rep -> /vol/medic01/users/as12312/Data/Registrations/Template/mni305.nii.gz
lrwxrwxrwx 1 as12312 vip 134 May 2 16:32 1430580762890_ce37b658-81f0-4a9d-ba2e-069dbbf8ddf7.rep -> /homes/as12312/.openmole/merapi.doc.ic.ac.uk/.tmp/40ba4e65-2df2-4c14-879f-d57d99ff7b0e/archive4fc50dbc-acbe-40cb-aa7d-281f8162d947.tar

What I don't understand, however, is why the canonical paths printed within my ScalaTask for the input files and the resource directory ("rootfs") in the task workDir look as follows:

-----------------Output on remote host-----------------
total 9
lrwxrwxrwx 1 as12312 vip 280 May 2 16:40 1000_3.nii.gz -> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/file1ee53920-fe2e-415d-a9d5-ff839290265a.bin
lrwxrwxrwx 1 as12312 vip 280 May 2 16:40 mni305.nii.gz -> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/file23a8dc3e-88a5-4a29-80de-f561c19a3d07.bin
lrwxrwxrwx 1 as12312 vip 282 May 2 16:40 rootfs -> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/dirReplicae1964a36-ea8e-49a4-b301-d0caf6b439b4
Canonical path of refIm: /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/mni305.nii.gz
Canonical path of srcIm: /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/1000_3.nii.gz
Canonical path of rootfs: /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/rootfs

I would have expected these paths to refer to my input files, and the canonical path of "rootfs" to be "/vol/medic01/users/as12312/Data/Registrations/Workspace/rootfs".
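As far as I understand, getCanonicalPath follows the whole chain of links, so if the file in the workDir were merely a link to a link to my original input, the canonical path should still end up at the original. Roughly the kind of check I added to the ScalaTask, as a sketch assuming plain JDK calls; the file name is simply the one from the listing above:

    // Sketch only: compare the immediate symlink target with the fully
    // resolved canonical path of an input file in the task workDir.
    import java.nio.file.{Files, Paths}

    val p = Paths.get("1000_3.nii.gz")
    if (Files.isSymbolicLink(p))
      println("link target:    " + Files.readSymbolicLink(p))  // immediate hop only
    println("canonical path: " + p.toFile.getCanonicalPath)    // resolves the full chain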
> On 2 May 2015, at 16:12, Andreas Schuh <[email protected]> wrote:
>
> Hi Romain,
>
> thanks very much for realising this so quickly.
>
> Find attached the log output of 3 runs. In the task output found in the first log file you can see the output of "ls -l $workDir". It seems there are still replicas created in the OpenMOLE tmp directory. Are these symbolic links? Because what I need, for my workflow not to require a file copy, is that the links in the task workDir ultimately point to my input files/directories.
>
> The following two task executions failed without any task output. I had added code to print the canonical path of the files in the workDir to see for myself whether these are my actual input files in "/homes/as12312/Data/Registrations/Workspace/rootfs". Not sure why these failed.
>
> Andreas
>
> <openmole-storageSharedLocally-1.log><openmole-storageSharedLocally-2.log><openmole-storageSharedLocally-3.log>
>
>> On 2 May 2015, at 15:09, Romain Reuillon <[email protected]> wrote:
>>
>> Hi Andreas,
>>
>> I just pushed a first implementation of the optimisation for cluster environments in the case of shared storage with the submission node. To enable it you should add storageSharedLocally = true in your environment constructor. You should kill the dbServer when you update to this version (so it can reinitialize the db), since some files were compressed and are not anymore.
>>
>> There is still room for optimisation, especially concerning the output files and the directories (in input and output), which are still subject to several transformations that might be bypassed in the case of a shared storage.
>>
>> I tried it on my local machine with the SshEnvironment and it's functional. Could you test it on your environments?
>>
>> cheers,
>> Romain
>>
>> On 01/05/2015 19:09, Andreas Schuh wrote:
>>>> On 1 May 2015, at 18:00, Romain Reuillon <[email protected]> wrote:
>>>>
>>>> The default is in home, but you can configure where the jobs should work as an option of the environment. In the present implementation it has to be a shared storage, but I guess that $WORK is one.
>>> Yes, it is. Only the TMPDIR is local to each compute node and not shared.
>>>
>>>> On 01/05/2015 18:55, Andreas Schuh wrote:
>>>>> FYI I just refreshed my memory of our college HPC cluster (it's actually using PBS, not SGE as mentioned before).
>>>>>
>>>>> From their intro document, the following information may be useful while revising the OpenMOLE storage handling:
>>>>>
>>>>> On the HPC system, there are two file stores available to the user: HOME and WORK.
>>>>>
>>>>> HOME has a relatively small quota of 10GB and is intended for storing binaries, source and modest amounts of data. It should not be written to directly by jobs.
>>>>>
>>>>> WORK is a larger area which is intended for staging files between jobs and for long-term data.
>>>>>
>>>>> These areas should be referred to using the environment variables $HOME and $WORK, as their absolute locations are subject to change.
>>>>>
>>>>> Additionally, there is $TMPDIR: jobs requiring scratch space at run time should write to $TMPDIR.
>>>>>
>>>>>> On 1 May 2015, at 11:57, Andreas Schuh <[email protected]> wrote:
>>>>>>
>>>>>>> On 1 May 2015, at 11:49, Romain Reuillon <[email protected]> wrote:
>>>>>>>
>>>>>>>> That would be great, as I was hoping to finally be able to run my tasks and get actual results… it's been 1 month now developing the OpenMOLE workflow :(
>>>>>>>>
>>>>>>>> I'll be happy to test it in our environment. I have access to our lab's dedicated SLURM cluster and the department's HTCondor setup. I could also try it on our college HPC, which uses SGE and shared storage.
>>>>>>>>
>>>>>>>> I also agree that these options should be part of the environment specification.
>>>>>>>>
>>>>>>> Great!
>>>>>>>>> I basically agree with you for the files in ~/.openmole: files are transferred to the node through the shared FS, so they have to be copied there. What could be optimised is the temporary execution directory for tasks. It is also created in this folder, and therefore on the shared FS, which is not actually required. This workdir could optionally be relocated somewhere using an environment parameter.
>>>>>>>>>
>>>>>>>> Not sure if I follow this solution outline, but I'm sure you have a better idea of how things are working right now and need to be modified. Why do files have to be copied to ~/.openmole when the original input files to the workflow (exploration SelectFileDomain) are already located on a shared FS?
>>>>>>>>
>>>>>>>> That the location of the local and remote temporary directories can be configured via an environment variable would solve the second issue of where temporary files such as wrapper scripts and remote resources are located.
>>>>>>>>
>>>>>>>> The first issue is how to deal with input and output files of tasks which are already located on a shared FS and thus should not require a copy to the temporary directories.
>>>>>>> OpenMOLE environments work by copying files to storages. In the general case the storage is not shared between the submission machine and the execution machines. In the case of a cluster, OpenMOLE copies everything onto the shared FS using an ssh transfer to the master node (the entry point of the cluster), so it is accessible to all the computing nodes. In the particular case where the submission machine shares its FS with the computing nodes, I intend to substitute copy operations with symlink creations, so that this particular case is handled by the generic submission code of OpenMOLE.
>>>>>> Ok, got it, and sounds like a good solution.
>>>>>>
>>>>>> So the optional symbolic links (the "link" option of "addInputFile" and "addResource") from the temporary directory/workingDir of each individual task point to the storage on the master node of the execution machines. That is why I currently encounter an unexpected copy of my files. When the storage used by the execution machines themselves, however, uses symbolic links to the storage of the submission machine (as all machines share the same FS), no files are actually copied.
>>>>>>
>>>>>> What would have happened if I had executed the OpenMOLE console on the master node of the environment? Would OpenMOLE then already know that the submission machine and the execution machine are actually identical and thus inherently share the same storage?
>>>>>>
>>>>
>>
>
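PS: If I understand the planned substitution correctly, for the shared-FS case the stage-in of an input file boils down to creating a link instead of a copy. A minimal sketch of my understanding, using plain JDK calls; this is not actual OpenMOLE/GridScale code, and the helper name stageIn and its arguments are my own invention, only the storageSharedLocally option name comes from this thread:

    // Sketch only: with a shared FS, the "upload" of an input file to the
    // storage can become a symbolic link to the original instead of a copy.
    import java.nio.file.{Files, Path, StandardCopyOption}

    def stageIn(source: Path, replica: Path, storageSharedLocally: Boolean): Path =
      if (storageSharedLocally) Files.createSymbolicLink(replica, source)
      else Files.copy(source, replica, StandardCopyOption.REPLACE_EXISTING)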
_______________________________________________
OpenMOLE-users mailing list
[email protected]
http://fedex.iscpif.fr/mailman/listinfo/openmole-users
