I now get the following error after cleaning out the previous “workDirectory” 
of the CondorEnvironment:

May 02, 2015 4:57:23 PM org.openmole.core.batch.jobservice.JobService$class submit
FINE: Successful submission: fr.iscpif.gridscale.condor.CondorJobService$CondorJob@34ff9849
May 02, 2015 4:58:17 PM org.openmole.core.batch.environment.BatchJobWatcher update
FINE: Watch jobs 1
May 02, 2015 4:58:26 PM org.openmole.core.batch.refresh.JobManager $bang
FINE: Error in job refresh
java.io.FileNotFoundException: /homes/as12312/.openmole/merapi.doc.ic.ac.uk/.tmp/4e135519-8fa4-4c04-bc71-b3b1e157be5a/file5ee7bc14-ce94-4726-894a-38840383ad3d.bin (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at org.openmole.core.serializer.SerialiserService$$anonfun$deserialise$1.apply(SerialiserService.scala:86)
        at org.openmole.tool.lock.package$ReadWriteLockDecorator.read(package.scala:48)
        at org.openmole.core.serializer.SerialiserService$.deserialise(SerialiserService.scala:85)
        at org.openmole.core.batch.refresh.GetResultActor$$anonfun$getRuntimeResult$1.apply(GetResultActor.scala:90)
        at org.openmole.core.batch.refresh.GetResultActor$$anonfun$getRuntimeResult$1.apply(GetResultActor.scala:88)
        at org.openmole.core.workspace.Workspace.withTmpFile(Workspace.scala:217)
        at org.openmole.core.batch.refresh.GetResultActor.getRuntimeResult(GetResultActor.scala:88)
        at org.openmole.core.batch.refresh.GetResultActor.getResult(GetResultActor.scala:63)
        at org.openmole.core.batch.refresh.GetResultActor$$anonfun$receive$1$$anonfun$apply$mcV$sp$1.apply(GetResultActor.scala:50)
        at org.openmole.core.batch.refresh.GetResultActor$$anonfun$receive$1$$anonfun$apply$mcV$sp$1.apply(GetResultActor.scala:48)
        at org.openmole.core.batch.control.UsageControl$class.tryWithToken(UsageControl.scala:28)
        at org.openmole.plugin.environment.ssh.SSHPersistentStorage$$anon$2.tryWithToken(SSHPersistentStorage.scala:46)
        at org.openmole.core.batch.refresh.GetResultActor$$anonfun$receive$1.apply$mcV$sp(GetResultActor.scala:48)
        at org.openmole.core.batch.refresh.GetResultActor$$anonfun$receive$1.apply(GetResultActor.scala:46)
        at org.openmole.core.batch.refresh.GetResultActor$$anonfun$receive$1.apply(GetResultActor.scala:46)
        at org.openmole.core.batch.refresh.package$.withRunFinalization(package.scala:23)
        at org.openmole.core.batch.refresh.GetResultActor.receive(GetResultActor.scala:46)
        at org.openmole.core.batch.refresh.JobManager$DispatcherActor$.receive(JobManager.scala:84)
        at org.openmole.core.batch.refresh.JobManager$$anonfun$1$$anon$1.run(JobManager.scala:63)
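
For reference, the environment in question is declared roughly like this (a
sketch; the user, host, and workDirectory values are placeholders inferred from
paths in this thread, and any other parameters are omitted):

val env =
  CondorEnvironment(
    "as12312",                // cluster user
    "merapi.doc.ic.ac.uk",    // submission / head node
    workDirectory = "/vol/medic01/users/as12312/Data/Registrations/Workspace/openmole"
  )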

> On 2 May 2015, at 16:47, Andreas Schuh <[email protected]> wrote:
> 
> I just noticed that the first execution of my program failed due to a missing 
> library. This is a bit unexpected because LD_LIBRARY_PATH is set in my 
> .bashrc. Does OpenMOLE/GridScale pass my local LD_LIBRARY_PATH on to the 
> compute node, or do I have to do something for that to happen?
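> 
> (For illustration only, a minimal sketch assuming a SystemExecTask and a
> placeholder library path: one workaround would be to export the variable
> explicitly in the task command instead of relying on .bashrc.)
> 
> val register = SystemExecTask(
>   """bash -c 'export LD_LIBRARY_PATH=/path/to/libs:$LD_LIBRARY_PATH; ./my-program'"""
> )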
> 
> Regardless of this program execution failure, I am not sure why the following 
> error occurred on the second and third run. It does seem related to the 
> “workDirectory” setting, however, because once I set it I no longer get such 
> an error.
> 
> Caused by: java.io.FileNotFoundException: 
> /homes/as12312/.openmole/.tmp/ssh/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430578796223/95c2c56e-8503-48d5-a3a2-72dce3715e09/fbcd4ea6-15b8-4085-90cb-b796f31f39e0/.tmp/a637ef86-c722-4e9f-b23b-a1450c611c5d/filec91d4680-b643-466b-928d-4a0cdd0a9ca1.bin (No such file or directory)
> 
> 
> After setting the workDirectory to one inside my “Workspace” directory, I now 
> see that OpenMOLE is indeed using symbolic links for the replicas:
> lrwxrwxrwx 1 as12312 vip  76 May  2 16:32 
> 1430580762764_4776f49e-61ae-45b3-8513-b6e03aa69956.rep -> 
> /vol/medic01/users/as12312/Code/REPEAT/target/scala-2.11/repeat_2.11-0.1.jar
> lrwxrwxrwx 1 as12312 vip  82 May  2 16:32 
> 1430580762790_d941fe37-c611-497e-807f-d970c86df795.rep -> 
> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1000_3.nii.gz
> lrwxrwxrwx 1 as12312 vip  82 May  2 16:32 
> 1430580762814_e3eab25f-9707-4552-8afa-5e0da8d8eb16.rep -> 
> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1001_3.nii.gz
> lrwxrwxrwx 1 as12312 vip  82 May  2 16:32 
> 1430580762839_87b59538-1889-49ed-925e-b64147a0813e.rep -> 
> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1002_3.nii.gz
> lrwxrwxrwx 1 as12312 vip  68 May  2 16:32 
> 1430580762864_3bd2a9dd-e91a-482f-bff6-58ed713dc924.rep -> 
> /vol/medic01/users/as12312/Data/Registrations/Template/mni305.nii.gz
> lrwxrwxrwx 1 as12312 vip 134 May  2 16:32 
> 1430580762890_ce37b658-81f0-4a9d-ba2e-069dbbf8ddf7.rep -> 
> /homes/as12312/.openmole/merapi.doc.ic.ac.uk/.tmp/40ba4e65-2df2-4c14-879f-d57d99ff7b0e/archive4fc50dbc-acbe-40cb-aa7d-281f8162d947.tar
> 
> 
> What I don’t understand, however, is why the canonical paths of the input 
> files and of the resource directory (“rootfs”) within the task workDir, as 
> printed within my ScalaTask, look as follows:
> 
> -----------------Output on remote host-----------------
> total 9  
> lrwxrwxrwx 1 as12312 vip 280 May  2 16:40 1000_3.nii.gz -> 
> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67
> -461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/file1ee53920-fe2e-415d-a9d5-ff839290265a.bin
> lrwxrwxrwx 1 as12312 vip 280 May  2 16:40 mni305.nii.gz -> 
> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67
> -461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/file23a8dc3e-88a5-4a29-80de-f561c19a3d07.bin
> lrwxrwxrwx 1 as12312 vip 282 May  2 16:40 rootfs -> 
> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931
> b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/dirReplicae1964a36-ea8e-49a4-b301-d0caf6b439b4
> Canonical path of refIm:  
> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-
> b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/mni305.nii.gz
> Canonical path of srcIm:  
> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-
> b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/1000_3.nii.gz
> Canonical path of rootfs: 
> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-
> b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/rootfs
> 
> I would have expected these paths to refer to my input files and the 
> canonical path of “rootfs” to be 
> “/vol/medic01/users/as12312/Data/Registrations/Workspace/rootfs”.
> 
>> On 2 May 2015, at 16:12, Andreas Schuh <[email protected]> wrote:
>> 
>> Hi Romain,
>> 
>> thanks very much for realising this so quickly.
>> 
>> Find attached the log output of 3 runs. In the task output in the first log 
>> file you can see the output of “ls -l $workDir”. It seems there are still 
>> replicas created in the OpenMOLE tmp directory. Are these symbolic links? 
>> For my workflow not to require a file copy, the links in the task workDir 
>> need to eventually point to my input files/directories.
>> 
>> The following two task executions failed without any task output. I had 
>> added code to print the canonical path of the files in the workDir to see 
>> for myself whether these are my actual input files in 
>> “/homes/as12312/Data/Registrations/Workspace/rootfs”. Not sure why these 
>> fail.
>> 
>> Andreas
>> 
>> <openmole-storageSharedLocally-1.log><openmole-storageSharedLocally-2.log><openmole-storageSharedLocally-3.log>
>> 
>>> On 2 May 2015, at 15:09, Romain Reuillon <[email protected]> wrote:
>>> 
>>> Hi Andreas,
>>> 
>>> I just pushed a first implementation of the optimisation for cluster 
>>> environments in the case of a storage shared with the submission node. To 
>>> enable it you should add storageSharedLocally = true in your environment 
>>> constructor. You should kill the dbServer when you update to this version 
>>> (so it can reinitialise the db), since some files were compressed before 
>>> and are not any more.
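>>> 
>>> For instance, with the Condor environment from this thread it would look 
>>> roughly like this (a sketch; the other constructor parameters stay as they 
>>> are in your current setup):
>>> 
>>> val env =
>>>   CondorEnvironment(
>>>     "as12312",
>>>     "merapi.doc.ic.ac.uk",
>>>     storageSharedLocally = true
>>>   )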
>>> 
>>> There is still room for optimisation, especially concerning the output 
>>> files and the directories (in input and output), which are still subject 
>>> to several transformations that might be bypassed in the case of a shared 
>>> storage.
>>> 
>>> I tried it on my local machine with the SshEnvironment and it is 
>>> functional. Could you test it on your environments?
>>> 
>>> cheers,
>>> Romain
>>> 
>>> On 01/05/2015 19:09, Andreas Schuh wrote:
>>>>> On 1 May 2015, at 18:00, Romain Reuillon <[email protected]> wrote:
>>>>> 
>>>>> The default is in the home directory, but you can configure the 
>>>>> directory the jobs work in as an option of the environment. In the 
>>>>> present implementation it has to be a shared storage, but I guess that 
>>>>> $WORK is one.
>>>> Yes, it is. Only the TMPDIR is local to each compute node and not shared.
>>>> 
>>>>> On 01/05/2015 18:55, Andreas Schuh wrote:
>>>>>> FYI I just refreshed my memory of our college HPC cluster (it’s actually 
>>>>>> using PBS, not SGE as mentioned before).
>>>>>> 
>>>>>> From their intro document, the following information may be useful while 
>>>>>> revising the OpenMOLE storage handling:
>>>>>> 
>>>>>> 
>>>>>> On the HPC system, there are two file stores available to the user: HOME 
>>>>>> and WORK. HOME has a relatively small quota of 10GB and is intended for 
>>>>>> storing binaries, source and modest amounts of data. It should not be 
>>>>>> written to directly by jobs.
>>>>>> 
>>>>>> WORK is a larger area which is intended for staging files between jobs 
>>>>>> and for long-term data.
>>>>>> 
>>>>>> These areas should be referred to using the environment variables $HOME 
>>>>>> and $WORK, as their absolute locations are subject to change.
>>>>>> 
>>>>>> Additionally, there is $TMPDIR, which is local to each node. Jobs 
>>>>>> requiring scratch space at run time should write to $TMPDIR.
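>>>>>> 
>>>>>> (For illustration, a sketch that is not from the intro document: the 
>>>>>> shared work area could be resolved at run time from $WORK, falling back 
>>>>>> to $HOME, instead of hard-coding its absolute location.)
>>>>>> 
>>>>>> val sharedWorkDir = sys.env.getOrElse("WORK", sys.env("HOME")) + "/openmole"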
>>>>>> 
>>>>>>> On 1 May 2015, at 11:57, Andreas Schuh <[email protected]> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On 1 May 2015, at 11:49, Romain Reuillon <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> That would be great, as I was hoping to finally be able to run my 
>>>>>>>>> tasks and get actual results… it has been 1 month now of developing 
>>>>>>>>> the OpenMOLE workflow :(
>>>>>>>>> 
>>>>>>>>> I’ll be happy to test it in our environment. I have access to our 
>>>>>>>>> lab’s dedicated SLURM cluster and the department’s HTCondor setup. I 
>>>>>>>>> could also try it on our college HPC, which uses SGE and shared 
>>>>>>>>> storage.
>>>>>>>>> 
>>>>>>>>> I also agree that these options should be part of the environment 
>>>>>>>>> specification.
>>>>>>>>> 
>>>>>>>> Great !
>>>>>>>>>> I basically agree with you for the files in ~/.openmole: files are 
>>>>>>>>>> transferred to the node through the shared FS, so they have to be 
>>>>>>>>>> copied there. What could be optimised is the location of the 
>>>>>>>>>> temporary execution directory for tasks. It is also created in this 
>>>>>>>>>> folder and therefore on the shared FS, which is not actually 
>>>>>>>>>> required. This workdir could optionally be relocated somewhere else 
>>>>>>>>>> using an environment parameter.
>>>>>>>>>> 
>>>>>>>>> Not sure if I follow this solution outline, but I’m sure you have a 
>>>>>>>>> better idea of how things work right now and what needs to be 
>>>>>>>>> modified. Why do files have to be copied to ~/.openmole when the 
>>>>>>>>> original input files to the workflow (exploration with 
>>>>>>>>> SelectFileDomain) are already located on a shared FS?
>>>>>>>>> 
>>>>>>>>> Being able to configure the locations of the local and remote 
>>>>>>>>> temporary directories via an environment parameter would solve the 
>>>>>>>>> second issue of where temporary files such as wrapper scripts and 
>>>>>>>>> remote resources are located.
>>>>>>>>> 
>>>>>>>>> The first issue is how to deal with input and output files of tasks 
>>>>>>>>> that are already located on a shared FS and thus should not require 
>>>>>>>>> a copy to the temporary directories.
>>>>>>>> OpenMOLE environments work by copying files to storages. In the 
>>>>>>>> general case the storage is not shared between the submission machine 
>>>>>>>> and the execution machines. In the case of a cluster, OpenMOLE copies 
>>>>>>>> everything onto the shared FS using an ssh transfer to the master 
>>>>>>>> node (the entry point of the cluster) so that it is accessible to all 
>>>>>>>> the computing nodes. In the particular case where the submission 
>>>>>>>> machine shares its FS with the computing nodes, I intend to 
>>>>>>>> substitute copy operations with symlink creation, so that this 
>>>>>>>> particular case is handled by the generic submission code of OpenMOLE.
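>>>>>>>> 
>>>>>>>> A sketch of the idea (not actual OpenMOLE code): staging a file in is 
>>>>>>>> a real copy in the generic case, and becomes a symlink when the 
>>>>>>>> submission machine shares its FS with the computing nodes.
>>>>>>>> 
>>>>>>>> import java.nio.file.{ Files, Paths }
>>>>>>>> 
>>>>>>>> def stageIn(src: String, dest: String, storageSharedLocally: Boolean): Unit =
>>>>>>>>   if (storageSharedLocally)
>>>>>>>>     Files.createSymbolicLink(Paths.get(dest), Paths.get(src)) // no data is moved
>>>>>>>>   else
>>>>>>>>     Files.copy(Paths.get(src), Paths.get(dest))                // real transfer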
>>>>>>> Ok, got it, and that sounds like a good solution.
>>>>>>> 
>>>>>>> So the optional symbolic links (“link” option of “addInputFile” and 
>>>>>>> “addResource”) from the temporary directory/workingDir of each 
>>>>>>> individual task point to the storage on the master node of the 
>>>>>>> execution machines. That is why I currently encounter an unexpected 
>>>>>>> copy of my files. When, however, the storage used by the execution 
>>>>>>> machines itself uses symbolic links to the storage of the submission 
>>>>>>> machine (as all machines share the same FS), no files are actually 
>>>>>>> copied.
>>>>>>> 
>>>>>>> What would have happened if I had executed the OpenMOLE console on the 
>>>>>>> master node of the environment? Would OpenMOLE then already know that 
>>>>>>> the submission machine and the execution machines are actually 
>>>>>>> identical and thus inherently share the same storage?
>>>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 

_______________________________________________
OpenMOLE-users mailing list
[email protected]
http://fedex.iscpif.fr/mailman/listinfo/openmole-users
