I now get the following error after cleaning out the previous “workDirectory”
of the CondorEnvironment:
May 02, 2015 4:57:23 PM org.openmole.core.batch.jobservice.JobService$class submit
FINE: Successful submission: fr.iscpif.gridscale.condor.CondorJobService$CondorJob@34ff9849
May 02, 2015 4:58:17 PM org.openmole.core.batch.environment.BatchJobWatcher update
FINE: Watch jobs 1
May 02, 2015 4:58:26 PM org.openmole.core.batch.refresh.JobManager $bang
FINE: Error in job refresh
java.io.FileNotFoundException: /homes/as12312/.openmole/merapi.doc.ic.ac.uk/.tmp/4e135519-8fa4-4c04-bc71-b3b1e157be5a/file5ee7bc14-ce94-4726-894a-38840383ad3d.bin (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.openmole.core.serializer.SerialiserService$$anonfun$deserialise$1.apply(SerialiserService.scala:86)
at org.openmole.tool.lock.package$ReadWriteLockDecorator.read(package.scala:48)
at org.openmole.core.serializer.SerialiserService$.deserialise(SerialiserService.scala:85)
at org.openmole.core.batch.refresh.GetResultActor$$anonfun$getRuntimeResult$1.apply(GetResultActor.scala:90)
at org.openmole.core.batch.refresh.GetResultActor$$anonfun$getRuntimeResult$1.apply(GetResultActor.scala:88)
at org.openmole.core.workspace.Workspace.withTmpFile(Workspace.scala:217)
at org.openmole.core.batch.refresh.GetResultActor.getRuntimeResult(GetResultActor.scala:88)
at org.openmole.core.batch.refresh.GetResultActor.getResult(GetResultActor.scala:63)
at org.openmole.core.batch.refresh.GetResultActor$$anonfun$receive$1$$anonfun$apply$mcV$sp$1.apply(GetResultActor.scala:50)
at org.openmole.core.batch.refresh.GetResultActor$$anonfun$receive$1$$anonfun$apply$mcV$sp$1.apply(GetResultActor.scala:48)
at org.openmole.core.batch.control.UsageControl$class.tryWithToken(UsageControl.scala:28)
at org.openmole.plugin.environment.ssh.SSHPersistentStorage$$anon$2.tryWithToken(SSHPersistentStorage.scala:46)
at org.openmole.core.batch.refresh.GetResultActor$$anonfun$receive$1.apply$mcV$sp(GetResultActor.scala:48)
at org.openmole.core.batch.refresh.GetResultActor$$anonfun$receive$1.apply(GetResultActor.scala:46)
at org.openmole.core.batch.refresh.GetResultActor$$anonfun$receive$1.apply(GetResultActor.scala:46)
at org.openmole.core.batch.refresh.package$.withRunFinalization(package.scala:23)
at org.openmole.core.batch.refresh.GetResultActor.receive(GetResultActor.scala:46)
at org.openmole.core.batch.refresh.JobManager$DispatcherActor$.receive(JobManager.scala:84)
at org.openmole.core.batch.refresh.JobManager$$anonfun$1$$anon$1.run(JobManager.scala:63)
> On 2 May 2015, at 16:47, Andreas Schuh <[email protected]> wrote:
>
> I just noticed that the first execution of my program failed due to a missing
> library. This is a bit unexpected because LD_LIBRARY_PATH is set in my
> .bashrc. Does OpenMOLE/GridScale pass my local LD_LIBRARY_PATH on to the
> compute node, or do I have to do something for that to happen?
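>
> For illustration only: batch jobs are usually started in a non-interactive
> shell that does not source ~/.bashrc, so a hypothetical workaround (the
> library path and command below are placeholders, not my actual task) is to
> set the variable explicitly in the command line that the task launches,
>
>   val command =
>     "env LD_LIBRARY_PATH=/homes/as12312/local/lib ./my-registration-tool input.nii.gz"
>
> and pass that string to the task.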
>
> Regardless of this program execution failure, I am not sure why the following
> error occurred upon the second and third run. It may be related to the
> “workDirectory” setting, however, because once I set it I no longer get such
> an error.
>
> Caused by: java.io.FileNotFoundException:
> /homes/as12312/.openmole/.tmp/ssh/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430578796223/95c2c56e-8503-48d5-a3a2-72dce3715e09/fbcd4ea6-15b8-4085-90cb-b796f31f39e0/.tmp/a637ef86-c722-4e9f-b23b-a1450c611c5d/filec91d4680-b643-466b-928d-4a0cdd0a9ca1.bin
> (No such file or directory)
>
>
> After setting the workDirectory to one inside my “Workspace” directory, I see
> now that OpenMOLE is indeed using symbolic links for the replicas:
>
> lrwxrwxrwx 1 as12312 vip 76 May 2 16:32 1430580762764_4776f49e-61ae-45b3-8513-b6e03aa69956.rep -> /vol/medic01/users/as12312/Code/REPEAT/target/scala-2.11/repeat_2.11-0.1.jar
> lrwxrwxrwx 1 as12312 vip 82 May 2 16:32 1430580762790_d941fe37-c611-497e-807f-d970c86df795.rep -> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1000_3.nii.gz
> lrwxrwxrwx 1 as12312 vip 82 May 2 16:32 1430580762814_e3eab25f-9707-4552-8afa-5e0da8d8eb16.rep -> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1001_3.nii.gz
> lrwxrwxrwx 1 as12312 vip 82 May 2 16:32 1430580762839_87b59538-1889-49ed-925e-b64147a0813e.rep -> /vol/medic01/users/as12312/Data/Registrations/Dataset/MAC2012/Images/1002_3.nii.gz
> lrwxrwxrwx 1 as12312 vip 68 May 2 16:32 1430580762864_3bd2a9dd-e91a-482f-bff6-58ed713dc924.rep -> /vol/medic01/users/as12312/Data/Registrations/Template/mni305.nii.gz
> lrwxrwxrwx 1 as12312 vip 134 May 2 16:32 1430580762890_ce37b658-81f0-4a9d-ba2e-069dbbf8ddf7.rep -> /homes/as12312/.openmole/merapi.doc.ic.ac.uk/.tmp/40ba4e65-2df2-4c14-879f-d57d99ff7b0e/archive4fc50dbc-acbe-40cb-aa7d-281f8162d947.tar
>
>
> What I don’t understand, however, is why the canonical paths of the input
> files and of the resource directory (“rootfs”) inside the task workDir, as
> printed within my ScalaTask, look as follows:
>
> -----------------Output on remote host-----------------
> total 9
> lrwxrwxrwx 1 as12312 vip 280 May 2 16:40 1000_3.nii.gz -> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/file1ee53920-fe2e-415d-a9d5-ff839290265a.bin
> lrwxrwxrwx 1 as12312 vip 280 May 2 16:40 mni305.nii.gz -> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/file23a8dc3e-88a5-4a29-80de-f561c19a3d07.bin
> lrwxrwxrwx 1 as12312 vip 282 May 2 16:40 rootfs -> /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/dirReplicae1964a36-ea8e-49a4-b301-d0caf6b439b4
> Canonical path of refIm: /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/mni305.nii.gz
> Canonical path of srcIm: /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/1000_3.nii.gz
> Canonical path of rootfs: /vol/medic01/users/as12312/Data/Registrations/Workspace/openmole/5255b992-aae3-4fa6-8dce-a56079207f3d/tmp/1430581204175/4d4419d1-7b67-4052-8a67-461931b901c6/662d254e-acc8-45dc-b344-951c673ef0b5/.tmp/6d621846-b552-410f-a96a-a14bfe7cfd52/category118e2faa-58a2-4862-b5b1-1a0181a6b208/rootfs
>
> I would have expected these paths to refer to my input files and the
> canonical path of “rootfs” to be
> "/vol/medic01/users/as12312/Data/Registrations/Workspace/rootfs”.
>
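> For reference, those lines are printed from inside the ScalaTask body roughly
> like this (a minimal sketch, not the exact code; refIm, srcIm and rootfs are
> the java.io.File inputs of the task):
>
>   println("Canonical path of refIm: " + refIm.getCanonicalPath)
>   println("Canonical path of srcIm: " + srcIm.getCanonicalPath)
>   println("Canonical path of rootfs: " + rootfs.getCanonicalPath)
>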
>> On 2 May 2015, at 16:12, Andreas Schuh <[email protected]> wrote:
>>
>> Hi Romain,
>>
>> thanks very much for realising this so quickly.
>>
>> Find attached the log output of 3 runs. In the task output found in the
>> first log file you can see the output of “ls -l $workDir”. It seems there
>> are still replicas created in the OpenMOLE tmp directory. Are these symbolic
>> links? Because what I need for my workflow not to require a file copy is
>> that the links in the task workDir eventually point to my input
>> files/directories.
>>
>> The following two task executions failed without any task output. I had
>> added code to print the canonical path of the files in the workDir to see
>> for myself whether these are my actual input files in
>> “/homes/as12312/Data/Registrations/Workspace/rootfs”. Not sure why these
>> fail.
>>
>> Andreas
>>
>> <openmole-storageSharedLocally-1.log><openmole-storageSharedLocally-2.log><openmole-storageSharedLocally-3.log>
>>
>>> On 2 May 2015, at 15:09, Romain Reuillon <[email protected]> wrote:
>>>
>>> Hi Andreas,
>>>
>>> I just pushed a first implementation of the optimisation for cluster
>>> environments in the case of a storage shared with the submission node. To
>>> enable it you should add storageSharedLocally = true to your environment
>>> constructor. You should kill the dbServer when you update to this version
>>> (so it can reinitialise the db), since some files were compressed and are
>>> not any more.
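>>>
>>> For example, a hypothetical sketch only (the login, host and any other
>>> required parameters are placeholders that depend on your setup):
>>>
>>>   val env =
>>>     CondorEnvironment(
>>>       "login",              // user on the cluster head node (placeholder)
>>>       "condor.example.org", // head node to submit through (placeholder)
>>>       storageSharedLocally = true
>>>     )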
>>>
>>> There is still room for optimisation, especially concerning the output
>>> files and the directories (in input and output), which are still subject to
>>> several transformations that might be bypassed in case of a shared storage.
>>>
>>> I tried it on my local machine with the SshEnvironment and it is functional.
>>> Could you test it on your environments?
>>>
>>> cheers,
>>> Romain
>>>
>>> On 01/05/2015 19:09, Andreas Schuh wrote:
>>>>> On 1 May 2015, at 18:00, Romain Reuillon <[email protected]> wrote:
>>>>>
>>>>> The default is in home, but you can configure where the jobs should work
>>>>> as an option of the environment. In the present implementation it has to
>>>>> be a shared storage, but I guess that $WORK is one.
>>>> Yes, it is. Only the TMPDIR is local to each compute node and not shared.
>>>>
>>>>> On 01/05/2015 18:55, Andreas Schuh wrote:
>>>>>> FYI I just refreshed my memory of our college HPC cluster (it’s actually
>>>>>> using PBS, not SGE as mentioned before).
>>>>>>
>>>>>> From their intro document, the following information may be useful while
>>>>>> revising the OpenMOLE storage handling:
>>>>>>
>>>>>>
>>>>>> On the HPC system, there are two file stores available to the user: HOME
>>>>>> and WORK. HOME has a relatively small quota of 10GB and is intended for
>>>>>> storing binaries, source and modest amounts of data. It should not be
>>>>>> written to directly by jobs.
>>>>>>
>>>>>> WORK is a larger area which is intended for staging files between jobs
>>>>>> and for long-term data.
>>>>>>
>>>>>> These areas should be referred to using the environment variables $HOME
>>>>>> and $WORK as their absolute locations are subject to change.
>>>>>>
>>>>>> Additionally, there is $TMPDIR. Jobs requiring scratch space at run time
>>>>>> should write to $TMPDIR.
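>>>>>>
>>>>>> A minimal Scala sketch, for reference only, of resolving these variables
>>>>>> at run time (the fallback values are hypothetical):
>>>>>>
>>>>>>   val work   = sys.env.getOrElse("WORK", sys.env("HOME"))
>>>>>>   val tmpDir = sys.env.getOrElse("TMPDIR", "/tmp")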
>>>>>>
>>>>>>> On 1 May 2015, at 11:57, Andreas Schuh <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 1 May 2015, at 11:49, Romain Reuillon <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> That would be great as I was hoping to finally be able to run my
>>>>>>>>> tasks to get actual results… it’s been 1 month now developing the
>>>>>>>>> OpenMOLE workflow :(
>>>>>>>>>
>>>>>>>>> I’ll be happy to test it in our environment. I have access to our lab
>>>>>>>>> dedicated SLURM cluster and the department HTCondor setup. I could
>>>>>>>>> also try it on our college HPC which uses SGE and shared storage.
>>>>>>>>>
>>>>>>>>> I also agree that these options should be part of the environment
>>>>>>>>> specification.
>>>>>>>>>
>>>>>>>> Great!
>>>>>>>>>> I basically agree with you for the files in ~/.openmole: files are
>>>>>>>>>> transferred to the node through the shared FS, so they have to be
>>>>>>>>>> copied there. What could be optimised is the location of the temporary
>>>>>>>>>> execution directory of a task. It is also created in this folder and
>>>>>>>>>> therefore on the shared FS, which is not actually required. This
>>>>>>>>>> workdir could be optionally relocated somewhere else using an
>>>>>>>>>> environment parameter.
>>>>>>>>>>
>>>>>>>>> Not sure if I follow this solution outline, but I’m sure you have a
>>>>>>>>> better idea of how things are working right now and what needs to be
>>>>>>>>> modified. Why do files have to be copied to ~/.openmole when the
>>>>>>>>> original input files to the workflow (exploration SelectFileDomain)
>>>>>>>>> are already located on a shared FS?
>>>>>>>>>
>>>>>>>>> Being able to configure the location of the local and remote temporary
>>>>>>>>> directories via an environment variable would solve the second issue of
>>>>>>>>> where temporary files such as wrapper scripts and remote resources are
>>>>>>>>> located.
>>>>>>>>>
>>>>>>>>> The first issue is how to deal with input and output files of tasks
>>>>>>>>> which are located on a shared FS already and thus should not require
>>>>>>>>> a copy to the temporary directories.
>>>>>>>> OpenMOLE environments work by copying files to storages. In the general
>>>>>>>> case the storage is not shared between the submission machine and the
>>>>>>>> execution machines. In the case of a cluster, OpenMOLE copies everything
>>>>>>>> to the shared FS using SSH transfer to the master node (the entry point
>>>>>>>> of the cluster) so that it is accessible to all the computing nodes. In
>>>>>>>> the particular case where the submission machine shares its FS with
>>>>>>>> the computing nodes, I intend to substitute copy operations with symlink
>>>>>>>> creations, so that this particular case is handled by the
>>>>>>>> generic submission code of OpenMOLE.
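>>>>>>>>
>>>>>>>> An illustrative sketch of that idea in plain Scala (not the actual
>>>>>>>> OpenMOLE code; in the generic case the real implementation transfers the
>>>>>>>> data through the storage layer rather than doing a local copy):
>>>>>>>>
>>>>>>>>   import java.nio.file.{Files, Path}
>>>>>>>>
>>>>>>>>   // Stage a file into the job's storage area: link it when the submission
>>>>>>>>   // machine and the compute nodes share the file system, copy it otherwise.
>>>>>>>>   def stageIn(src: Path, dest: Path, storageSharedLocally: Boolean): Path =
>>>>>>>>     if (storageSharedLocally) Files.createSymbolicLink(dest, src)
>>>>>>>>     else Files.copy(src, dest)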
>>>>>>> Ok, got it, and sounds like a good solution.
>>>>>>>
>>>>>>> So the optional symbolic links (“link” option of “addInputFile” and
>>>>>>> “addResource”) from the temporary directory/workingDir of each
>>>>>>> individual task point to the storage on the master node of the
>>>>>>> execution machines. That is why I currently encounter an unexpected
>>>>>>> copy of my files. When the storage used by the execution machines
>>>>>>> itself uses symbolic links to the storage of the submission machine
>>>>>>> (as all machines share the same FS), however, no files are
>>>>>>> actually copied.
>>>>>>>
>>>>>>> What would have happened if I had executed the OpenMOLE console on the
>>>>>>> master node of the environment? Would OpenMOLE then already know that
>>>>>>> the submission machine and execution machine are actually identical and
>>>>>>> thus inherently share the same storage?
>>>>>>>
>>>>>
>>>
>>>
>>
>
_______________________________________________
OpenMOLE-users mailing list
[email protected]
http://fedex.iscpif.fr/mailman/listinfo/openmole-users