Hi Riccardo,

Sorry for the late reply; the error didn't occur for a while after my 
initial post.
'Luckily' it happened again this morning, so I can finally give some more 
information.

Each job consists of 7 calculations, each producing 3 output files, so 21 
files in total per job. Somehow, the calculation stops partway, so only 
some of the output files are produced. When GC3Pie then fetches the 
output, it cannot find all the expected files in the remote folder and 
raises an error. It is not clear to me whether the calculation really 
stops on its own, or whether GC3Pie somehow terminates it. The same 
calculation runs without trouble on my local machine.
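
For reference, the expected output files are declared when each 
application object is constructed; simplified, my script looks roughly 
like this (a sketch only, with made-up file names and arguments, not my 
actual code):

    import gc3libs

    class MatlabApp(gc3libs.Application):
        def __init__(self, run_id, **extra):
            # 7 calculations x 3 output files each = 21 expected files;
            # the naming scheme here is made up for illustration
            outputs = ['timefile%d.txt' % (run_id + i) for i in range(21)]
            gc3libs.Application.__init__(
                self,
                arguments=['matlab', '-nodisplay', '-r', 'run_model'],
                inputs=[],                 # input files left out of this sketch
                outputs=outputs,
                output_dir='run%d' % run_id,
                stdout='matlab.log',
                **extra)

If some of those 21 declared files are never produced on the remote 
machine, the fetch step fails with a TransportError like the one further 
down in this mail.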

[It would help to see if the DEBUG level logs have something to say. 
Can you collect the DEBUG logs from such a problem situation?  ]

The debug log is huge, so I have only included below the part where one 
of the jobs gets into trouble:

[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : About to update state of 
application: MatlabApp.354662 (currently: RUNNING)
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : SshTransport running `ps -p 
1465 -o state=`... 
[2018-10-01 10:21:05] paramiko.transport DEBUG   : [chan 123] Max packet 
in: 32768 bytes
[2018-10-01 10:21:05] paramiko.transport DEBUG   : [chan 123] Max packet 
out: 32768 bytes
[2018-10-01 10:21:05] paramiko.transport DEBUG   : Secsh channel 123 opened.
[2018-10-01 10:21:05] paramiko.transport DEBUG   : [chan 123] Sesch channel 
123 request ok
[2018-10-01 10:21:05] paramiko.transport DEBUG   : [chan 123] EOF received 
(123)
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : Executed command 'ps -p 1465 
-o state=' on host '172.23.86.21'; exit code: 1
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : Process with PID 1465 not 
found, assuming task MatlabApp.354662 has finished running.
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : Calling state-transition 
handler 'terminating' on MatlabApp.354662 ...
[2018-10-01 10:21:05] paramiko.transport DEBUG   : [chan 123] EOF sent (123)
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : Updating job info file for pid 
1465
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] 
open('/home/ubuntu/.gc3/shellcmd.d/1465', 'wb')
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] 
open('/home/ubuntu/.gc3/shellcmd.d/1465', 'wb') -> 00000000
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] 
close(00000000)
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : Reading resource utilization 
from wrapper file 
`/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt` for task 
MatlabApp.354662 ...
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] 
open('/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt', 'r')
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] 
open('/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt', 
'r') -> 00000000
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] 
close(00000000)
 
and later on:

[2018-10-01 11:03:25] gc3.gc3libs  DEBUG   : Ignored error in fecthing 
output of task 'MatlabApp.354662': TransportError: Could not download 
'/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21' to 
'/Volumes/HtB_Data/Results_Competition1C/run16731/timefile16734.txt': 
TransportError: Could not stat() file 
'/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21': 
IOError: [Errno 2] No such file
[2018-10-01 11:03:25] gc3.gc3libs  DEBUG   : (Original traceback follows.)
Traceback (most recent call last):
  File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 1874, in progress
    changed_only=self.retrieve_changed_only)
  File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 606, in fetch_output
    app, download_dir, overwrite, changed_only, **extra_args)
  File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 674, in 
__fetch_output_application
    raise ex
TransportError: Could not download 
'/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21' to 
'/Volumes/HtB_Data/Results_Competition1C/run16731/timefile16734.txt': 
TransportError: Could not stat() file 
'/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21': 
IOError: [Errno 2] No such file

This file timefile16734.txt indeed doesn't exist on the instance, so this 
last error itself is understandable.
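
For what it's worth, the stat() that GC3Pie performs before downloading 
can be reproduced by hand with a few lines of paramiko (host and path 
copied from the traceback above); for a file that was never produced it 
should fail with the same [Errno 2]:

    import paramiko

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect('172.23.86.21', username='ubuntu')
    sftp = ssh.open_sftp()
    try:
        print(sftp.stat('/home/ubuntu/gc3libs.Wcavth/timefile16734.txt'))
    except IOError as err:
        print('missing: %s' % err)  # [Errno 2] No such file
    sftp.close()
    ssh.close()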


[Can you please post the output of `gcloud list --verbose` after killing 
the problem jobs? ]

There are no strange messages there, just the same output as before 
killing the problem jobs.

[If no instance is running any job, it is safe to delete them all (e.g., 
via `gcloud terminate` or via the Science Cloud web interface) and then 
restart your GC3Pie session-based script. ]

The problem is that the stuck jobs are not all on the same instance: on a 
given instance, some jobs are running fine while others are stuck, so 
terminating the instance also kills the successful runs. This is not a 
huge problem, but it is a bit annoying.
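
A workaround I can think of is to check by hand, before terminating an 
instance, whether any GC3Pie-managed process is still alive on it. A 
rough sketch (the host, the `~/.gc3/shellcmd.d` directory, and the `ps` 
invocation are all taken from the debug log above, which shows one 
job-info file per PID):

    import paramiko

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect('172.23.86.21', username='ubuntu')
    sftp = ssh.open_sftp()
    # one file per managed job, named after its PID
    for pid in sftp.listdir('/home/ubuntu/.gc3/shellcmd.d'):
        _, out, _ = ssh.exec_command('ps -p %s -o state=' % pid)
        state = out.read().strip()
        print('%s: %s' % (pid, state or 'gone'))
    sftp.close()
    ssh.close()

If every PID reports 'gone', the instance should be safe to terminate 
without killing any successful run.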

I hope this information helps,
Hanna




On Monday, 24 September 2018 at 16:32:57 UTC+2, Riccardo Murri wrote:
>
> Hello Hanna, 
>
> > I recently encountered two (related) problems with GC3Pie. 
>
> Lucky you :-)  I have encountered many more ;-) 
>
>
> > Sometimes, a job gets stuck in the terminating stage, and keeps on 
> > saving its output on my local computer, resulting in many folders with 
> > the same files in it (problem 1). I have no idea why this happens, it 
> > seems to happen randomly. 
>
> The only reason I can imagine is that the downloading is considered 
> "unsuccessful" for some reason, so it is attempted again during the next 
> cycle, and then again, and so on. 
>
> It would help to see if the DEBUG level logs have something to say. 
> Can you collect the DEBUG logs from such a problem situation? 
>
> To get the DEBUG logs: look into file `$HOME/.gc3/debug.log` or run 
> your session-based script adding the `-vvvv` option and save the console 
> output. For instance:: 
>
>         ./my-script.py -s session -vvvv 2>&1 | tee debug.log 
>
>
> > If I then manually kill these jobs ("gselect -s SessionName --state 
> > TERMINATING | xargs gkill -s SessionName"), the jobs are killed and get 
> > the label 'failed'. The run stops saving output from these jobs to the 
> > local computer. However, these jobs are not removed from the cloud and 
> > occupy some of the cores. Therefore the progress of the session slows 
> > down a lot because it cannot make full use of the available resources 
> > (problem 2). 
>
> Can you please post the output of `gcloud list --verbose` after killing 
> the problem jobs? 
>
> If no instance is running any job, it is safe to delete them all (e.g., 
> via `gcloud terminate` or via the Science Cloud web interface) and then 
> restart your GC3Pie session-based script. 
>
> Ciao, 
> R 
>
> -- 
> Riccardo Murri / Email: riccard...@gmail.com / Tel.: +41 77 458 98 32 
>
