Hi Riccardo,

Sorry for the late reply; the error didn't occur for a while after my initial post. 'Luckily' it happened again this morning, so I can finally give some more information.
Each job consists of 7 calculations, each producing 3 output files, so 21 files in total per job. Somehow the calculation stops, and therefore only a few output files are produced. When GC3Pie downloads the output, it cannot find all the files in the folder and gives an error. It is not clear to me whether the calculation really stops, or whether GC3Pie somehow terminates it. The calculation runs without trouble on my local machine.

[It would help to see if the DEBUG level logs have something to say. Can you collect the DEBUG logs from such a problem situation?]

The debug log is huge, so I have only pasted below the part where one of the jobs gets into trouble:

[2018-10-01 10:21:05] gc3.gc3libs DEBUG : About to update state of application: MatlabApp.354662 (currently: RUNNING)
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : SshTransport running `ps -p 1465 -o state=`...
[2018-10-01 10:21:05] paramiko.transport DEBUG : [chan 123] Max packet in: 32768 bytes
[2018-10-01 10:21:05] paramiko.transport DEBUG : [chan 123] Max packet out: 32768 bytes
[2018-10-01 10:21:05] paramiko.transport DEBUG : Secsh channel 123 opened.
[2018-10-01 10:21:05] paramiko.transport DEBUG : [chan 123] Sesch channel 123 request ok
[2018-10-01 10:21:05] paramiko.transport DEBUG : [chan 123] EOF received (123)
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : Executed command 'ps -p 1465 -o state=' on host '172.23.86.21'; exit code: 1
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : Process with PID 1465 not found, assuming task MatlabApp.354662 has finished running.
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : Calling state-transition handler 'terminating' on MatlabApp.354662 ...
[2018-10-01 10:21:05] paramiko.transport DEBUG : [chan 123] EOF sent (123)
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : Updating job info file for pid 1465
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] open('/home/ubuntu/.gc3/shellcmd.d/1465', 'wb')
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] open('/home/ubuntu/.gc3/shellcmd.d/1465', 'wb') -> 00000000
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] close(00000000)
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : Reading resource utilization from wrapper file `/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt` for task MatlabApp.354662 ...
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] open('/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt', 'r')
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] open('/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt', 'r') -> 00000000
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] close(00000000)

and later on:

[2018-10-01 11:03:25] gc3.gc3libs DEBUG : Ignored error in fecthing output of task 'MatlabApp.354662': TransportError: Could not download '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21' to '/Volumes/HtB_Data/Results_Competition1C/run16731/timefile16734.txt': TransportError: Could not stat() file '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21': IOError: [Errno 2] No such file
[2018-10-01 11:03:25] gc3.gc3libs DEBUG : (Original traceback follows.)
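Incidentally, the "exit code: 1" line above is how GC3Pie's shellcmd backend decides the remote process is gone: `ps -p PID -o state=` exits non-zero once the PID no longer exists. A standalone sketch of that polling idea (not GC3Pie source code, just an illustration):

```python
# Illustration only: the liveness check behind "Process with PID 1465
# not found, assuming task ... has finished running."
import os
import signal
import subprocess

def process_is_alive(pid):
    """Return True if `ps` can still see a process with this PID."""
    result = subprocess.run(
        ["ps", "-p", str(pid), "-o", "state="],
        capture_output=True, text=True,
    )
    # exit code 0 while the process exists; non-zero once it is gone
    return result.returncode == 0

# A live process: this interpreter itself.
print(process_is_alive(os.getpid()))   # True

# A dead process: spawn, terminate, and reap a child, then poll again.
child = subprocess.Popen(["sleep", "60"])
child.send_signal(signal.SIGTERM)
child.wait()                           # reap it so `ps` no longer lists it
print(process_is_alive(child.pid))     # False
```

Note that this check only tells GC3Pie the process disappeared; it cannot distinguish a MATLAB run that finished normally from one that crashed or was killed, which is why the missing output files only surface later, at download time.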
Traceback (most recent call last):
  File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 1874, in progress
    changed_only=self.retrieve_changed_only)
  File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 606, in fetch_output
    app, download_dir, overwrite, changed_only, **extra_args)
  File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 674, in __fetch_output_application
    raise ex
TransportError: Could not download '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21' to '/Volumes/HtB_Data/Results_Competition1C/run16731/timefile16734.txt': TransportError: Could not stat() file '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21': IOError: [Errno 2] No such file

This timefile16734.txt indeed doesn't exist, so I can understand this last error.

[Can you please post the output of `gcloud list --verbose` after killing the problem jobs?]

There are no strange messages here, just the same messages as before killing the problem jobs.

[If no instance is running any job, it is safe to delete them all (e.g., via `gcloud terminate` or via the Science Cloud web interface) and then restart your GC3Pie session-based script.]

The problem is that the terminating jobs are not all on the same instance. Some jobs are running fine on a given instance while others are stuck, so if I terminate the instance, it also kills the successful runs. This is not a huge problem, but it is a bit annoying.

I hope this information helps,
Hanna

On Monday, 24 September 2018 at 16:32:57 UTC+2, Riccardo Murri wrote:
>
> Hello Hanna,
>
> > I recently encountered two (related) problems with GC3Pie.
>
> Lucky you :-) I have encountered many more ;-)
>
> > Sometimes, a job gets stuck in the terminating stage, and keeps on saving
> > its output on my local computer, resulting in many folders with the same
> > files in it (problem 1). I have no idea why this happens, it seems to
> > happen randomly.
>
> The only reason I can imagine is that the downloading is considered
> "unsuccessful" for some reason, so it is attempted again during the next
> cycle, and then again, and so on.
>
> It would help to see if the DEBUG level logs have something to say.
> Can you collect the DEBUG logs from such a problem situation?
>
> To get the DEBUG logs: look into file `$HOME/.gc3/debug.log` or run
> your session-based script adding the `-vvvv` option and save the console
> output. For instance::
>
>     ./my-script.py -s session -vvvv 2>&1 | tee debug.log
>
> > If I then manually kill these jobs ("gselect -s SessionName --state
> > TERMINATING | xargs gkill -s SessionName"), the jobs are killed and get
> > the label 'failed'. The run stops saving output from these jobs to the
> > local computer. However, these jobs are not removed from the cloud and
> > occupy some of the cores. Therefore the progress of the session slows
> > down a lot because it cannot make full use of the available resources
> > (problem 2).
>
> Can you please post the output of `gcloud list --verbose` after killing
> the problem jobs?
>
> If no instance is running any job, it is safe to delete them all (e.g.,
> via `gcloud terminate` or via the Science Cloud web interface) and then
> restart your GC3Pie session-based script.
>
> Ciao,
> R
>
> --
> Riccardo Murri / Email: riccard...@gmail.com / Tel.: +41 77 458 98 32
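For what it's worth, the failure mode Riccardo describes above ("attempted again during the next cycle, and then again") would also explain the duplicated folders: if fetching any one expected output file raises, the whole download is retried on the next polling cycle, re-downloading the files that do exist. A minimal sketch of that behaviour (not GC3Pie code; the file names are made up for illustration):

```python
# Hypothetical file names: one expected output is missing on the remote
# host, as with timefile16734.txt in the traceback above.
expected_outputs = ["out1.txt", "out2.txt", "timefile.txt"]
remote_files = {"out1.txt": "a", "out2.txt": "b"}   # timefile.txt absent

downloads = []

def fetch_output():
    """Download every expected file; fail on the first missing one."""
    for name in expected_outputs:
        if name not in remote_files:
            raise IOError("No such file: %s" % name)
        downloads.append(name)   # file is (re-)written locally

# Each polling cycle retries the "unsuccessful" fetch from scratch:
for cycle in range(3):
    try:
        fetch_output()
    except IOError:
        pass   # fetch considered unsuccessful; retried next cycle

print(downloads)   # each existing file has been downloaded three times
```

Under this reading, the repeated local folders are a symptom of the same root cause as the TransportError: the calculation died early, so one expected output file never appeared, and every retry re-fetches the files that did.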