Dear all,

Are there any known issues with the job-splitting code (i.e. the new <parallelism> tags in the tool wrappers) and the order of the sub-jobs? I've noticed two apparent problems on our production Galaxy (a bit old now, 6799:40f1816d6857 from 7 March).

I added a diagnostic print statement to the jobs' stdout giving the node number and SGE job number. When viewing the combined stdout in Galaxy, the SGE job numbers should (I think) be strictly increasing. That isn't always the case, e.g. here task_7 was added to the queue before task_6:

/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_0: Running on n3 as job 27700
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_1: Running on n12 as job 27701
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_2: Running on n8 as job 27702
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_3: Running on n6 as job 27703
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_4: Running on n11 as job 27704
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_5: Running on n10 as job 27705
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_6: Running on n4 as job 27707
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_7: Running on n5 as job 27706
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_8: Running on n9 as job 27708
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_9: Running on n7 as job 27709
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_10: Running on n12 as job 27710
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_11: Running on n9 as job 27711
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_12: Running on n6 as job 27712
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_13: Running on n7 as job 27713
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_14: Running on n4 as job 27714
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_15: Running on n8 as job 27715
/mnt/galaxy/galaxy-dist/database/job_working_directory/004/4055/task_16: Running on n10 as job 27716

In a separate
example with 33 sub-tasks, there were two of these inversions, while in yet another example with 33 sub-tasks there was a trio submitted out of order. This non-deterministic behaviour is a little surprising, but in itself not an immediate problem.

In what appears to be a separate (and more concerning) loss of order, the contents of the output file appear to be in a randomized order after merging. I would expect the output from task_0, then task_1, ..., and finally task_16. I haven't yet worked out what order I am getting, but it isn't this, and neither is it the order of the SGE job numbers (which would be correct bar one pair switched round).

Having looked at lib/galaxy/jobs/runners/tasks.py, the source of this behaviour currently eludes me [*]. Has anyone else observed anything like this before?

Regards,

Peter

[*] P.S. I would like to see an upper bound on the sleep_time in method run_job, say half an hour? Otherwise, with a group of long-running jobs, it seems Galaxy may end up waiting a very long time between checks for their completion, since it just doubles the wait at each point. I had sometimes noticed a delay between the sub-jobs finishing according to the cluster and Galaxy doing anything about merging their output, which is probably why.
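
To illustrate the cap suggested in the P.S., here is a minimal standalone sketch of a doubling wait with an upper bound. It is not the code from tasks.py: the function name sleep_intervals, its parameters, and the 1 second starting value are assumptions made for illustration; only the doubling and the half-hour (1800 second) limit come from the text above.

```python
import itertools


def sleep_intervals(initial=1.0, factor=2.0, cap=1800.0):
    """Yield successive wait times in seconds.

    The wait is multiplied by `factor` after each check, but never
    exceeds `cap`. All names and defaults here are hypothetical.
    """
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First 13 waits, starting from 1 second with a 30 minute cap.
waits = list(itertools.islice(sleep_intervals(), 13))
print(waits)
```

With these assumed values the wait doubles up to 1024 seconds and then stays at 1800 seconds, so a finished group of sub-jobs would be noticed at most half an hour late rather than after an ever-growing interval.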