On Feb 13, 2012, at 11:52 AM, Fields, Christopher J wrote:

> On Feb 13, 2012, at 9:45 AM, Nate Coraor wrote:
>
>> On Feb 8, 2012, at 9:32 PM, Fields, Christopher J wrote:
>>
>>> 'samtools sort' seems to be running on our server end as well (not on the
>>> cluster). I may look into it a bit more myself. Snapshot of top off our
>>> server (you can see our local runner as well):
>>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>
>>> 3950 galaxy 20 0 1303m 1.2g 676 R 99.7 15.2 234:48.07 samtools sort
>>> /home/a-m/galaxy/dist-database/file/000/dataset_587.dat
>>> /home/a-m/galaxy/dist-database/tmp/tmp9tv6zc/sorted
>>> 5417 galaxy 20 0 1186m 104m 5384 S 0.3 1.3 0:15.08 python
>>> ./scripts/paster.py serve universe_wsgi.runner.ini --server-name=runner0
>>> --pid-file=runner0.pid --log-file=runner0.log --daemon
>>
>> Hi Chris,
>>
>> 'samtools sort' is run by groom_dataset_contents, which should only be
>> called from within the upload tool, which should run on the cluster unless
>> you still have the default local override for it in your job runner's config
>> file.
>
> Yes, that is likely the problem. Our cluster was running an old version of
> python (v2.4) that was also UCS2 (bx_python broke), so we were running
> locally. That was rectified this past week (the admins insisted on not
> installing a python version locally, so we insisted back they install
> something modern using UCS4). I tested a single upload with success off the
> cluster, so I would guess this is rectified (I'll confirm that).
>
> Is there any information on data grooming on the wiki? I only found info
> relevant to FASTQ grooming, not SAM/BAM.
FASTQ grooming runs voluntarily as a tool. The datatype grooming method is
only called at the end of the upload tool, and is only defined for the Bam
datatype (although other datatypes could define it). I believe it's
implemented this way because it was deemed inefficient to force FASTQ
grooming when the FASTQ may already be in an acceptable format. I am not sure
why the same determination was not made for BAM, so perhaps one of my
colleagues will clarify that.

>
>> Ryan's instance is running 'samtools index' which is in set_meta which is
>> supposed to be run on the cluster if set_metadata_externally = True, but can
>> be run locally under certain conditions.
>>
>> --nate
>
> Will have to check, but I believe we have not set that yet either. We are in
> the midst of moving all jobs to the cluster, just rectifying the various
> issues with disparate python versions, etc. which now seem to be rectified,
> so that will shortly be resolved as well.

set_metadata_externally = True should "just work" and will significantly
decrease the performance penalty taken on the server and by the (effectively
single-threaded) Galaxy process.

--nate

>
> chris
>
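
For anyone wanting to check both settings discussed in this thread at once,
here is a minimal sketch in Python. It is not part of Galaxy. The section
names [app:main] and [galaxy:tool_runners], the tool id upload1, and the
helper name check_offload_settings are all assumptions made for illustration,
so compare them against your own config file before relying on the result.

```python
import configparser


def check_offload_settings(ini_text):
    """Report whether the two settings from this thread keep work off the server.

    Hypothetical helper: the section and option names below are assumptions
    about the layout of the Galaxy config file, not a documented API.
    """
    # Interpolation is disabled so literal '%' characters in values are harmless.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)

    # Assumed location of the main application settings.
    metadata_external = parser.getboolean(
        "app:main", "set_metadata_externally", fallback=False
    )

    # Assumed location of per-tool runner overrides. A value that starts with
    # "local:" is taken to mean the upload tool runs on the Galaxy server.
    upload_runner = parser.get("galaxy:tool_runners", "upload1", fallback=None)
    upload_runs_locally = (
        upload_runner is not None and upload_runner.strip().startswith("local:")
    )

    return {
        "metadata_external": metadata_external,
        "upload_runs_locally": upload_runs_locally,
    }
```

Pass it the text of the config file, for example
check_offload_settings(open("universe_wsgi.runner.ini").read()). If
"upload_runs_locally" comes back True, that would be consistent with
'samtools sort' showing up in top on the server rather than on the cluster.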