I would humbly guess that most of those truncated files are due to
problematic HTTP uploads - so it saves the day from another problem,
which should be avoided altogether.
Maybe most, but definitely not all. We see all kinds of strange
However, I have been thinking about adding a 'check only' option to
the Groomer that would use a naive parser (assume exactly 4 lines to
a read, ascii scores, require input variant==output variant, etc.)
and reuse the underlying original dataset file as the output
(without writing over the file). This would be significantly faster
and not waste disk space, but it would require enhancements to the
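
For illustration only, the 'check only' idea described above could be
sketched roughly as follows. This is not the Groomer's code: the name
naive_fastq_check and the 33..126 score range (a Sanger-style offset)
are assumptions made for the example. It assumes exactly 4 lines per
read, ASCII scores, and only reads the file, never rewriting it.

```python
import io


def naive_fastq_check(handle, min_qual=33, max_qual=126):
    """Naively validate FASTQ text; return (ok, message).

    Hypothetical helper: assumes exactly 4 lines per read and ASCII
    quality characters in min_qual..max_qual (assumed Sanger-style range).
    """
    n_lines = 0
    record = []
    for raw in handle:
        n_lines += 1
        record.append(raw.rstrip("\r\n"))
        if len(record) < 4:
            continue
        title, seq, plus, qual = record
        read_no = n_lines // 4
        if not title.startswith("@"):
            return False, "read %d: title line does not start with '@'" % read_no
        if not plus.startswith("+"):
            return False, "read %d: third line does not start with '+'" % read_no
        if len(seq) != len(qual):
            return False, "read %d: sequence/quality length mismatch" % read_no
        if any(not (min_qual <= ord(c) <= max_qual) for c in qual):
            return False, "read %d: quality character out of range" % read_no
        record = []
    if record:
        # Leftover lines mean the last read is incomplete. Note that a
        # stray trailing blank line is also reported here.
        return False, "file ends mid-record (%d trailing line(s)); possibly truncated" % len(record)
    return True, "%d reads OK" % (n_lines // 4)


# Usage on an in-memory example; a real dataset would be opened read-only.
example = "@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n"
print(naive_fastq_check(io.StringIO(example)))
```

Because nothing is written, a passing dataset could simply be reused as
the output, which is the disk-space saving mentioned above. Whether a
pure Python loop like this is fast enough on very large files is
exactly the point under debate in this thread.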
I know you (the Galaxy team) try very hard to have everything in
native Python (for easy deployment) but I still hold the opinion
that these tools should not be done in Python. No matter how much
you minimize the processing, it will not be as efficient as a good
compiled program. Python (or Perl, I don't discriminate) can probably
do this entire "check only mode" in just a few lines of regexes -
but try it on twenty 14GB FASTQ files and you'll realize it's not
Bottom line - I wouldn't use a Python "checker" anyhow.
We care more about easy deployment than language. If you have a nice C
function that can do this, wrapping it in Cython and packaging it is
trivial and adds minimal overhead.
Please keep all replies on the list by using "reply all"
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at: