I would humbly guess that most of those truncated files are due to problematic HTTP uploads - so it saves the day from another problem, which should be avoided altogether.

Maybe most, but definitely not all. We see all kinds of strange corruption.

However, I have been thinking about adding a 'check only' option to
the Groomer that would use a naive parser (assume exactly 4 lines to
a read, ASCII scores, require input variant == output variant, etc.)
and reuse the underlying original dataset file as the output
(without writing over the file). This would be significantly faster
and would not waste disk space, but it would require enhancements to
the framework.
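
To make the idea concrete, a check-only pass along those lines might look
roughly like the sketch below. This is not the Groomer's code: the name
check_fastq and the min_qual/max_qual parameters are made up for
illustration, and the default range (printable ASCII 33 to 126) is an
assumption standing in for a real per-variant check. It assumes exactly
4 lines per read and writes nothing.

```python
import io


def check_fastq(handle, min_qual=33, max_qual=126):
    """Naively validate a 4-line-per-read FASTQ stream.

    Returns the number of reads seen, or raises ValueError at the first
    problem. Hypothetical helper, not part of Galaxy.
    """
    reads = 0
    while True:
        title = handle.readline()
        if not title:
            return reads  # clean end of file on a record boundary
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        n = reads + 1
        if not qual:
            # The file ended before the quality line of this record.
            raise ValueError("read %d: truncated record" % n)
        title, seq = title.rstrip("\r\n"), seq.rstrip("\r\n")
        plus, qual = plus.rstrip("\r\n"), qual.rstrip("\r\n")
        if not title.startswith("@"):
            raise ValueError("read %d: title line does not start with '@'" % n)
        if not plus.startswith("+"):
            raise ValueError("read %d: third line does not start with '+'" % n)
        if len(seq) != len(qual):
            raise ValueError("read %d: sequence/quality length mismatch" % n)
        if any(not (min_qual <= ord(c) <= max_qual) for c in qual):
            raise ValueError("read %d: quality score outside expected range" % n)
        reads = n


if __name__ == "__main__":
    sample = "@r1\nACGT\n+\nIIII\n"
    print(check_fastq(io.StringIO(sample)))
```

It reads one record at a time, so memory use does not grow with file
size. A file cut off in the middle of a quality line shows up as a
length mismatch rather than as a truncated record.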

I know you (the Galaxy team) try very hard to have everything in native Python (for easy deployment), but I still hold the opinion that these tools should not be written in Python. No matter how much you minimize the processing, it will not be as efficient as a good compiled program. Python (or Perl, I don't discriminate) can probably do this entire "check only" mode in just a few lines of regexes - but try it on twenty 14GB FASTQ files and you'll realize it's not practical.

Bottom line - I wouldn't use a Python "checker" anyhow.

We care more about easy deployment than about language. If you have a nice C function that can do this, wrapping it in Cython and packaging it is trivial and adds minimal overhead.


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/
