If your problem really is just that the fastq groomer is slow, I
implemented several small optimizations for the fastq groomer that I think
resulted in a big improvement in performance. It seems it is not
really used at my institution any more, so I never pushed the changes
out to our production server or pushed too hard on the pull request.
But I did some testing as I was making the changes, and none of the
changes broke the functional tests, so there is some chance they don't
break anything. You can pull my changes from here if you are
interested:

https://bitbucket.org/galaxy/galaxy-central/pull-request/20/fastq_groomer-optimizations

-John

------------------------------------------------
John Chilton
Senior Software Developer
University of Minnesota Supercomputing Institute
Office: 612-625-0917
Cell: 612-226-9223

On Tue, Jul 24, 2012 at 6:32 PM, Kenny Sabir <traks...@gmail.com> wrote:
> Hello all,
>
> We were having the same issues, with the groomer taking up to 12 hours for large
> files. I had a look at the code and saw that it was only using a single core. I
> changed the code to split the fastq input into multiple file parts and
> process it in parallel and reassemble the results. It also reassembles the
> aggregator data (which prints the final summary).
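
The split/process/reassemble idea described above can be sketched roughly as follows. This is not Kenny's code or the groomer's actual implementation, only a minimal illustration. All function names are hypothetical, the conversion shown (Phred+64 to Phred+33) is just one example of per-record work, and, like the approach described, it assumes strict 4-line FASTQ records.

```python
from itertools import islice


def read_chunks(handle, records_per_chunk=100000):
    """Yield lists of lines, each holding whole 4-line FASTQ records."""
    while True:
        chunk = list(islice(handle, 4 * records_per_chunk))
        if not chunk:
            return
        yield chunk


def illumina_to_sanger(chunk):
    """Convert the Phred+64 quality lines of one chunk to Phred+33."""
    out = []
    for i, line in enumerate(chunk):
        if i % 4 == 3:  # fourth line of each record holds the qualities
            text = line.rstrip("\r\n")
            line = "".join(chr(ord(c) - 31) for c in text) + "\n"
        out.append(line)
    return out


def groom(handle, out, map_fn=map, records_per_chunk=100000):
    """Process chunks with map_fn and write the results in input order."""
    chunks = read_chunks(handle, records_per_chunk)
    for converted in map_fn(illumina_to_sanger, chunks):
        out.writelines(converted)


# Hypothetical parallel use: pass an order-preserving parallel map.
#   import multiprocessing
#   with multiprocessing.Pool(8) as pool:
#       with open("in.fastq") as fin, open("out.fastq", "w") as fout:
#           groom(fin, fout, map_fn=pool.imap)
```

With the default `map_fn=map` everything runs in one process. Passing `pool.imap` spreads the chunks over worker processes while still returning results in input order, so the output should match the serial run.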
>
> Using 8 cores we saw a 7x improvement. Naturally the data-output is
> identical. One limitation is that it does not support fastq that has
> multiple lines per single sequence. I have read that this practice is
> discouraged anyway as it was problematic (though it was in the original
> spec) and I haven't seen this occur in our data so far.
>
> I believe there is still room to improve, as Python's readline() has
> suboptimal performance as it will do too much file I/O without enough
> buffering.
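
One way to experiment with the buffering point above is to open the file with a larger buffer and iterate over it rather than calling readline() repeatedly. This is a sketch only: `iter_fastq_records` and the 1 MB default are made up for illustration, and whether it helps at all depends on the Python version and the storage, so it is worth timing before relying on it.

```python
from itertools import islice


def iter_fastq_records(path, buffer_bytes=1024 * 1024):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ file."""
    # buffering > 1 asks for a buffer of roughly that many bytes.
    with open(path, "r", buffering=buffer_bytes) as handle:
        while True:
            record = tuple(line.rstrip("\r\n") for line in islice(handle, 4))
            if not record:
                return
            if len(record) != 4:
                raise ValueError("truncated FASTQ record")
            yield record
```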
>
> I'm new to bioinformatics, though I come from an R&D computer engineering background. If
> anyone is at the Chicago Galaxy conference, you can talk to Warren Kaplan
> about this. I can provide the code.
>
> regards
> Kenny
>
> ------
> Bioinformatics Architect
> Garvan Institute
>
>
> On Wed, Jul 25, 2012 at 5:54 AM, Langhorst, Brad <langho...@neb.com> wrote:
>>
>> Galaxy just wraps existing tools... so it's probably not Galaxy that is
>> slow per se, but the fastqgroomer tool.  Each tool has its own performance
>> characteristics.
>>
>> I don't use fastqgroomer, so I don't know how it can be expected to
>> perform.
>>
>> Are you sure you need it?
>>
>> If you know that your quality scores are already scaled in Sanger units (Ion Torrent and
>> CASAVA 1.8 fastqs are), then you may not.
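
A quick way to check this is to look at the lowest quality character in a sample of reads. The sketch below is a heuristic, not part of Galaxy or the groomer, and `guess_quality_offset` is a hypothetical name. It assumes 4-line records and relies on the fact that characters below ';' (ASCII 59) can only occur in Sanger (Phred+33) files. A result of 64 is only a guess, since very high quality Sanger data can look the same.

```python
def guess_quality_offset(handle, max_records=10000):
    """Return 33, 64, or None if the sampled quality lines are ambiguous."""
    lowest = None
    for i, line in enumerate(handle):
        if i >= 4 * max_records:
            break
        if i % 4 == 3:  # fourth line of each record holds the qualities
            for c in line.rstrip("\r\n"):
                code = ord(c)
                if lowest is None or code < lowest:
                    lowest = code
    if lowest is None:
        return None  # no quality characters seen
    if lowest < 59:
        return 33    # below ';' is only valid in Sanger / Phred+33
    if lowest >= 64:
        return 64    # a guess: consistent with Phred+64, not proof of it
    return None      # 59..63: could be Solexa+64 or Sanger
```

If this returns 33 for your file, the offset is already Sanger and the groomer is probably not needed for that purpose.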
>>
>> If you look at your activity monitor you can see if CPU or disk is the
>> limiting factor for the work you are doing.
>>
>>
>> Brad
>> On Jul 24, 2012, at 3:41 PM, Di Nguyen wrote:
>>
>> > Dear All,
>> >
>> > I successfully installed Galaxy on my new MBP with 16 GB of RAM, but when
>> > I tried to use Galaxy, it was painfully slow! The first test I did was to
>> > create an Admin and import data (RNA-seq fastq, about 6 GB in size) into the
>> > database and then the history, and it worked fine. The second test was to run
>> > fastqgroomer on this fastq, and it took forever (3 hours+).
>> >
>> > Does anybody have an idea of why it is so slow? Would it be possible that
>> > Galaxy was set up to run a single process instead of using the 8-core processor? If
>> > that is the case, how do I fix it?
>> >
>> > Please help!
>> >
>> > Di Nguyen
>> > Postdoc, U of W, Seattle, WA
>> > ___________________________________________________________
>> > Please keep all replies on the list by using "reply all"
>> > in your mail client.  To manage your subscriptions to this
>> > and other Galaxy lists, please use the interface at:
>> >
>> > http://lists.bx.psu.edu/
>>
>> --
>> Brad Langhorst
>> langho...@neb.com
>> 978-380-7564
