> Thank you for the suggestions. Ultimately, I would like to compare
> gene(isoform) expression between two groups of 10 animals with one lane per
> animal. I am using the public server to practice with some small data sets
> right now, but will be getting the real data very soon and plan on using an
> Amazon Cloud account to actually do the analysis. I can see now that this
> approach is going to be met with some difficulty with the current state of
> the data volume restrictions and limited functionality of Galaxy for
> Cuffcompare/diff. Can you comment any further on the timeline of the
> availability of the full functionality of these programs?
Cuffcompare and Cuffdiff generate many more outputs than most other tools;
specifically, both generate multiple output files for each additional input
given. While Galaxy can handle an arbitrary number of inputs easily, handling
that many outputs is challenging and requires extending the framework.
I did a bit of the necessary work this week, but more work is required and the
path forward is a bit murky. I'm still hoping to have it available in a couple
of weeks, but no guarantees. Also, this is a good time to mention that we welcome
code patches/contributions; if you can make something work in Galaxy, we'll
review the code and, if it looks good, integrate it into our code base.
> You seemed to suggest they will be available on the public server before they
> are available on the Cloud?
I did not mean to imply this, only that the Cloud folks have their own process
and schedule for rolling out changes, and I do not know their schedule.
> Also, for the time being, would you mind clarifying for me what you mean by
> repeatedly merging Cufflinks outputs? I imagine using Tophat to map the
> reads and find splice junctions and assembling transcripts using Cufflinks
> for each of the 20 animals. Are you talking about running the Cufflinks GTF
> output through Cuffcompare, which allows two GTF files in Galaxy, and merging
> that output(the union file) with the third Cufflinks file and so on for all
> ten animals? Then do the same thing for the other group of ten animals, and
> then comparing the two for a rough idea of the differences?
Cuffdiff requires a GTF reference file; this file contains the transcripts that
will be used for comparing samples/replicates. If you're looking only at
existing transcripts, using one from UCSC works fine and no merging is
necessary. However, if you're looking for novel transcripts, you'll want to use
the combined transcripts file that Cuffcompare produces. In this case, you'll
want to iteratively merge the Cufflinks outputs for all 20 animals so that you
have a complete list of transcripts for Cuffdiff.
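
To make the iterative merge concrete, here is a rough sketch in Python that only
lists the order of two-input merge steps; it does not run Cuffcompare or call
Galaxy. The function name, the input file names, and the merged_N.combined.gtf
output names are all made up for illustration, and it assumes each step takes
exactly two GTF files, as the Galaxy tool currently does.

```python
def plan_iterative_merge(gtf_files):
    """Return (steps, final_output) for folding many GTF files into one.

    Each step is a (left_input, right_input, output) tuple describing one
    two-input merge. File names are hypothetical placeholders.
    """
    if not gtf_files:
        raise ValueError("need at least one GTF file")
    steps = []
    merged = gtf_files[0]  # start with the first animal's transcripts
    for i, next_file in enumerate(gtf_files[1:], start=1):
        output = f"merged_{i}.combined.gtf"  # hypothetical output name
        steps.append((merged, next_file, output))
        merged = output  # the result feeds into the next merge
    return steps, merged


# Hypothetical names: one Cufflinks transcript file per animal.
animals = [f"animal_{n:02d}_transcripts.gtf" for n in range(1, 21)]
steps, final = plan_iterative_merge(animals)
for left, right, out in steps:
    print(f"merge {left} + {right} -> {out}")
```

With 20 animals this gives 19 merge steps, and the output of the last step is
the combined transcripts file you would hand to Cuffdiff.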
> I guess I'm wondering how far I will be able to get with the analysis as
> things stand on the Cloud or the public server....
You should be able to get some preliminary results. As it stands now, you can
run pairwise comparisons through the
Tophat-->Cufflinks-->Cuffcompare-->Cuffdiff pipeline. You might try looking at
two different pairwise comparisons and seeing how many similarities/differences
there are between them.
> I also need to come up with a strategy to work around the 1000Gb space limit,
> as with 20 samples of 25 million reads and repeatedly generating files I
> think it will get used up quickly....
This is a question that Enis can comment on.
galaxy-user mailing list