Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy

Anton Nekrutenko Fri, 08 Apr 2011 05:39:03 -0700

Mike:

Realignment and recalibration are not yet possible on the main site. However, we 
are working on several re-sequencing projects in house where these tools are 
used and will bring them to Galaxy by the ISMB conference in Vienna.


The indel analysis at the moment is rather simplistic (yet still very useful) 
and is based on processing of CIGAR strings in aligned SAM files. You can 
simply run datasets generated by BWA through our indels tools.
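
As a rough illustration of what CIGAR-based indel detection involves, here is a
minimal Python sketch. It is not the code behind the Galaxy indel tools: the
function name indels_from_sam_line is made up for this example, and it assumes
only the standard tab-separated SAM columns (RNAME in column 3, 1-based POS in
column 4, CIGAR in column 6).

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_sam_line(line):
    """Return (chrom, ref_pos, kind, length) for every I or D in one SAM record.

    Hypothetical helper for illustration only.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, cigar = fields[2], int(fields[3]), fields[5]
    if cigar == "*":  # unmapped read or no alignment information
        return []

    indels = []
    ref_pos = pos  # 1-based reference coordinate of the next aligned base
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            # Insertion: extra read bases, consumes no reference positions.
            indels.append((chrom, ref_pos, "I", length))
        elif op == "D":
            # Deletion: reference bases missing from the read.
            indels.append((chrom, ref_pos, "D", length))
            ref_pos += length
        elif op in "MN=X":
            # Match/mismatch and skipped regions advance along the reference.
            ref_pos += length
        # S, H and P do not consume reference positions.
    return indels


record = "r1\t0\tchr1\t100\t60\t10M2I5M3D8M\t*\t0\t0\tACGT\tIIII"
print(indels_from_sam_line(record))
```

A real tool would also count how many reads support each indel before
reporting it; this sketch only shows the per-read CIGAR walk.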

Thanks and let us know if you have more questions.


anton
galaxy team


On Apr 8, 2011, at 7:42 AM, Mike Dufault wrote:

> Sean, Anton and Jen,
>  
> Thanks for all of the suggestions (in separate replies) on how to better 
> analyze my SureSelect captured Exome data. My original work-flow is below in 
> the e-mail string.
>  
> Based on the suggestions, I plan to change my work-flow by increasing my 
> quality filter from 20 to 25-30 and increasing my minimum coverage from 3x to 
> ~20x. I will use the Join function to compare the SNPs that are in common 
> with the samples from two family members to filter (narrow down) what they 
> have in common, since I am looking for a hereditary disease. Then I will use 
> the Join function again with the SNPs from build (131) to characterize the 
> SNPs.
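
The two Join steps described above come down to set operations on SNP
positions. The following Python sketch illustrates that logic under stated
assumptions: each input is tab-separated with the chromosome in the first
column and the position in the second, and the function names snp_keys and
shared_and_novel are hypothetical. It is not how the Galaxy Join tool is
implemented.

```python
def snp_keys(lines, chrom_col=0, pos_col=1):
    """Collect (chromosome, position) keys from tab-separated lines.

    Column indices are 0-based and are assumptions about the input layout.
    """
    keys = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.rstrip("\n").split("\t")
        keys.add((fields[chrom_col], int(fields[pos_col])))
    return keys


def shared_and_novel(sample_a, sample_b, known):
    """Split SNPs shared by two samples into known and not-yet-known sites."""
    shared = snp_keys(sample_a) & snp_keys(sample_b)
    known_keys = snp_keys(known)
    return shared & known_keys, shared - known_keys


relative_1 = ["chr1\t100\tA\tG", "chr1\t200\tC\tT", "chr2\t50\tG\tA"]
relative_2 = ["chr1\t100\tA\tG", "chr2\t50\tG\tA", "chr3\t7\tT\tC"]
dbsnp_like = ["chr1\t100"]

in_db, not_in_db = shared_and_novel(relative_1, relative_2, dbsnp_like)
print(sorted(in_db), sorted(not_in_db))
```

Matching on position alone ignores the alternate allele; comparing alleles as
well would be a stricter test.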
>  
> Sean suggested realignment around indels and potentially quality score 
> recalibration. Is that even possible with Galaxy at the moment?
>  
> Where in the flow can I perform Indel analysis? Will I need to process my 
> data separately for SNPs and Indel analysis, or can they be done sequentially 
> in the same linear work-flow? I am still a little unsure of the best way to 
> handle this.
>  
> Please let me know if you have any more suggestions or comments before I 
> re-launch the analysis later this evening. Once I get a flow that works, I 
> hope to be able to publish it for everyone to benefit from.
>  
> Thanks to the Galaxy team for an outstanding platform and support!
>  
> Mike
> --- On Tue, 4/5/11, Sean Davis <sdav...@mail.nih.gov> wrote:
> 
> From: Sean Davis <sdav...@mail.nih.gov>
> Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
> To: "Mike Dufault" <dufau...@yahoo.com>
> Cc: "galaxy-user" <galaxy-user@lists.bx.psu.edu>
> Date: Tuesday, April 5, 2011, 4:39 PM
> 
> Hi, Mike.  See my couple of comments below....
> 
> Sean
> 
> On Tue, Apr 5, 2011 at 2:22 PM, Mike Dufault <dufau...@yahoo.com> wrote:
> Hi all,
>  
> Like many people on this e-mail chain, I have been looking for advice on how 
> to process Exome data. Below, I have described in detail what I have done 
> with the hope of getting some clarification. Hopefully it will be helpful to 
> many of us!
>  
> I have SureSelect Exome captured data. The data was delivered to me as two 
> separate files (/1) & (/2). Each file has ~33 million reads; 7.2 GB each. I 
> am looking for SNPs from a family with cancer. Eventually I plan to compare 
> the data from multiple members of the same family to find a related disease 
> SNP.
>  
> Below is the workflow that I used to process my data. I adapted it from the 
> screencast titled "Mapping Illumina Reads: Paired Ends Example." I used all 
> of the same default parameters as in the screencast.
>  
> At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in 
> step 14, I filtered on column 7 (c7) which I believe is the Quality SNP 
> value. I set the filter as C7>=1 to remove all of the 0 (zero) values for 
> Quality SNP. I figured that if they have a value of zero, they must not be 
> real SNPs. This left me with ~180,000 SNPs.
>  
> 1: Get Data: Illumina 1.3+ file (/1)
> 2: Get Data: Illumina 1.3+ file (/2)
> 3: FASTQ Groomer on data 1
> 4: FASTQ Groomer on data 2
> 5: FASTQ Summary Statistics on data 3
> 6: FASTQ Summary Statistics on data 4
> 7: Box plot on data 5
> 8: Box plot on data 6
> 9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads
> 
> This might not be the best choice, as bowtie does not allow gapped alignment. 
>  See here for a discussion of indels and SNV calling:
> 
> http://bioinformatics.oxfordjournals.org/content/26/6/722.long
> 
> You will probably also want to consider local realignment around indels and 
> potentially quality score recalibration.  
>  
> 10: Filter Sam on data 9
> 11: SAM-to-BAM on data 10: converted to BAM
> 12: Generate pileup on data 11: converted pileup
> 13: Filter pileup on data 12
> 14: Filter data on 13 (c7>=1)
> 15: Sort on data 14 (C7; descending order)
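
Steps 14 and 15 amount to a threshold filter on one column followed by a
descending sort on the same column. A minimal Python sketch of that idea
follows. The function name filter_and_sort is hypothetical, the input is
assumed to be tab-separated, and which value column 7 actually holds depends
on the pileup format in use, so the column number is a parameter rather than
a fixed fact.

```python
def filter_and_sort(lines, qual_col=7, min_qual=1):
    """Keep rows whose value in column qual_col (1-based) is >= min_qual,
    then sort the kept rows by that column in descending order.

    Assumes the chosen column is numeric in every data row.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < qual_col:
            continue  # row too short to contain the chosen column
        if float(fields[qual_col - 1]) >= min_qual:
            kept.append(fields)
    kept.sort(key=lambda f: float(f[qual_col - 1]), reverse=True)
    return kept


rows = [
    "chr1\t10\tA\tG\t30\t8\t0",
    "chr1\t20\tC\tT\t30\t8\t45",
    "chr1\t30\tG\tA\t30\t8\t12",
]
for fields in filter_and_sort(rows):
    print("\t".join(fields))
```

Raising min_qual, or adding a second test on a coverage column, would
tighten the filter in the same way.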
>  
> First, if anyone has ideas on how to improve the workflow, I would be open to 
> suggestions; especially from people experienced with Galaxy.
>  
> Second, I am concerned that many/most of the SNPs are known. Should I filter 
> my data against known SNPs in dbSNP? If so, how can I do this in Galaxy (in 
> Bowtie?)
> 
> Keep in mind that, depending on the version of dbSNP, there are many 
> cancer-associated SNPs contaminating the database.
> 
>  
> Third, as suggested in the screencast, I did not trim or filter my FASTQ 
> Groomed data because I was interested in SNPs and I could filter on Quality 
> later in the workflow. Would implementing a filtering step on phred quality 
> (~20) at this step save me the step of filtering later on? Currently it takes 
> multiple hours (~16) to process the data from start to finish. Would 
> filtering at this step reduce the amount of time that it takes to process my 
> data? Presumably, there would be less data to process. I do this on the AWS 
> Cloud and time is money!
>  
> 
> Adding a gapped alignment algorithm, indel realignment, and quality 
> recalibration can easily increase this time to a couple of days per sample.
>  
> Fifth, when using Galaxy on the AWS cloud, does adding additional cores or 
> adding High CPU (or both) shorten the time to process the data? When I set 
> up extra cores, it appeared that some of them are idle and I don't want to 
> pay for idle cores. If anyone could share information on how best to manage 
> the cloud, it would be appreciated.
>  
> Finally, what is the difference between “stopping” an instance and 
> “terminating” an instance on the cloud? Would I still get charged by AWS if I 
> just stop an instance? Any clarification in this area would also be much 
> appreciated. Again, time is money!
> I hope this helps many of us!
>  
> Unfortunately, I will not be in Pitt to ask these questions in person.
>  
> Thanks in advance!!!
>  
> Mike
> ___________________________________________________________
> The Galaxy User list should be used for the discussion of
> Galaxy analysis and other features on the public server
> at usegalaxy.org.  Please keep all replies on the list by
> using "reply all" in your mail client.  For discussion of
> local Galaxy instances and the Galaxy source code, please
> use the Galaxy Development list:
> 
>  http://lists.bx.psu.edu/listinfo/galaxy-dev
> 
> To manage your subscriptions to this and other Galaxy lists,
> please use the interface at:
> 
>  http://lists.bx.psu.edu/

Anton Nekrutenko
http://nekrut.bx.psu.edu
http://usegalaxy.org

