Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy

Anton Nekrutenko Tue, 05 Apr 2011 11:36:04 -0700

Mike:

Which parameters did you use at step 13? If you used the main site to perform 
these analyses, you can share your history with me.


Thanks,

anton


On Apr 5, 2011, at 2:22 PM, Mike Dufault wrote:

> Hi all,
>  
> Like many people on this e-mail chain, I have been looking for advice on how 
> to process Exome data. Below, I have described in detail what I have done 
> with the hope of getting some clarification. Hopefully it will be helpful to 
> many of us!
>  
> I have SureSelect Exome captured data. The data was delivered to me as two 
> separate files (/1) & (/2). Each file has ~33 million reads; 7.2 GB each. I 
> am looking for SNPs from a family with cancer. Eventually I plan to compare 
> the data from multiple members of the same family to find a related disease 
> SNP.
>  
> Below is the workflow that I used to process my data. I adapted it from the 
> screencast titled "Mapping Illumina Reads: Paired Ends Example." I used all 
> of the same default parameters as in the screencast.
>  
> At the end of step 13, I had ~4,700,000 SNPs. This seemed like a lot so in 
> step 14, I filtered on column 7 (c7) which I believe is the Quality SNP 
> value. I set the filter as c7>=1 to remove all of the 0 (zero) values for 
> Quality SNP. I figured that if they have a value of zero, they must not be 
> real SNPs. This left me with ~180,000 SNPs.
>  
> 1: Get Data: Illumina 1.3+ file (/1)
> 2: Get Data: Illumina 1.3+ file (/2)
> 3: FASTQ Groomer on data 1
> 4: FASTQ Groomer on data 2
> 5: FASTQ Summary Statistics on data 3
> 6: FASTQ Summary Statistics on data 4
> 7: Box plot on data 5
> 8: Box plot on data 6
> 9: Map with Bowtie for Illumina on data 4 and data 3: mapped reads
> 10: Filter Sam on data 9
> 11: SAM-to-BAM on data 10: converted to BAM
> 12: Generate pileup on data 11: converted pileup
> 13: Filter pileup on data 12
> 14: Filter data on 13 (c7>=1)
> 15: Sort on data 14 (c7; descending order)
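
Steps 14 and 15 above (keep rows where column 7 is at least 1, then sort on column 7 in descending order) can be sketched outside Galaxy roughly as follows. This is only an illustration, assuming the filtered pileup is tab-delimited text; the function names and the sample rows are hypothetical, and the sketch makes no claim about what column 7 actually holds in the Filter pileup output.

```python
def parse_tabular(text):
    """Split tab-delimited text into rows of string fields, skipping blank lines."""
    return [line.split("\t") for line in text.splitlines() if line.strip()]


def filter_and_sort(rows, column=7, minimum=1.0):
    """Keep rows whose 1-based `column` is >= `minimum`, sorted descending on it.

    Mirrors a Galaxy filter expression such as c7>=1 followed by a numeric
    descending sort on the same column. Rows too short to have that column
    are dropped.
    """
    idx = column - 1
    kept = [row for row in rows if len(row) > idx and float(row[idx]) >= minimum]
    return sorted(kept, key=lambda row: float(row[idx]), reverse=True)


# Hypothetical seven-column rows; only the last column matters for this sketch.
example = "\n".join([
    "chr1\t100\tA\tG\t30\t40\t0",
    "chr1\t250\tC\tT\t30\t40\t12",
    "chr2\t75\tG\tA\t30\t40\t57",
])

for row in filter_and_sort(parse_tabular(example)):
    print("\t".join(row))
```

Filtering before sorting keeps the sort small, which matters when the input has millions of rows.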
>  
> First, if anyone has ideas on how to improve the workflow, I would be open to 
> suggestions; especially from people experienced with Galaxy.
>  
> Second, I am concerned that many/most of the SNPs are known. Should I filter 
> my data against the known SNPdb? If so, how can I do this in Galaxy (in 
> Bowtie?)
>  
> Third, as suggested in the screencast, I did not trim or filter my FASTQ 
> Groomed data because I was interested in SNPs and I could filter on Quality 
> later in the workflow. Would implementing a filtering step on phred quality 
> (~20) at this step save me the step of filtering later on? Currently it takes 
> multiple hours (~16) to process the data from start to finish. Would 
> filtering at this step reduce the amount of time that it takes to process my 
> data? Presumably, there would be less data to process. I do this on the AWS 
> Cloud and time is money!
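
For illustration, a quality filter of the kind asked about above could look like the sketch below. It assumes groomed FASTQ in Sanger encoding (ASCII offset 33), exactly four lines per record, and mates listed in the same order in both files; the function names and the mean-quality criterion are hypothetical choices, not what any particular Galaxy tool does.

```python
def parse_fastq(text):
    """Return (header, sequence, separator, quality) tuples from 4-line FASTQ text."""
    lines = text.strip().splitlines()
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines) - len(lines) % 4, 4)]


def mean_phred(quality, offset=33):
    """Mean phred score of a quality string in Sanger encoding (offset 33)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_pairs(reads1, reads2, min_mean=20):
    """Keep a pair only when BOTH mates reach the mean-quality threshold."""
    kept1, kept2 = [], []
    for r1, r2 in zip(reads1, reads2):
        if mean_phred(r1[3]) >= min_mean and mean_phred(r2[3]) >= min_mean:
            kept1.append(r1)
            kept2.append(r2)
    return kept1, kept2


# Tiny made-up example: 'I' encodes phred 40, '5' encodes 20, '!' encodes 0.
forward = parse_fastq("@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n")
reverse = parse_fastq("@r1/2\nTTTT\n+\n5555\n@r2/2\nTTTT\n+\n!!!!\n")
kept_fwd, kept_rev = filter_pairs(forward, reverse)
print(len(kept_fwd), len(kept_rev))
```

Dropping both mates together keeps the two files in step, which a paired-end mapper needs. Filtering each file independently would leave the (/1) and (/2) files out of sync.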
>  
> Fourth, when using Galaxy on the AWS cloud, does adding additional cores or 
> adding High CPU (or both) shorten the time to process the data? When I set 
> up extra cores, it appeared that some of them are idle and I don't want to 
> pay for idle cores. If anyone could share information on how best to manage 
> the cloud, it would be appreciated.
>  
> Finally, what is the difference between “stopping” an instance and 
> “terminating” an instance on the cloud? Would I still get charged by AWS if I 
> just stop an instance? Any clarification in this area would also be much 
> appreciated. Again, time is money!
> I hope this helps many of us!
>  
> Unfortunately, I will not be in Pitt to ask these questions in person.
>  
> Thanks in advance!!!
>  
> Mike
> 
> --- On Tue, 4/5/11, Lali <[email protected]> wrote:
> 
> From: Lali <[email protected]>
> Subject: Re: [galaxy-user] Analyzing Targeted Resequencing data with Galaxy
> To: "Anton Nekrutenko" <[email protected]>
> Cc: "galaxy-user" <[email protected]>
> Date: Tuesday, April 5, 2011, 11:50 AM
> 
> Ohh sorry about that!
> I am using both Windows XP and Ubuntu and I usually use Google Chrome.
> 
> 
> On Tue, Apr 5, 2011 at 5:33 PM, Anton Nekrutenko <[email protected]> wrote:
> Lali:
> 
> Please always CC the mailing list when you reply. 
> 
>> My only problem with Galaxy is that I have to keep on clearing my cache in 
>> order to get the history to display correctly, is there another way of 
>> solving this issue?
> 
> 
> Which browser/OS are you using?
> 
> Thanks,
> 
> anton
> galaxy team
> 
> On Apr 5, 2011, at 11:25 AM, Lali wrote:
> 
>> Thanks so much for the tips Anton!
>> I am very excited about the newer developments.
>> I did watch the quickies and they were very useful for a beginner like me. I 
>> actually did my first try at the alignment by following the Illumina 
>> single-end tutorial video step by step, but you need to watch the paired-end 
>> one too, because it explains some of the first steps better.
>> I have been playing around a lot with Galaxy, and I have several workflows. 
>> My department just started doing sequencing, so we don't have standard 
>> procedures set in place. I was assigned to evaluate Galaxy and CLC, and so 
>> far CLC has not impressed me, except for the fact that it can generate 
>> reports easily.
>> I think Galaxy is the way to go for me (us, if I can convince them to run a 
>> local server), since I am not a bioinformatician, and just the fact that you 
>> can queue up actions and just walk away is fantastic (amongst other things).
>> But because I am a beginner, I am not 100% sure of the settings I have chosen, 
>> and my data is not looking too good so far. A bioinformatician is coming over 
>> to help me on Thursday, and I think your tips will be of help.
>> My only problem with Galaxy is that I have to keep on clearing my cache in 
>> order to get the history to display correctly, is there another way of 
>> solving this issue?
>> 
>> Best regards,
>> 
>> L
>> 
>> On Tue, Apr 5, 2011 at 3:56 PM, Anton Nekrutenko <[email protected]> wrote:
>> Lali:
>> 
>> In your case the workflow for capture re-sequencing should look like this:
>> 
>> 1. QC data (groom fastq files and plot quality distribution)
>> 2. Map the reads (use bwa)
>> 3. Generate and filter pileup
>> 4. Intersect pileup with coordinates of SureSelect baits.
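
Step 4 above amounts to keeping only the pileup positions that fall inside a bait interval. A minimal sketch of that test follows, assuming the bait coordinates are available in BED convention (0-based start, exclusive end) and that pileup positions are 1-based; `build_index`, `in_bait` and the sample intervals are hypothetical names and data, not part of Galaxy.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(baits):
    """Group (chrom, start, end) BED intervals by chromosome and merge overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in baits:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        merged = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:  # overlapping or touching: extend the last span
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def in_bait(index, chrom, pos):
    """True when the 1-based position `pos` lies inside a merged bait interval."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    zero_based = pos - 1
    i = bisect_right(starts, zero_based) - 1  # last interval starting at or before pos
    return i >= 0 and zero_based < merged[i][1]


# Made-up bait intervals for demonstration only.
index = build_index([("chr1", 100, 200), ("chr1", 150, 250), ("chr2", 0, 10)])
print(in_bait(index, "chr1", 101), in_bait(index, "chr1", 251))
```

Merging overlapping baits first means each lookup is a single binary search, which keeps the test cheap even for millions of pileup rows.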
>> 
>> However, before you dive in please understand basic Galaxy functionality by 
>> taking a look at http://usegalaxy.org/galaxy101 and watching *all* 
>> Illumina-related Galaxy quickies (black boxes on the front page on Galaxy). 
>> Next, take a look at http://usegalaxy.org/heteroplasmy.
>> 
>> Note that we are working on bringing "industrial-strength" diploid 
>> genotyping functionality to Galaxy in the next two to three months that will 
>> include more sophisticated genotypers, recalibration and realignment tools, 
>> and novel visualization approaches.
>> 
>> Thanks for using Galaxy.
>> 
>> anton
>> galaxy team
>> 
>> 
>> 
>> On Apr 5, 2011, at 2:44 AM, Lali wrote:
>> 
>> > Hi!
>> > I am having problems with my sequencing results, but I am a newbie at 
>> > this; so I am thinking there is something wrong with my analysis. So far, 
>> > I've tried Galaxy and CLC Workbench, but with CLC I could not align to the 
>> > whole genome, only to individual chromosomes (maybe there is a way, but by 
>> > the time the trial ended I had not found it).
>> >
>> > I used SureSelect capture kit and did single end sequencing on an 
>> > Illumina. The files the lab sent me are FastQ Illumina 1.5 files, my 
>> > samples were indexed, and I got a series of files each representing an 
>> > Index.
>> >
>> > What would be the standard workflow for this kind of data?
>> > Which tools/settings?
>> >
>> > Does anyone have an example Galaxy workflow for preparing (clipping 
>> > adapters, quality trimming) and mapping Targeted Resequencing Data?
>> >
>> > Is there a way to obtain a coverage report through Galaxy?
>> >
>> > Is it possible to ignore/discard the reads mapped when the coverage is 
>> > below a certain threshold?
>> >
>> > I know, I know, a lot of things, but I am very lost.
>> > Any help is appreciated.
>> >
>> > L
>> >
>> > ___________________________________________________________
>> > The Galaxy User list should be used for the discussion of
>> > Galaxy analysis and other features on the public server
>> > at usegalaxy.org.  Please keep all replies on the list by
>> > using "reply all" in your mail client.  For discussion of
>> > local Galaxy instances and the Galaxy source code, please
>> > use the Galaxy Development list:
>> >
>> >  http://lists.bx.psu.edu/listinfo/galaxy-dev
>> >
>> > To manage your subscriptions to this and other Galaxy lists,
>> > please use the interface at:
>> >
>> >  http://lists.bx.psu.edu/
>> 
>> Anton Nekrutenko
>> http://nekrut.bx.psu.edu
>> http://usegalaxy.org
>> 
>> 
>> 
>> 
> 
> Anton Nekrutenko
> http://nekrut.bx.psu.edu
> http://usegalaxy.org
> 
> 
> 
> 
> 

Anton Nekrutenko
http://nekrut.bx.psu.edu
http://usegalaxy.org
