Re: [Bioc-sig-seq] chipseq infrastructure

Raphael Gottardo Wed, 03 Mar 2010 04:15:29 -0800

Thanks Vincent. This looks good. 
On 2010-03-02, at 11:01 PM, Vincent Carey wrote:


> NB: At the end of the comments I meant to write "pretty paltry", not
> "pretty" ... it isn't.
> 
> On Tue, Mar 2, 2010 at 10:59 PM, Vincent Carey
> <[email protected]> wrote:
>> Here is an example of using Rsamtools to reproduce a little bit of the
>> resources associated with some ChIP-seq data from the paper by R
>> Lister et al (Nature 2009 PMID 19829295)
>> 
>> 1) a BAM file was retrieved from GEO, specifically
>> 
>> GSM466738_UCSD.H1.H3K36me3.UCSD_42B3BAAXX_6.bam
>> 
>> This file had to be sorted and indexed using samtools; I added 'srt' before 
>> .bam
>> 
>> 2) the GEO page for the series includes a reference to a browser; i
>> went to chromosome 2
>> 
>> http://www.ncbi.nlm.nih.gov/nuccore/89161199?report=graph&tracks=key:sequence_track;key:graph_track,annots:NA000000149.1|NA000000150.1|NA000000146.1|NA000000148.1|NA000000169.1|NA000000170.1;key:gene_model_track,RNAs:false,CDSs:false,Exons:false,annots:Unnamed;key:feature_track,subkey:misc_RNA,annots:other
>> 
>> and zoomed into a region around 132M + strand, where I saw something
>> that looked like a peak in H3K36me3
>> 
>> 3) Now we can make a BamViews instance based on the GSM466738 file,
>> and focus on any region of interest by setting bamRanges.
>> 
>> library(Rsamtools)
>> zz = BamViews(bamPaths="GSM466738_UCSD.H1.H3K36me3.UCSD_313BFAAXX_6srt.bam")
>> bamRanges(zz) = RangedData(IRanges(132500000,133000000), space="chr2")
>> 
>> 4) The alignment information can be retrieved and coverage data extracted
>> 
>> dd = readBAMasGappedAlignments(zz)
>> cc = coverage(dd[[1]])$chr2
>> ccx = cumsum(c...@lengths)
>> ccy = c...@values
>> 
>> plot(ccx, ccy) # shows the peak
>> 
>> 5) The ease with which detailed information on aligned reads can be
>> extracted using Rsamtools has led me to favor BAM as a 'hub' format
>> for those considering working with Bioconductor for NGS data.  The
>> voluminous resources are left outside of R, and we can bring things in
>> at various stages of refinement.  More examples of efficient
>> reproduction of published NGS results using bioc are desired -- this
>> one is pretty and is offered mainly to start getting a little more
>> concrete on possible roles of Rsamtools in downstream workflows.
>> 
>> On Mon, Mar 1, 2010 at 7:46 PM, Raphael Gottardo
>> <[email protected]> wrote:
>>> Hi Michael (and others),
>>> 
>>> I would certainly second that. You guys have develop great tools for low 
>>> level analysis of next gen data, but higher level analysis are still 
>>> lagging behind. Though, this is rather normal as the higher level stuff 
>>> needs the lower level infrastructure.
>>> 
>>> My group has been working on several aspects of chip-seq analysis and to 
>>> some extend gene regulation.
>>> As noted in one of the email this morning, we are about to submit our PICS 
>>> software based on a version of this paper http://arxiv.org/abs/0903.3206, 
>>> which we hope will be published in Biometrics in the near future. For our 
>>> package we have used some of the infrastructure available in the chip-seq 
>>> package, and IRanges.
>>> 
>>> One the problem we have faced is data input. In chipseq, one does not need 
>>> sequence reads. However, when you use ShortReads you automatically get the 
>>> sequence reads which takes a lot of memory. For some highly sequenced data 
>>> we have, it has been somewhat of a bottleneck.
>>> So it would be nice to be able to only read the chr/start/strand 
>>> information. As pointed out by Wolfgang, rsamtools might be the solution, 
>>> so we will have to see how we can use rsamtools and the classes defined 
>>> there for chip-seq. This being said we still have a lot of files from non 
>>> MAQ aligners.
>>> I think Arnaud Droit, who is in my group, has sent an email about this 
>>> issue already.
>>> 
>>> Besides PICS that will be submitted this week, we have already released a 
>>> package for motif analyses, rGADEM, which can work on standard Biostrings 
>>> objects. rGADEM is relatively fast and well adapted for ChIP-seq enriched 
>>> regions. We also have another package, MotIV for motif validation and 
>>> identification based which is based on STAMP (with many improved 
>>> functionalities). MotIV is under review I believe and should be available 
>>> soon.
>>> 
>>> Anyway, so very soon we will have a complete pipeline from shortread -> 
>>> enriched regions (PICS) -> motifs (rGADEM) -> validated motifs and motif 
>>> occurrences (MotIV) -> other BioC packages (e.g. GenomicsFeatures, etc).
>>> 
>>> So at least this will be a start. Of course we are open to 
>>> suggestions/requests, etc. If any of you guys want more details feel free 
>>> to drop us an email.
>>> 
>>> Cheers,
>>> 
>>> Raphael
>>> 
>>> On 2010-03-01, at 10:08 AM, Michael Lawrence wrote:
>>> 
>>>> Hey guys,
>>>> 
>>>> I'm wondering if anyone has given any thought to some sort of generic
>>>> framework for chipseq analysis in Bioconductor, based on the IRanges,
>>>> Biostrings, etc infrastructure. chipseq has some nice utilities; could it 
>>>> be
>>>> transformed into some sort of generic chipseq pipeline? Something like how
>>>> the 'affy' package (I think?) allows other packages to provide alternative
>>>> implementations for particular stages. Just having a clean, refined,
>>>> approximately complete set of chipseq-focused utilities would be nice.
>>>> Presumably chipseq could fill that role? I think we now have a good idea of
>>>> the basic steps in chipseq analysis, so it's probably time for such a
>>>> package to emerge.
>>>> 
>>>> Comments?
>>>> 
>>>> Michael
>>>> 
>>>>       [[alternative HTML version deleted]]
>>>> 
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> [email protected]
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>> 
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> [email protected]
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>> 
>>

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] chipseq infrastructure

Reply via email to