Re: [Bioc-sig-seq] chipseq infrastructure

Vincent Carey Tue, 02 Mar 2010 20:02:38 -0800

NB: At the end of the comments I meant to write "pretty paltry", not
"pretty" ... it isn't.


On Tue, Mar 2, 2010 at 10:59 PM, Vincent Carey
<[email protected]> wrote:
> Here is an example of using Rsamtools to reproduce a little bit of the
> resources associated with some ChIP-seq data from the paper by R
> Lister et al (Nature 2009 PMID 19829295)
>
> 1) a BAM file was retrieved from GEO, specifically
>
> GSM466738_UCSD.H1.H3K36me3.UCSD_42B3BAAXX_6.bam
>
> This file had to be sorted and indexed using samtools; I added 'srt' before 
> .bam
>
> 2) the GEO page for the series includes a reference to a browser; i
> went to chromosome 2
>
> http://www.ncbi.nlm.nih.gov/nuccore/89161199?report=graph&tracks=key:sequence_track;key:graph_track,annots:NA000000149.1|NA000000150.1|NA000000146.1|NA000000148.1|NA000000169.1|NA000000170.1;key:gene_model_track,RNAs:false,CDSs:false,Exons:false,annots:Unnamed;key:feature_track,subkey:misc_RNA,annots:other
>
> and zoomed into a region around 132M + strand, where I saw something
> that looked like a peak in H3K36me3
>
> 3) Now we can make a BamViews instance based on the GSM466738 file,
> and focus on any region of interest by setting bamRanges.
>
> library(Rsamtools)
> zz = BamViews(bamPaths="GSM466738_UCSD.H1.H3K36me3.UCSD_313BFAAXX_6srt.bam")
> bamRanges(zz) = RangedData(IRanges(132500000,133000000), space="chr2")
>
> 4) The alignment information can be retrieved and coverage data extracted
>
> dd = readBAMasGappedAlignments(zz)
> cc = coverage(dd[[1]])$chr2
> ccx = cumsum(c...@lengths)
> ccy = c...@values
>
> plot(ccx, ccy) # shows the peak
>
> 5) The ease with which detailed information on aligned reads can be
> extracted using Rsamtools has led me to favor BAM as a 'hub' format
> for those considering working with Bioconductor for NGS data.  The
> voluminous resources are left outside of R, and we can bring things in
> at various stages of refinement.  More examples of efficient
> reproduction of published NGS results using bioc are desired -- this
> one is pretty and is offered mainly to start getting a little more
> concrete on possible roles of Rsamtools in downstream workflows.
>
> On Mon, Mar 1, 2010 at 7:46 PM, Raphael Gottardo
> <[email protected]> wrote:
>> Hi Michael (and others),
>>
>> I would certainly second that. You guys have develop great tools for low 
>> level analysis of next gen data, but higher level analysis are still lagging 
>> behind. Though, this is rather normal as the higher level stuff needs the 
>> lower level infrastructure.
>>
>> My group has been working on several aspects of chip-seq analysis and to 
>> some extend gene regulation.
>> As noted in one of the email this morning, we are about to submit our PICS 
>> software based on a version of this paper http://arxiv.org/abs/0903.3206, 
>> which we hope will be published in Biometrics in the near future. For our 
>> package we have used some of the infrastructure available in the chip-seq 
>> package, and IRanges.
>>
>> One the problem we have faced is data input. In chipseq, one does not need 
>> sequence reads. However, when you use ShortReads you automatically get the 
>> sequence reads which takes a lot of memory. For some highly sequenced data 
>> we have, it has been somewhat of a bottleneck.
>> So it would be nice to be able to only read the chr/start/strand 
>> information. As pointed out by Wolfgang, rsamtools might be the solution, so 
>> we will have to see how we can use rsamtools and the classes defined there 
>> for chip-seq. This being said we still have a lot of files from non MAQ 
>> aligners.
>> I think Arnaud Droit, who is in my group, has sent an email about this issue 
>> already.
>>
>> Besides PICS that will be submitted this week, we have already released a 
>> package for motif analyses, rGADEM, which can work on standard Biostrings 
>> objects. rGADEM is relatively fast and well adapted for ChIP-seq enriched 
>> regions. We also have another package, MotIV for motif validation and 
>> identification based which is based on STAMP (with many improved 
>> functionalities). MotIV is under review I believe and should be available 
>> soon.
>>
>> Anyway, so very soon we will have a complete pipeline from shortread -> 
>> enriched regions (PICS) -> motifs (rGADEM) -> validated motifs and motif 
>> occurrences (MotIV) -> other BioC packages (e.g. GenomicsFeatures, etc).
>>
>> So at least this will be a start. Of course we are open to 
>> suggestions/requests, etc. If any of you guys want more details feel free to 
>> drop us an email.
>>
>> Cheers,
>>
>> Raphael
>>
>> On 2010-03-01, at 10:08 AM, Michael Lawrence wrote:
>>
>>> Hey guys,
>>>
>>> I'm wondering if anyone has given any thought to some sort of generic
>>> framework for chipseq analysis in Bioconductor, based on the IRanges,
>>> Biostrings, etc infrastructure. chipseq has some nice utilities; could it be
>>> transformed into some sort of generic chipseq pipeline? Something like how
>>> the 'affy' package (I think?) allows other packages to provide alternative
>>> implementations for particular stages. Just having a clean, refined,
>>> approximately complete set of chipseq-focused utilities would be nice.
>>> Presumably chipseq could fill that role? I think we now have a good idea of
>>> the basic steps in chipseq analysis, so it's probably time for such a
>>> package to emerge.
>>>
>>> Comments?
>>>
>>> Michael
>>>
>>>       [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> [email protected]
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] chipseq infrastructure

Reply via email to