Thanks Vincent. This looks good. On 2010-03-02, at 11:01 PM, Vincent Carey wrote:
> NB: At the end of the comments I meant to write "pretty paltry", not > "pretty" ... it isn't. > > On Tue, Mar 2, 2010 at 10:59 PM, Vincent Carey > <[email protected]> wrote: >> Here is an example of using Rsamtools to reproduce a little bit of the >> resources associated with some ChIP-seq data from the paper by R >> Lister et al (Nature 2009 PMID 19829295) >> >> 1) a BAM file was retrieved from GEO, specifically >> >> GSM466738_UCSD.H1.H3K36me3.UCSD_42B3BAAXX_6.bam >> >> This file had to be sorted and indexed using samtools; I added 'srt' before >> .bam >> >> 2) the GEO page for the series includes a reference to a browser; i >> went to chromosome 2 >> >> http://www.ncbi.nlm.nih.gov/nuccore/89161199?report=graph&tracks=key:sequence_track;key:graph_track,annots:NA000000149.1|NA000000150.1|NA000000146.1|NA000000148.1|NA000000169.1|NA000000170.1;key:gene_model_track,RNAs:false,CDSs:false,Exons:false,annots:Unnamed;key:feature_track,subkey:misc_RNA,annots:other >> >> and zoomed into a region around 132M + strand, where I saw something >> that looked like a peak in H3K36me3 >> >> 3) Now we can make a BamViews instance based on the GSM466738 file, >> and focus on any region of interest by setting bamRanges. >> >> library(Rsamtools) >> zz = BamViews(bamPaths="GSM466738_UCSD.H1.H3K36me3.UCSD_313BFAAXX_6srt.bam") >> bamRanges(zz) = RangedData(IRanges(132500000,133000000), space="chr2") >> >> 4) The alignment information can be retrieved and coverage data extracted >> >> dd = readBAMasGappedAlignments(zz) >> cc = coverage(dd[[1]])$chr2 >> ccx = cumsum(c...@lengths) >> ccy = c...@values >> >> plot(ccx, ccy) # shows the peak >> >> 5) The ease with which detailed information on aligned reads can be >> extracted using Rsamtools has led me to favor BAM as a 'hub' format >> for those considering working with Bioconductor for NGS data. The >> voluminous resources are left outside of R, and we can bring things in >> at various stages of refinement. More examples of efficient >> reproduction of published NGS results using bioc are desired -- this >> one is pretty and is offered mainly to start getting a little more >> concrete on possible roles of Rsamtools in downstream workflows. >> >> On Mon, Mar 1, 2010 at 7:46 PM, Raphael Gottardo >> <[email protected]> wrote: >>> Hi Michael (and others), >>> >>> I would certainly second that. You guys have develop great tools for low >>> level analysis of next gen data, but higher level analysis are still >>> lagging behind. Though, this is rather normal as the higher level stuff >>> needs the lower level infrastructure. >>> >>> My group has been working on several aspects of chip-seq analysis and to >>> some extend gene regulation. >>> As noted in one of the email this morning, we are about to submit our PICS >>> software based on a version of this paper http://arxiv.org/abs/0903.3206, >>> which we hope will be published in Biometrics in the near future. For our >>> package we have used some of the infrastructure available in the chip-seq >>> package, and IRanges. >>> >>> One the problem we have faced is data input. In chipseq, one does not need >>> sequence reads. However, when you use ShortReads you automatically get the >>> sequence reads which takes a lot of memory. For some highly sequenced data >>> we have, it has been somewhat of a bottleneck. >>> So it would be nice to be able to only read the chr/start/strand >>> information. As pointed out by Wolfgang, rsamtools might be the solution, >>> so we will have to see how we can use rsamtools and the classes defined >>> there for chip-seq. This being said we still have a lot of files from non >>> MAQ aligners. >>> I think Arnaud Droit, who is in my group, has sent an email about this >>> issue already. >>> >>> Besides PICS that will be submitted this week, we have already released a >>> package for motif analyses, rGADEM, which can work on standard Biostrings >>> objects. rGADEM is relatively fast and well adapted for ChIP-seq enriched >>> regions. We also have another package, MotIV for motif validation and >>> identification based which is based on STAMP (with many improved >>> functionalities). MotIV is under review I believe and should be available >>> soon. >>> >>> Anyway, so very soon we will have a complete pipeline from shortread -> >>> enriched regions (PICS) -> motifs (rGADEM) -> validated motifs and motif >>> occurrences (MotIV) -> other BioC packages (e.g. GenomicsFeatures, etc). >>> >>> So at least this will be a start. Of course we are open to >>> suggestions/requests, etc. If any of you guys want more details feel free >>> to drop us an email. >>> >>> Cheers, >>> >>> Raphael >>> >>> On 2010-03-01, at 10:08 AM, Michael Lawrence wrote: >>> >>>> Hey guys, >>>> >>>> I'm wondering if anyone has given any thought to some sort of generic >>>> framework for chipseq analysis in Bioconductor, based on the IRanges, >>>> Biostrings, etc infrastructure. chipseq has some nice utilities; could it >>>> be >>>> transformed into some sort of generic chipseq pipeline? Something like how >>>> the 'affy' package (I think?) allows other packages to provide alternative >>>> implementations for particular stages. Just having a clean, refined, >>>> approximately complete set of chipseq-focused utilities would be nice. >>>> Presumably chipseq could fill that role? I think we now have a good idea of >>>> the basic steps in chipseq analysis, so it's probably time for such a >>>> package to emerge. >>>> >>>> Comments? >>>> >>>> Michael >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioc-sig-sequencing mailing list >>>> [email protected] >>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>> >>> _______________________________________________ >>> Bioc-sig-sequencing mailing list >>> [email protected] >>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>> >> _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
