NB: At the end of the comments I meant to write "pretty paltry", not "pretty" ... it isn't.
On Tue, Mar 2, 2010 at 10:59 PM, Vincent Carey <[email protected]> wrote: > Here is an example of using Rsamtools to reproduce a little bit of the > resources associated with some ChIP-seq data from the paper by R > Lister et al (Nature 2009 PMID 19829295) > > 1) a BAM file was retrieved from GEO, specifically > > GSM466738_UCSD.H1.H3K36me3.UCSD_42B3BAAXX_6.bam > > This file had to be sorted and indexed using samtools; I added 'srt' before > .bam > > 2) the GEO page for the series includes a reference to a browser; i > went to chromosome 2 > > http://www.ncbi.nlm.nih.gov/nuccore/89161199?report=graph&tracks=key:sequence_track;key:graph_track,annots:NA000000149.1|NA000000150.1|NA000000146.1|NA000000148.1|NA000000169.1|NA000000170.1;key:gene_model_track,RNAs:false,CDSs:false,Exons:false,annots:Unnamed;key:feature_track,subkey:misc_RNA,annots:other > > and zoomed into a region around 132M + strand, where I saw something > that looked like a peak in H3K36me3 > > 3) Now we can make a BamViews instance based on the GSM466738 file, > and focus on any region of interest by setting bamRanges. > > library(Rsamtools) > zz = BamViews(bamPaths="GSM466738_UCSD.H1.H3K36me3.UCSD_313BFAAXX_6srt.bam") > bamRanges(zz) = RangedData(IRanges(132500000,133000000), space="chr2") > > 4) The alignment information can be retrieved and coverage data extracted > > dd = readBAMasGappedAlignments(zz) > cc = coverage(dd[[1]])$chr2 > ccx = cumsum(c...@lengths) > ccy = c...@values > > plot(ccx, ccy) # shows the peak > > 5) The ease with which detailed information on aligned reads can be > extracted using Rsamtools has led me to favor BAM as a 'hub' format > for those considering working with Bioconductor for NGS data. The > voluminous resources are left outside of R, and we can bring things in > at various stages of refinement. More examples of efficient > reproduction of published NGS results using bioc are desired -- this > one is pretty and is offered mainly to start getting a little more > concrete on possible roles of Rsamtools in downstream workflows. > > On Mon, Mar 1, 2010 at 7:46 PM, Raphael Gottardo > <[email protected]> wrote: >> Hi Michael (and others), >> >> I would certainly second that. You guys have develop great tools for low >> level analysis of next gen data, but higher level analysis are still lagging >> behind. Though, this is rather normal as the higher level stuff needs the >> lower level infrastructure. >> >> My group has been working on several aspects of chip-seq analysis and to >> some extend gene regulation. >> As noted in one of the email this morning, we are about to submit our PICS >> software based on a version of this paper http://arxiv.org/abs/0903.3206, >> which we hope will be published in Biometrics in the near future. For our >> package we have used some of the infrastructure available in the chip-seq >> package, and IRanges. >> >> One the problem we have faced is data input. In chipseq, one does not need >> sequence reads. However, when you use ShortReads you automatically get the >> sequence reads which takes a lot of memory. For some highly sequenced data >> we have, it has been somewhat of a bottleneck. >> So it would be nice to be able to only read the chr/start/strand >> information. As pointed out by Wolfgang, rsamtools might be the solution, so >> we will have to see how we can use rsamtools and the classes defined there >> for chip-seq. This being said we still have a lot of files from non MAQ >> aligners. >> I think Arnaud Droit, who is in my group, has sent an email about this issue >> already. >> >> Besides PICS that will be submitted this week, we have already released a >> package for motif analyses, rGADEM, which can work on standard Biostrings >> objects. rGADEM is relatively fast and well adapted for ChIP-seq enriched >> regions. We also have another package, MotIV for motif validation and >> identification based which is based on STAMP (with many improved >> functionalities). MotIV is under review I believe and should be available >> soon. >> >> Anyway, so very soon we will have a complete pipeline from shortread -> >> enriched regions (PICS) -> motifs (rGADEM) -> validated motifs and motif >> occurrences (MotIV) -> other BioC packages (e.g. GenomicsFeatures, etc). >> >> So at least this will be a start. Of course we are open to >> suggestions/requests, etc. If any of you guys want more details feel free to >> drop us an email. >> >> Cheers, >> >> Raphael >> >> On 2010-03-01, at 10:08 AM, Michael Lawrence wrote: >> >>> Hey guys, >>> >>> I'm wondering if anyone has given any thought to some sort of generic >>> framework for chipseq analysis in Bioconductor, based on the IRanges, >>> Biostrings, etc infrastructure. chipseq has some nice utilities; could it be >>> transformed into some sort of generic chipseq pipeline? Something like how >>> the 'affy' package (I think?) allows other packages to provide alternative >>> implementations for particular stages. Just having a clean, refined, >>> approximately complete set of chipseq-focused utilities would be nice. >>> Presumably chipseq could fill that role? I think we now have a good idea of >>> the basic steps in chipseq analysis, so it's probably time for such a >>> package to emerge. >>> >>> Comments? >>> >>> Michael >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioc-sig-sequencing mailing list >>> [email protected] >>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> [email protected] >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> > _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
