On Thu, Sep 17, 2009 at 10:21 PM, Patrick Aboyoun <[email protected]>wrote:
> Michael, > This import operation has similarities with the rdapply framework. We > should have the two types of operations sharing similar components such as > filtering. > > I've been meaning to port the rdapply filtering over to the ShortRead filter framework (which would need to be brought down into IRanges or something). > To everyone, > What other data reduction operations would you like to have on bed file > import? > > > Patrick > > > Michael Lawrence wrote: > >> >> >> On Thu, Sep 17, 2009 at 4:35 PM, Patrick Aboyoun <[email protected]<mailto: >> [email protected]>> wrote: >> >> Michael, >> We could add that level of flexibility, but I worry that may open >> the door too widely because the aggregator will need to take the >> old RangedData object and the new block RangedData object as >> arguments and produce a new RangedData object for the next >> iteration and users may find it difficult to construct a valid >> aggregator. >> >> >> Well just to clarify, I thought the aggregator would take the old and new >> objects as output by the block processor. So for coverage, one function >> would take a RangedData and return a RleList, and the aggregator would take >> two RleLists and add them. >> >> The call for calculating the coverage would be like: >> >> cov <- import("reads.bed", blockSize = 500000, blockFun = coverage, >> aggregateFun = `+`) >> >> Of course, high-level functionality would be nice for routine operations. >> That will help prevent the user from making mistakes. The blockFun and >> aggregateFun could be bound into a single class BlockHandler. >> >> cov <- import("reads.bed", blockSize = 500000, blockHandler = >> CoverageHandler()) >> >> I think there may also be a need for filters to allow the user to >> keep only those rows that satisfy their criteria. We could supply >> chromosome, strand, score, etc. filters to meet this need. >> >> >> Yes, and the Filter class could inherit from BlockHandler, so that it >> works incrementally. import() could be smart enough to incorporate both a >> filter and a user block handler into a single "meta" handler. >> >> So to obtain an RleList of coverage, but only over chromosome 1, >> processing half a million reads at a time: >> >> cov <- import("reads.bed", blockSize = 500000, blockHandler = >> CoverageHandler(), filter = ChromosomeFilter("chr1")) >> >> Michael >> >> >> Patrick >> >> >> Michael Lawrence wrote: >> >> I don't know if we want to add explicit support for scoring, >> when there are many reasons why one might want to >> incrementally load a track. One solution would be a design >> based on callbacks. The user specifies the block size and a >> callback to be invoked for each block. The result could be a >> list, with each element holding the result for a block. The >> user could then aggregate that. To be even more efficient, we >> could support a second callback that aggregates at each step. >> >> On Thu, Sep 17, 2009 at 3:47 PM, Patrick Aboyoun >> <[email protected] <mailto:[email protected]> >> <mailto:[email protected] <mailto:[email protected]>>> wrote: >> >> Michael, >> Before people start developing their own bed file parsers, it >> looks pretty straight-forward to incorporate this operation in >> rtracklayer now. You or I could add a scoreRanges argument, >> which >> takes either a logical or perhaps a character string if we >> envision there being more than one way to score a set of >> ranges, >> to the import* functions and when a appropriate value is >> supplied, >> it will internally block process the data file and output a >> RangedData object with a score column. >> >> >> Patrick >> >> >> >> Michael Lawrence wrote: >> >> On Thu, Sep 17, 2009 at 2:17 PM, Ivan Gregoretti >> <[email protected] <mailto:[email protected]> >> <mailto:[email protected] <mailto:[email protected]>>> wrote: >> >> Hi Michael, >> >> True; there are no scores in this BED file. That >> brings us >> to Sean's >> remark. >> >> You may have noticed that all features in the BED >> file are >> 36 bases >> long. Each feature is the position and orientation of a >> Solexa read. >> >> I want the value in each bin in the WIG file to >> represent >> the number >> of tags contained in that bin. >> >> So, coverage() is close to what I am looking for but I >> think that >> pileup() is perhaps better. Then comes the problem >> of how >> to place >> those values in an object exportable as WIG. Lets >> not talk >> about >> efficiency for now. >> >> >> Ignoring efficiency, it's pretty >> straight-forward. In the >> development >> version of IRanges, one can convert an Rle (or RleList for >> multiple >> chromosomes) as output to coverage() to a RangedData. >> In the >> development >> version of rtracklayer, this is done implicitly: >> >> export(cov, "coverage.wig") >> >> >> I think that I can solve this problem >> myself but >> given that: >> >> 1) there is a package that can read BED and write >> WIG files >> and >> 2) BED to WIG conversion is such a common task >> >> I thought that there must have been an existing tool or >> perhaps an >> established efficient way of doing this with BioC >> tools. >> Why is this >> important? Well, this myRegions.bed contains 4.3 >> million >> records. >> Loading them with import() used about 1GB of RAM. >> My real >> world BED >> files are never less than 50 million reads, 150 >> more like. >> >> Conclusion: I DO care about an expert's opinion of >> BED-> WIG >> conversion within R. >> >> >> Maybe nobody does this with R. That is also acceptable. >> What >> tool do >> you use then? >> >> >> The bottleneck seems to be loading and >> representing >> large BED >> files, since >> after the reduction to coverage the data size is usually >> manageable. >> rtracklayer was never designed to load BED files with >> millions >> of records. >> In my experience, short reads are more commonly >> represented in >> more >> efficient formats like those output by MAQ and bowtie, >> which >> are imported >> with ShortRead. >> >> There is much room for optimization in rtracklayer, but for >> now one idea is >> to calculate coverage on subsets of the reads, and >> aggregate >> the results. >> The BED file can be loaded with simple low-level calls like >> read.table and >> scan. >> >> Michael >> >> >> >> Thank you, >> >> Ivan >> >> >> Ivan Gregoretti, PhD >> National Institute of Diabetes and Digestive and Kidney >> Diseases >> National Institutes of Health >> 5 Memorial Dr, Building 5, Room 205. >> Bethesda, MD 20892. USA. >> Phone: 1-301-496-1592 >> Fax: 1-301-496-9878 >> >> >> >> On Thu, Sep 17, 2009 at 4:45 PM, Sean Davis >> <[email protected] <mailto:[email protected]> >> <mailto:[email protected] <mailto:[email protected]>>> wrote: >> On Thu, Sep 17, 2009 at 4:43 PM, >> Michael Lawrence >> <[email protected] <mailto:[email protected]> >> <mailto:[email protected] <mailto:[email protected]>>> >> >> wrote: >> On Thu, Sep 17, 2009 >> at 8:14 AM, Ivan >> Gregoretti >> <[email protected] >> <mailto:[email protected]> <mailto:[email protected] >> >> <mailto:[email protected]>>> >> >> wrote: >> >> Hello >> everybody >> >> How do you convert BED formatted files >> to WIG >> files? >> >> >> I tried to do that with rtracklayer but it >> didn't quite succeed. >> >> This is the session's transcript: >> >> First, you can download >> >> http://dl.getdropbox.com/u/2051155/myRegions.bed, >> which looks like >> this >> >> chr1 3002444 3002479 + >> chr1 3002989 3003024 - >> chr1 3017603 3017638 + >> chr1 3017879 3017914 - >> chr1 3018173 3018208 + >> chr1 3018183 3018218 - >> chr1 3018183 3018218 - >> chr1 3019065 3019100 + >> chr1 3019761 3019796 - >> chr1 3020044 3020079 - >> ... >> >> Now to R >> >> suppressMessages(library(rtracklayer)) >> myRegions <- import('myRegions.bed') >> >> So far so good. Now I try: >> export(myRegions, 'myRegions.wig', >> format = 'wig') >> >> but I get: >> Error in export.ucsc(object, con, >> subformat, >> ...) : >> Track not compatible with WIG format: >> Overlapping features must be >> on separate strands and every feature width >> must be positive >> >> I seems that the error message is a feature >> rather than a bug. My >> interpretation is that export() does >> not like >> records like lines 6 and >> 7: >> chr1 3018183 3018218 - >> chr1 3018183 3018218 - >> >> So, how do you convert BED to WIG in your >> everyday work? >> >> >> Well, as >> rtracklayer said, this track is not >> representable by WIG, which >> is >> meant to communicate a single data value for a >> given genomic region (to >> generate the bar plots in the UCSC browser, for >> instance). There aren't >> any >> scores in your file, so why do you want to >> use WIG? >> >> Just guessing, here, >> but you may want to calculate >> coverage() first, >> Ivan? >> Sean >> >> Thank you, >> >> Ivan >> >> >> sessionInfo() >> >> R version 2.9.2 (2009-08-24) >> x86_64-redhat-linux-gnu >> >> locale: >> >> >> >> >> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C >> attached base packages: >> [1] stats graphics grDevices utils >> datasets methods base >> >> other attached packages: >> [1] rtracklayer_1.4.0 RCurl_0.94-1 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.4.0 Biostrings_2.12.0 >> BSgenome_1.12.0 >> IRanges_1.2.0 >> [5] tools_2.9.2 >> XML_2.3-0 >> >> Ivan Gregoretti, PhD >> National Institute of Diabetes and >> Digestive >> and Kidney Diseases >> National Institutes of Health >> 5 Memorial Dr, Building 5, Room 205. >> Bethesda, MD 20892. USA. >> Phone: 1-301-496-1592 >> Fax: 1-301-496-9878 >> >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> [email protected] >> <mailto:[email protected]> >> <mailto: >> [email protected] >> <mailto:[email protected]>> >> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> [email protected] >> <mailto:[email protected]> >> <mailto:[email protected] >> <mailto:[email protected]>> >> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> [email protected] >> <mailto:[email protected]> >> <mailto:[email protected] >> <mailto:[email protected]>> >> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> [email protected] >> <mailto:[email protected]> >> <mailto:[email protected] >> <mailto:[email protected]>> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> >> >> >> > [[alternative HTML version deleted]] _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
