On Thu, Sep 17, 2009 at 6:47 PM, Patrick Aboyoun <[email protected]> wrote:
> Michael, > Before people start developing their own bed file parsers, it looks pretty > straight-forward to incorporate this operation in rtracklayer now. You or I > could add a scoreRanges argument, which takes either a logical or perhaps a > character string if we envision there being more than one way to score a set > of ranges, to the import* functions and when a appropriate value is > supplied, it will internally block process the data file and output a > RangedData object with a score column. > > > The bedfile specification already includes a mechanism for including a score. Also, there is not a reason to expect that bedfiles will be sorted (at least in common practice), so block-processing would have to make the assumption that the bedfile is not sorted. In short, to do this right might be difficult. Sean > > Michael Lawrence wrote: > >> On Thu, Sep 17, 2009 at 2:17 PM, Ivan Gregoretti <[email protected]> >> wrote: >> >> >> >>> Hi Michael, >>> >>> True; there are no scores in this BED file. That brings us to Sean's >>> remark. >>> >>> You may have noticed that all features in the BED file are 36 bases >>> long. Each feature is the position and orientation of a Solexa read. >>> >>> I want the value in each bin in the WIG file to represent the number >>> of tags contained in that bin. >>> >>> So, coverage() is close to what I am looking for but I think that >>> pileup() is perhaps better. Then comes the problem of how to place >>> those values in an object exportable as WIG. Lets not talk about >>> efficiency for now. >>> >>> >>> >>> >> Ignoring efficiency, it's pretty straight-forward. In the development >> version of IRanges, one can convert an Rle (or RleList for multiple >> chromosomes) as output to coverage() to a RangedData. In the development >> version of rtracklayer, this is done implicitly: >> >> >> >>> export(cov, "coverage.wig") >>> >>> >> >> >> >> >>> I think that I can solve this problem myself but given that: >>> >>> 1) there is a package that can read BED and write WIG files >>> and >>> 2) BED to WIG conversion is such a common task >>> >>> I thought that there must have been an existing tool or perhaps an >>> established efficient way of doing this with BioC tools. Why is this >>> important? Well, this myRegions.bed contains 4.3 million records. >>> Loading them with import() used about 1GB of RAM. My real world BED >>> files are never less than 50 million reads, 150 more like. >>> >>> Conclusion: I DO care about an expert's opinion of BED-> WIG >>> conversion within R. >>> >>> >>> >> >> Maybe nobody does this with R. That is also acceptable. What tool do >> >> >>> you use then? >>> >>> >>> >>> >> The bottleneck seems to be loading and representing large BED files, since >> after the reduction to coverage the data size is usually manageable. >> rtracklayer was never designed to load BED files with millions of records. >> In my experience, short reads are more commonly represented in more >> efficient formats like those output by MAQ and bowtie, which are imported >> with ShortRead. >> >> There is much room for optimization in rtracklayer, but for now one idea >> is >> to calculate coverage on subsets of the reads, and aggregate the results. >> The BED file can be loaded with simple low-level calls like read.table and >> scan. >> >> Michael >> >> >> >> >> >>> Thank you, >>> >>> Ivan >>> >>> >>> Ivan Gregoretti, PhD >>> National Institute of Diabetes and Digestive and Kidney Diseases >>> National Institutes of Health >>> 5 Memorial Dr, Building 5, Room 205. >>> Bethesda, MD 20892. USA. >>> Phone: 1-301-496-1592 >>> Fax: 1-301-496-9878 >>> >>> >>> >>> On Thu, Sep 17, 2009 at 4:45 PM, Sean Davis <[email protected]> wrote: >>> >>> >>>> On Thu, Sep 17, 2009 at 4:43 PM, Michael Lawrence <[email protected]> >>>> wrote: >>>> >>>> >>>>> On Thu, Sep 17, 2009 at 8:14 AM, Ivan Gregoretti <[email protected]> >>>>> wrote: >>>>> >>>>> >>>>> >>>>>> Hello everybody >>>>>> >>>>>> How do you convert BED formatted files to WIG files? >>>>>> >>>>>> >>>>>> I tried to do that with rtracklayer but it didn't quite succeed. >>>>>> >>>>>> This is the session's transcript: >>>>>> >>>>>> First, you can download >>>>>> http://dl.getdropbox.com/u/2051155/myRegions.bed, which looks like >>>>>> this >>>>>> >>>>>> chr1 3002444 3002479 + >>>>>> chr1 3002989 3003024 - >>>>>> chr1 3017603 3017638 + >>>>>> chr1 3017879 3017914 - >>>>>> chr1 3018173 3018208 + >>>>>> chr1 3018183 3018218 - >>>>>> chr1 3018183 3018218 - >>>>>> chr1 3019065 3019100 + >>>>>> chr1 3019761 3019796 - >>>>>> chr1 3020044 3020079 - >>>>>> ... >>>>>> >>>>>> Now to R >>>>>> >>>>>> suppressMessages(library(rtracklayer)) >>>>>> myRegions <- import('myRegions.bed') >>>>>> >>>>>> So far so good. Now I try: >>>>>> export(myRegions, 'myRegions.wig', format = 'wig') >>>>>> >>>>>> but I get: >>>>>> Error in export.ucsc(object, con, subformat, ...) : >>>>>> Track not compatible with WIG format: Overlapping features must be >>>>>> on separate strands and every feature width must be positive >>>>>> >>>>>> I seems that the error message is a feature rather than a bug. My >>>>>> interpretation is that export() does not like records like lines 6 and >>>>>> 7: >>>>>> chr1 3018183 3018218 - >>>>>> chr1 3018183 3018218 - >>>>>> >>>>>> So, how do you convert BED to WIG in your everyday work? >>>>>> >>>>>> >>>>>> >>>>>> >>>>> Well, as rtracklayer said, this track is not representable by WIG, >>>>> which >>>>> is >>>>> meant to communicate a single data value for a given genomic region (to >>>>> generate the bar plots in the UCSC browser, for instance). There aren't >>>>> any >>>>> scores in your file, so why do you want to use WIG? >>>>> >>>>> >>>>> >>>> Just guessing, here, but you may want to calculate coverage() first, >>>> >>>> >>> Ivan? >>> >>> >>>> Sean >>>> >>>> >>>> >>>>> Thank you, >>>>>> >>>>>> Ivan >>>>>> >>>>>> >>>>>> >>>>>>> sessionInfo() >>>>>>> >>>>>>> >>>>>> R version 2.9.2 (2009-08-24) >>>>>> x86_64-redhat-linux-gnu >>>>>> >>>>>> locale: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C >>> >>> >>>> attached base packages: >>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>> >>>>>> other attached packages: >>>>>> [1] rtracklayer_1.4.0 RCurl_0.94-1 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] Biobase_2.4.0 Biostrings_2.12.0 BSgenome_1.12.0 >>>>>> >>>>>> >>>>> IRanges_1.2.0 >>> >>> >>>> [5] tools_2.9.2 XML_2.3-0 >>>>>> >>>>>> Ivan Gregoretti, PhD >>>>>> National Institute of Diabetes and Digestive and Kidney Diseases >>>>>> National Institutes of Health >>>>>> 5 Memorial Dr, Building 5, Room 205. >>>>>> Bethesda, MD 20892. USA. >>>>>> Phone: 1-301-496-1592 >>>>>> Fax: 1-301-496-9878 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioc-sig-sequencing mailing list >>>>>> [email protected] >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>>>>> >>>>>> >>>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioc-sig-sequencing mailing list >>>>> [email protected] >>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>>>> >>>>> >>>> >>>> >>> _______________________________________________ >>> Bioc-sig-sequencing mailing list >>> [email protected] >>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>> >>> >>> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> [email protected] >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> > _______________________________________________ > Bioc-sig-sequencing mailing list > [email protected] > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > [[alternative HTML version deleted]] _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
