Re: [Bioc-sig-seq] BED to WIG format conversion

Sean Davis Thu, 17 Sep 2009 16:30:56 -0700

On Thu, Sep 17, 2009 at 6:47 PM, Patrick Aboyoun <[email protected]> wrote:


> Michael,
> Before people start developing their own bed file parsers, it looks pretty
> straight-forward to incorporate this operation in rtracklayer now. You or I
> could add a scoreRanges argument, which takes either a logical or perhaps a
> character string if we envision there being more than one way to score a set
> of ranges, to the import* functions and when a appropriate value is
> supplied, it will internally block process the data file and output a
> RangedData object with a score column.
>
>
>
The bedfile specification already includes a mechanism for including a
score.  Also, there is not a reason to expect that bedfiles will be sorted
(at least in common practice), so block-processing would have to make the
assumption that the bedfile is not sorted.  In short, to do this right might
be difficult.

Sean


>
> Michael Lawrence wrote:
>
>> On Thu, Sep 17, 2009 at 2:17 PM, Ivan Gregoretti <[email protected]>
>> wrote:
>>
>>
>>
>>> Hi Michael,
>>>
>>> True; there are no scores in this BED file. That brings us to Sean's
>>> remark.
>>>
>>> You may have noticed that all features in the BED file are 36 bases
>>> long. Each feature is the position and orientation of a Solexa read.
>>>
>>> I want the value in each bin in the WIG file to represent the number
>>> of tags contained in that bin.
>>>
>>> So, coverage() is close to what I am looking for but I think that
>>> pileup() is perhaps better. Then comes the problem of how to place
>>> those values in an object exportable as WIG. Lets not talk about
>>> efficiency for now.
>>>
>>>
>>>
>>>
>> Ignoring efficiency, it's pretty straight-forward. In the development
>> version of IRanges, one can convert an Rle (or RleList for multiple
>> chromosomes) as output to coverage() to a RangedData. In the development
>> version of rtracklayer, this is done implicitly:
>>
>>
>>
>>> export(cov, "coverage.wig")
>>>
>>>
>>
>>
>>
>>
>>> I think that I can solve this problem myself but given that:
>>>
>>> 1) there is a package that can read BED and write WIG files
>>> and
>>> 2) BED to WIG conversion is such a common task
>>>
>>> I thought that there must have been an existing tool or perhaps an
>>> established efficient way of doing this with BioC tools. Why is this
>>> important? Well, this myRegions.bed contains 4.3 million records.
>>> Loading them with import() used about 1GB of RAM. My real world BED
>>> files are never less than 50 million reads, 150 more like.
>>>
>>> Conclusion: I DO care about an expert's opinion of BED-> WIG
>>> conversion within R.
>>>
>>>
>>>
>>
>> Maybe nobody does this with R. That is also acceptable. What tool do
>>
>>
>>> you use then?
>>>
>>>
>>>
>>>
>> The bottleneck seems to be loading and representing large BED files, since
>> after the reduction to coverage the data size is usually manageable.
>> rtracklayer was never designed to load BED files with millions of records.
>> In my experience, short reads are more commonly represented in more
>> efficient formats like those output by MAQ and bowtie, which are imported
>> with ShortRead.
>>
>> There is much room for optimization in rtracklayer, but for now one idea
>> is
>> to calculate coverage on subsets of the reads, and aggregate the results.
>> The BED file can be loaded with simple low-level calls like read.table and
>> scan.
>>
>> Michael
>>
>>
>>
>>
>>
>>> Thank you,
>>>
>>> Ivan
>>>
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>> National Institutes of Health
>>> 5 Memorial Dr, Building 5, Room 205.
>>> Bethesda, MD 20892. USA.
>>> Phone: 1-301-496-1592
>>> Fax: 1-301-496-9878
>>>
>>>
>>>
>>> On Thu, Sep 17, 2009 at 4:45 PM, Sean Davis <[email protected]> wrote:
>>>
>>>
>>>> On Thu, Sep 17, 2009 at 4:43 PM, Michael Lawrence <[email protected]>
>>>> wrote:
>>>>
>>>>
>>>>> On Thu, Sep 17, 2009 at 8:14 AM, Ivan Gregoretti <[email protected]>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hello everybody
>>>>>>
>>>>>> How do you convert BED formatted files to WIG files?
>>>>>>
>>>>>>
>>>>>> I tried to do that with rtracklayer but it didn't quite succeed.
>>>>>>
>>>>>> This is the session's transcript:
>>>>>>
>>>>>> First, you can download
>>>>>> http://dl.getdropbox.com/u/2051155/myRegions.bed, which looks like
>>>>>> this
>>>>>>
>>>>>> chr1    3002444 3002479                 +
>>>>>> chr1    3002989 3003024                 -
>>>>>> chr1    3017603 3017638                 +
>>>>>> chr1    3017879 3017914                 -
>>>>>> chr1    3018173 3018208                 +
>>>>>> chr1    3018183 3018218                 -
>>>>>> chr1    3018183 3018218                 -
>>>>>> chr1    3019065 3019100                 +
>>>>>> chr1    3019761 3019796                 -
>>>>>> chr1    3020044 3020079                 -
>>>>>> ...
>>>>>>
>>>>>> Now to R
>>>>>>
>>>>>> suppressMessages(library(rtracklayer))
>>>>>> myRegions <- import('myRegions.bed')
>>>>>>
>>>>>> So far so good. Now I try:
>>>>>> export(myRegions, 'myRegions.wig', format = 'wig')
>>>>>>
>>>>>> but I get:
>>>>>> Error in export.ucsc(object, con, subformat, ...) :
>>>>>>  Track not compatible with WIG format: Overlapping features must be
>>>>>> on separate strands and every feature width must be positive
>>>>>>
>>>>>> I seems that the error message is a feature rather than a bug. My
>>>>>> interpretation is that export() does not like records like lines 6 and
>>>>>> 7:
>>>>>> chr1    3018183 3018218                 -
>>>>>> chr1    3018183 3018218                 -
>>>>>>
>>>>>> So, how do you convert BED to WIG in your everyday work?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Well, as rtracklayer said, this track is not representable by WIG,
>>>>> which
>>>>> is
>>>>> meant to communicate a single data value for a given genomic region (to
>>>>> generate the bar plots in the UCSC browser, for instance). There aren't
>>>>> any
>>>>> scores in your file, so why do you want to use WIG?
>>>>>
>>>>>
>>>>>
>>>> Just guessing, here, but you may want to calculate coverage() first,
>>>>
>>>>
>>> Ivan?
>>>
>>>
>>>> Sean
>>>>
>>>>
>>>>
>>>>> Thank you,
>>>>>>
>>>>>> Ivan
>>>>>>
>>>>>>
>>>>>>
>>>>>>> sessionInfo()
>>>>>>>
>>>>>>>
>>>>>> R version 2.9.2 (2009-08-24)
>>>>>> x86_64-redhat-linux-gnu
>>>>>>
>>>>>> locale:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>>
>>>
>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] rtracklayer_1.4.0 RCurl_0.94-1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] Biobase_2.4.0     Biostrings_2.12.0 BSgenome_1.12.0
>>>>>>
>>>>>>
>>>>> IRanges_1.2.0
>>>
>>>
>>>> [5] tools_2.9.2       XML_2.3-0
>>>>>>
>>>>>> Ivan Gregoretti, PhD
>>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>>> National Institutes of Health
>>>>>> 5 Memorial Dr, Building 5, Room 205.
>>>>>> Bethesda, MD 20892. USA.
>>>>>> Phone: 1-301-496-1592
>>>>>> Fax: 1-301-496-9878
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-sig-sequencing mailing list
>>>>>> [email protected]
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>
>>>>>>
>>>>>>
>>>>>       [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-sig-sequencing mailing list
>>>>> [email protected]
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>
>>>>>
>>>>
>>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> [email protected]
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>>
>>
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] BED to WIG format conversion

Reply via email to