Re: [Bioc-sig-seq] BED to WIG format conversion

Michael Lawrence Thu, 17 Sep 2009 23:57:07 -0700

On Thu, Sep 17, 2009 at 10:21 PM, Patrick Aboyoun <[email protected]>wrote:


> Michael,
> This import operation has similarities with the rdapply framework. We
> should have the two types of operations sharing similar components such as
> filtering.
>
>
I've been meaning to port the rdapply filtering over to the ShortRead filter
framework (which would need to be brought down into IRanges or something).


> To everyone,
> What other data reduction operations would you like to have on bed file
> import?
>
>
> Patrick
>
>
> Michael Lawrence wrote:
>
>>
>>
>> On Thu, Sep 17, 2009 at 4:35 PM, Patrick Aboyoun <[email protected]<mailto:
>> [email protected]>> wrote:
>>
>>    Michael,
>>    We could add that level of flexibility, but I worry that may open
>>    the door too widely because the aggregator will need to take the
>>    old RangedData object and the new block RangedData object as
>>    arguments and produce a new RangedData object for the next
>>    iteration and users may find it difficult to construct a valid
>>    aggregator.
>>
>>
>> Well just to clarify, I thought the aggregator would take the old and new
>> objects as output by the block processor. So for coverage, one function
>> would take a RangedData and return a RleList, and the aggregator would take
>> two RleLists and add them.
>>
>> The call for calculating the coverage would be like:
>>
>> cov <- import("reads.bed", blockSize = 500000, blockFun = coverage,
>> aggregateFun = `+`)
>>
>> Of course, high-level functionality would be nice for routine operations.
>> That will help prevent the user from making mistakes. The blockFun and
>> aggregateFun could be bound into a single class BlockHandler.
>>
>> cov <- import("reads.bed", blockSize = 500000, blockHandler =
>> CoverageHandler())
>>
>>    I think there may also be a need for filters to allow the user to
>>    keep only those rows that satisfy their criteria. We could supply
>>    chromosome, strand, score, etc. filters to meet this need.
>>
>>
>> Yes, and the Filter class could inherit from BlockHandler, so that it
>> works incrementally. import() could be smart enough to incorporate both a
>> filter and a user block handler into a single "meta" handler.
>>
>> So to obtain an RleList of coverage, but only over chromosome 1,
>> processing half a million reads at a time:
>>
>> cov <- import("reads.bed", blockSize = 500000, blockHandler =
>> CoverageHandler(), filter = ChromosomeFilter("chr1"))
>>
>> Michael
>>
>>
>>    Patrick
>>
>>
>>    Michael Lawrence wrote:
>>
>>        I don't know if we want to add explicit support for scoring,
>>        when there are many reasons why one might want to
>>        incrementally load a track. One solution would be a design
>>        based on callbacks. The user specifies the block size and a
>>        callback to be invoked for each block. The result could be a
>>        list, with each element holding the result for a block. The
>>        user could then aggregate that. To be even more efficient, we
>>        could support a second callback that aggregates at each step.
>>
>>        On Thu, Sep 17, 2009 at 3:47 PM, Patrick Aboyoun
>>        <[email protected] <mailto:[email protected]>
>>        <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>
>>           Michael,
>>           Before people start developing their own bed file parsers, it
>>           looks pretty straight-forward to incorporate this operation in
>>           rtracklayer now. You or I could add a scoreRanges argument,
>>        which
>>           takes either a logical or perhaps a character string if we
>>           envision there being more than one way to score a set of
>>        ranges,
>>           to the import* functions and when a appropriate value is
>>        supplied,
>>           it will internally block process the data file and output a
>>           RangedData object with a score column.
>>
>>
>>           Patrick
>>
>>
>>
>>           Michael Lawrence wrote:
>>
>>               On Thu, Sep 17, 2009 at 2:17 PM, Ivan Gregoretti
>>               <[email protected] <mailto:[email protected]>
>>        <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>
>>                                 Hi Michael,
>>
>>                   True; there are no scores in this BED file. That
>>        brings us
>>                   to Sean's
>>                   remark.
>>
>>                   You may have noticed that all features in the BED
>>        file are
>>                   36 bases
>>                   long. Each feature is the position and orientation of a
>>                   Solexa read.
>>
>>                   I want the value in each bin in the WIG file to
>>        represent
>>                   the number
>>                   of tags contained in that bin.
>>
>>                   So, coverage() is close to what I am looking for but I
>>                   think that
>>                   pileup() is perhaps better. Then comes the problem
>>        of how
>>                   to place
>>                   those values in an object exportable as WIG. Lets
>>        not talk
>>                   about
>>                   efficiency for now.
>>
>>
>>                                   Ignoring efficiency, it's pretty
>> straight-forward. In the
>>               development
>>               version of IRanges, one can convert an Rle (or RleList for
>>               multiple
>>               chromosomes) as output to coverage() to a RangedData.
>>        In the
>>               development
>>               version of rtracklayer, this is done implicitly:
>>
>>                                 export(cov, "coverage.wig")
>>
>>
>>                                 I think that I can solve this problem
>> myself but
>>        given that:
>>
>>                   1) there is a package that can read BED and write
>>        WIG files
>>                   and
>>                   2) BED to WIG conversion is such a common task
>>
>>                   I thought that there must have been an existing tool or
>>                   perhaps an
>>                   established efficient way of doing this with BioC
>>        tools.
>>                   Why is this
>>                   important? Well, this myRegions.bed contains 4.3
>>        million
>>                   records.
>>                   Loading them with import() used about 1GB of RAM.
>>        My real
>>                   world BED
>>                   files are never less than 50 million reads, 150
>>        more like.
>>
>>                   Conclusion: I DO care about an expert's opinion of
>>        BED-> WIG
>>                   conversion within R.
>>
>>
>>               Maybe nobody does this with R. That is also acceptable.
>>        What
>>               tool do
>>                                 you use then?
>>
>>
>>                                   The bottleneck seems to be loading and
>> representing
>>        large BED
>>               files, since
>>               after the reduction to coverage the data size is usually
>>               manageable.
>>               rtracklayer was never designed to load BED files with
>>        millions
>>               of records.
>>               In my experience, short reads are more commonly
>>        represented in
>>               more
>>               efficient formats like those output by MAQ and bowtie,
>>        which
>>               are imported
>>               with ShortRead.
>>
>>               There is much room for optimization in rtracklayer, but for
>>               now one idea is
>>               to calculate coverage on subsets of the reads, and
>>        aggregate
>>               the results.
>>               The BED file can be loaded with simple low-level calls like
>>               read.table and
>>               scan.
>>
>>               Michael
>>
>>
>>
>>                                 Thank you,
>>
>>                   Ivan
>>
>>
>>                   Ivan Gregoretti, PhD
>>                   National Institute of Diabetes and Digestive and Kidney
>>                   Diseases
>>                   National Institutes of Health
>>                   5 Memorial Dr, Building 5, Room 205.
>>                   Bethesda, MD 20892. USA.
>>                   Phone: 1-301-496-1592
>>                   Fax: 1-301-496-9878
>>
>>
>>
>>                   On Thu, Sep 17, 2009 at 4:45 PM, Sean Davis
>>                   <[email protected] <mailto:[email protected]>
>>        <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>                                           On Thu, Sep 17, 2009 at 4:43 PM,
>> Michael Lawrence
>>                       <[email protected] <mailto:[email protected]>
>>        <mailto:[email protected] <mailto:[email protected]>>>
>>
>>                       wrote:
>>                                                     On Thu, Sep 17, 2009
>> at 8:14 AM, Ivan
>>        Gregoretti
>>                           <[email protected]
>>        <mailto:[email protected]> <mailto:[email protected]
>>
>>        <mailto:[email protected]>>>
>>
>>                           wrote:
>>
>>                                                               Hello
>> everybody
>>
>>                               How do you convert BED formatted files
>>        to WIG
>>                               files?
>>
>>
>>                               I tried to do that with rtracklayer but it
>>                               didn't quite succeed.
>>
>>                               This is the session's transcript:
>>
>>                               First, you can download
>>
>> http://dl.getdropbox.com/u/2051155/myRegions.bed,
>>                               which looks like
>>                               this
>>
>>                               chr1    3002444 3002479                 +
>>                               chr1    3002989 3003024                 -
>>                               chr1    3017603 3017638                 +
>>                               chr1    3017879 3017914                 -
>>                               chr1    3018173 3018208                 +
>>                               chr1    3018183 3018218                 -
>>                               chr1    3018183 3018218                 -
>>                               chr1    3019065 3019100                 +
>>                               chr1    3019761 3019796                 -
>>                               chr1    3020044 3020079                 -
>>                               ...
>>
>>                               Now to R
>>
>>                               suppressMessages(library(rtracklayer))
>>                               myRegions <- import('myRegions.bed')
>>
>>                               So far so good. Now I try:
>>                               export(myRegions, 'myRegions.wig',
>>        format = 'wig')
>>
>>                               but I get:
>>                               Error in export.ucsc(object, con,
>>        subformat,
>>                               ...) :
>>                                Track not compatible with WIG format:
>>                               Overlapping features must be
>>                               on separate strands and every feature width
>>                               must be positive
>>
>>                               I seems that the error message is a feature
>>                               rather than a bug. My
>>                               interpretation is that export() does
>>        not like
>>                               records like lines 6 and
>>                               7:
>>                               chr1    3018183 3018218                 -
>>                               chr1    3018183 3018218                 -
>>
>>                               So, how do you convert BED to WIG in your
>>                               everyday work?
>>
>>
>>                                                                 Well, as
>> rtracklayer said, this track is not
>>                           representable by WIG, which
>>                           is
>>                           meant to communicate a single data value for a
>>                           given genomic region (to
>>                           generate the bar plots in the UCSC browser, for
>>                           instance). There aren't
>>                           any
>>                           scores in your file, so why do you want to
>>        use WIG?
>>
>>                                                       Just guessing, here,
>> but you may want to calculate
>>                       coverage() first,
>>                                             Ivan?
>>                                           Sean
>>
>>                                                         Thank you,
>>
>>                               Ivan
>>
>>
>> sessionInfo()
>>
>> R version 2.9.2 (2009-08-24)
>>                               x86_64-redhat-linux-gnu
>>
>>                               locale:
>>
>>
>>
>>
>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>                                                   attached base packages:
>>                               [1] stats     graphics  grDevices utils
>>                                   datasets  methods   base
>>
>>                               other attached packages:
>>                               [1] rtracklayer_1.4.0 RCurl_0.94-1
>>
>>                               loaded via a namespace (and not attached):
>>                               [1] Biobase_2.4.0     Biostrings_2.12.0
>>                               BSgenome_1.12.0
>>                                                         IRanges_1.2.0
>>                                                   [5] tools_2.9.2
>> XML_2.3-0
>>
>>                               Ivan Gregoretti, PhD
>>                               National Institute of Diabetes and
>>        Digestive
>>                               and Kidney Diseases
>>                               National Institutes of Health
>>                               5 Memorial Dr, Building 5, Room 205.
>>                               Bethesda, MD 20892. USA.
>>                               Phone: 1-301-496-1592
>>                               Fax: 1-301-496-9878
>>
>>
>> _______________________________________________
>>                               Bioc-sig-sequencing mailing list
>>                               [email protected]
>>        <mailto:[email protected]>
>>                                     <mailto:
>> [email protected]
>>        <mailto:[email protected]>>
>>
>>
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>> [[alternative HTML version deleted]]
>>
>>                           _______________________________________________
>>                           Bioc-sig-sequencing mailing list
>>                           [email protected]
>>        <mailto:[email protected]>
>>                           <mailto:[email protected]
>>        <mailto:[email protected]>>
>>
>>
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>   _______________________________________________
>>                   Bioc-sig-sequencing mailing list
>>                   [email protected]
>>        <mailto:[email protected]>
>>                   <mailto:[email protected]
>>        <mailto:[email protected]>>
>>
>>
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>>                      [[alternative HTML version deleted]]
>>
>>               _______________________________________________
>>               Bioc-sig-sequencing mailing list
>>               [email protected]
>>        <mailto:[email protected]>
>>               <mailto:[email protected]
>>        <mailto:[email protected]>>
>>
>>               https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>>
>>
>>
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] BED to WIG format conversion

Reply via email to