Re: [Bioc-sig-seq] BED to WIG format conversion

Michael Lawrence Thu, 17 Sep 2009 16:51:26 -0700

On Thu, Sep 17, 2009 at 4:35 PM, Patrick Aboyoun <[email protected]> wrote:


> Michael,
> We could add that level of flexibility, but I worry that may open the door
> too widely because the aggregator will need to take the old RangedData
> object and the new block RangedData object as arguments and produce a new
> RangedData object for the next iteration and users may find it difficult to
> construct a valid aggregator.
>
>
Well just to clarify, I thought the aggregator would take the old and new
objects as output by the block processor. So for coverage, one function
would take a RangedData and return a RleList, and the aggregator would take
two RleLists and add them.

The call for calculating the coverage would be like:

cov <- import("reads.bed", blockSize = 500000, blockFun = coverage,
aggregateFun = `+`)

Of course, high-level functionality would be nice for routine operations.
That will help prevent the user from making mistakes. The blockFun and
aggregateFun could be bound into a single class BlockHandler.

cov <- import("reads.bed", blockSize = 500000, blockHandler =
CoverageHandler())


> I think there may also be a need for filters to allow the user to keep only
> those rows that satisfy their criteria. We could supply chromosome, strand,
> score, etc. filters to meet this need.
>
>
Yes, and the Filter class could inherit from BlockHandler, so that it works
incrementally. import() could be smart enough to incorporate both a filter
and a user block handler into a single "meta" handler.

So to obtain an RleList of coverage, but only over chromosome 1, processing
half a million reads at a time:

cov <- import("reads.bed", blockSize = 500000, blockHandler =
CoverageHandler(), filter = ChromosomeFilter("chr1"))

Michael


> Patrick
>
>
> Michael Lawrence wrote:
>
>> I don't know if we want to add explicit support for scoring, when there
>> are many reasons why one might want to incrementally load a track. One
>> solution would be a design based on callbacks. The user specifies the block
>> size and a callback to be invoked for each block. The result could be a
>> list, with each element holding the result for a block. The user could then
>> aggregate that. To be even more efficient, we could support a second
>> callback that aggregates at each step.
>>
>> On Thu, Sep 17, 2009 at 3:47 PM, Patrick Aboyoun <[email protected]<mailto:
>> [email protected]>> wrote:
>>
>>    Michael,
>>    Before people start developing their own bed file parsers, it
>>    looks pretty straight-forward to incorporate this operation in
>>    rtracklayer now. You or I could add a scoreRanges argument, which
>>    takes either a logical or perhaps a character string if we
>>    envision there being more than one way to score a set of ranges,
>>    to the import* functions and when a appropriate value is supplied,
>>    it will internally block process the data file and output a
>>    RangedData object with a score column.
>>
>>
>>    Patrick
>>
>>
>>
>>    Michael Lawrence wrote:
>>
>>        On Thu, Sep 17, 2009 at 2:17 PM, Ivan Gregoretti
>>        <[email protected] <mailto:[email protected]>> wrote:
>>
>>
>>            Hi Michael,
>>
>>            True; there are no scores in this BED file. That brings us
>>            to Sean's
>>            remark.
>>
>>            You may have noticed that all features in the BED file are
>>            36 bases
>>            long. Each feature is the position and orientation of a
>>            Solexa read.
>>
>>            I want the value in each bin in the WIG file to represent
>>            the number
>>            of tags contained in that bin.
>>
>>            So, coverage() is close to what I am looking for but I
>>            think that
>>            pileup() is perhaps better. Then comes the problem of how
>>            to place
>>            those values in an object exportable as WIG. Lets not talk
>>            about
>>            efficiency for now.
>>
>>
>>
>>        Ignoring efficiency, it's pretty straight-forward. In the
>>        development
>>        version of IRanges, one can convert an Rle (or RleList for
>>        multiple
>>        chromosomes) as output to coverage() to a RangedData. In the
>>        development
>>        version of rtracklayer, this is done implicitly:
>>
>>
>>            export(cov, "coverage.wig")
>>
>>
>>
>>
>>            I think that I can solve this problem myself but given that:
>>
>>            1) there is a package that can read BED and write WIG files
>>            and
>>            2) BED to WIG conversion is such a common task
>>
>>            I thought that there must have been an existing tool or
>>            perhaps an
>>            established efficient way of doing this with BioC tools.
>>            Why is this
>>            important? Well, this myRegions.bed contains 4.3 million
>>            records.
>>            Loading them with import() used about 1GB of RAM. My real
>>            world BED
>>            files are never less than 50 million reads, 150 more like.
>>
>>            Conclusion: I DO care about an expert's opinion of BED-> WIG
>>            conversion within R.
>>
>>
>>
>>        Maybe nobody does this with R. That is also acceptable. What
>>        tool do
>>
>>            you use then?
>>
>>
>>
>>        The bottleneck seems to be loading and representing large BED
>>        files, since
>>        after the reduction to coverage the data size is usually
>>        manageable.
>>        rtracklayer was never designed to load BED files with millions
>>        of records.
>>        In my experience, short reads are more commonly represented in
>>        more
>>        efficient formats like those output by MAQ and bowtie, which
>>        are imported
>>        with ShortRead.
>>
>>        There is much room for optimization in rtracklayer, but for
>>        now one idea is
>>        to calculate coverage on subsets of the reads, and aggregate
>>        the results.
>>        The BED file can be loaded with simple low-level calls like
>>        read.table and
>>        scan.
>>
>>        Michael
>>
>>
>>
>>
>>            Thank you,
>>
>>            Ivan
>>
>>
>>            Ivan Gregoretti, PhD
>>            National Institute of Diabetes and Digestive and Kidney
>>            Diseases
>>            National Institutes of Health
>>            5 Memorial Dr, Building 5, Room 205.
>>            Bethesda, MD 20892. USA.
>>            Phone: 1-301-496-1592
>>            Fax: 1-301-496-9878
>>
>>
>>
>>            On Thu, Sep 17, 2009 at 4:45 PM, Sean Davis
>>            <[email protected] <mailto:[email protected]>> wrote:
>>
>>                On Thu, Sep 17, 2009 at 4:43 PM, Michael Lawrence
>>                <[email protected] <mailto:[email protected]>>
>>                wrote:
>>
>>                    On Thu, Sep 17, 2009 at 8:14 AM, Ivan Gregoretti
>>                    <[email protected] <mailto:[email protected]>>
>>
>>                    wrote:
>>
>>
>>                        Hello everybody
>>
>>                        How do you convert BED formatted files to WIG
>>                        files?
>>
>>
>>                        I tried to do that with rtracklayer but it
>>                        didn't quite succeed.
>>
>>                        This is the session's transcript:
>>
>>                        First, you can download
>>                        http://dl.getdropbox.com/u/2051155/myRegions.bed,
>>                        which looks like
>>                        this
>>
>>                        chr1    3002444 3002479                 +
>>                        chr1    3002989 3003024                 -
>>                        chr1    3017603 3017638                 +
>>                        chr1    3017879 3017914                 -
>>                        chr1    3018173 3018208                 +
>>                        chr1    3018183 3018218                 -
>>                        chr1    3018183 3018218                 -
>>                        chr1    3019065 3019100                 +
>>                        chr1    3019761 3019796                 -
>>                        chr1    3020044 3020079                 -
>>                        ...
>>
>>                        Now to R
>>
>>                        suppressMessages(library(rtracklayer))
>>                        myRegions <- import('myRegions.bed')
>>
>>                        So far so good. Now I try:
>>                        export(myRegions, 'myRegions.wig', format = 'wig')
>>
>>                        but I get:
>>                        Error in export.ucsc(object, con, subformat,
>>                        ...) :
>>                         Track not compatible with WIG format:
>>                        Overlapping features must be
>>                        on separate strands and every feature width
>>                        must be positive
>>
>>                        I seems that the error message is a feature
>>                        rather than a bug. My
>>                        interpretation is that export() does not like
>>                        records like lines 6 and
>>                        7:
>>                        chr1    3018183 3018218                 -
>>                        chr1    3018183 3018218                 -
>>
>>                        So, how do you convert BED to WIG in your
>>                        everyday work?
>>
>>
>>
>>                    Well, as rtracklayer said, this track is not
>>                    representable by WIG, which
>>                    is
>>                    meant to communicate a single data value for a
>>                    given genomic region (to
>>                    generate the bar plots in the UCSC browser, for
>>                    instance). There aren't
>>                    any
>>                    scores in your file, so why do you want to use WIG?
>>
>>
>>                Just guessing, here, but you may want to calculate
>>                coverage() first,
>>
>>            Ivan?
>>
>>                Sean
>>
>>
>>                        Thank you,
>>
>>                        Ivan
>>
>>
>>                            sessionInfo()
>>
>>                        R version 2.9.2 (2009-08-24)
>>                        x86_64-redhat-linux-gnu
>>
>>                        locale:
>>
>>
>>
>>
>>
>>  
>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>
>>                        attached base packages:
>>                        [1] stats     graphics  grDevices utils
>>                datasets  methods   base
>>
>>                        other attached packages:
>>                        [1] rtracklayer_1.4.0 RCurl_0.94-1
>>
>>                        loaded via a namespace (and not attached):
>>                        [1] Biobase_2.4.0     Biostrings_2.12.0
>>                        BSgenome_1.12.0
>>
>>            IRanges_1.2.0
>>
>>                        [5] tools_2.9.2       XML_2.3-0
>>
>>                        Ivan Gregoretti, PhD
>>                        National Institute of Diabetes and Digestive
>>                        and Kidney Diseases
>>                        National Institutes of Health
>>                        5 Memorial Dr, Building 5, Room 205.
>>                        Bethesda, MD 20892. USA.
>>                        Phone: 1-301-496-1592
>>                        Fax: 1-301-496-9878
>>
>>                        _______________________________________________
>>                        Bioc-sig-sequencing mailing list
>>                        [email protected]
>>                        <mailto:[email protected]>
>>
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>>                          [[alternative HTML version deleted]]
>>
>>                    _______________________________________________
>>                    Bioc-sig-sequencing mailing list
>>                    [email protected]
>>                    <mailto:[email protected]>
>>
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>>            _______________________________________________
>>            Bioc-sig-sequencing mailing list
>>            [email protected]
>>            <mailto:[email protected]>
>>            https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>>
>>               [[alternative HTML version deleted]]
>>
>>        _______________________________________________
>>        Bioc-sig-sequencing mailing list
>>        [email protected]
>>        <mailto:[email protected]>
>>        https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>>
>>
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] BED to WIG format conversion

Reply via email to