Michael,
This import operation has similarities with the rdapply framework. We should have the two types of operations sharing similar components such as filtering.

To everyone,
What other data reduction operations would you like to have on bed file import?


Patrick


Michael Lawrence wrote:


On Thu, Sep 17, 2009 at 4:35 PM, Patrick Aboyoun <[email protected] <mailto:[email protected]>> wrote:

    Michael,
    We could add that level of flexibility, but I worry that may open
    the door too widely because the aggregator will need to take the
    old RangedData object and the new block RangedData object as
    arguments and produce a new RangedData object for the next
    iteration and users may find it difficult to construct a valid
    aggregator.


Well just to clarify, I thought the aggregator would take the old and new objects as output by the block processor. So for coverage, one function would take a RangedData and return a RleList, and the aggregator would take two RleLists and add them.

The call for calculating the coverage would be like:

cov <- import("reads.bed", blockSize = 500000, blockFun = coverage, aggregateFun = `+`)

Of course, high-level functionality would be nice for routine operations. That will help prevent the user from making mistakes. The blockFun and aggregateFun could be bound into a single class BlockHandler.

cov <- import("reads.bed", blockSize = 500000, blockHandler = CoverageHandler())
    I think there may also be a need for filters to allow the user to
    keep only those rows that satisfy their criteria. We could supply
    chromosome, strand, score, etc. filters to meet this need.


Yes, and the Filter class could inherit from BlockHandler, so that it works incrementally. import() could be smart enough to incorporate both a filter and a user block handler into a single "meta" handler.

So to obtain an RleList of coverage, but only over chromosome 1, processing half a million reads at a time:

cov <- import("reads.bed", blockSize = 500000, blockHandler = CoverageHandler(), filter = ChromosomeFilter("chr1"))

Michael


    Patrick


    Michael Lawrence wrote:

        I don't know if we want to add explicit support for scoring,
        when there are many reasons why one might want to
        incrementally load a track. One solution would be a design
        based on callbacks. The user specifies the block size and a
        callback to be invoked for each block. The result could be a
        list, with each element holding the result for a block. The
        user could then aggregate that. To be even more efficient, we
        could support a second callback that aggregates at each step.

        On Thu, Sep 17, 2009 at 3:47 PM, Patrick Aboyoun
        <[email protected] <mailto:[email protected]>
        <mailto:[email protected] <mailto:[email protected]>>> wrote:

           Michael,
           Before people start developing their own bed file parsers, it
           looks pretty straight-forward to incorporate this operation in
           rtracklayer now. You or I could add a scoreRanges argument,
        which
           takes either a logical or perhaps a character string if we
           envision there being more than one way to score a set of
        ranges,
           to the import* functions and when a appropriate value is
        supplied,
           it will internally block process the data file and output a
           RangedData object with a score column.


           Patrick



           Michael Lawrence wrote:

               On Thu, Sep 17, 2009 at 2:17 PM, Ivan Gregoretti
               <[email protected] <mailto:[email protected]>
        <mailto:[email protected] <mailto:[email protected]>>> wrote:

Hi Michael,

                   True; there are no scores in this BED file. That
        brings us
                   to Sean's
                   remark.

                   You may have noticed that all features in the BED
        file are
                   36 bases
                   long. Each feature is the position and orientation of a
                   Solexa read.

                   I want the value in each bin in the WIG file to
        represent
                   the number
                   of tags contained in that bin.

                   So, coverage() is close to what I am looking for but I
                   think that
                   pileup() is perhaps better. Then comes the problem
        of how
                   to place
                   those values in an object exportable as WIG. Lets
        not talk
                   about
                   efficiency for now.


Ignoring efficiency, it's pretty straight-forward. In the
               development
               version of IRanges, one can convert an Rle (or RleList for
               multiple
               chromosomes) as output to coverage() to a RangedData.
        In the
               development
               version of rtracklayer, this is done implicitly:

export(cov, "coverage.wig")

I think that I can solve this problem myself but
        given that:

                   1) there is a package that can read BED and write
        WIG files
                   and
                   2) BED to WIG conversion is such a common task

                   I thought that there must have been an existing tool or
                   perhaps an
                   established efficient way of doing this with BioC
        tools.
                   Why is this
                   important? Well, this myRegions.bed contains 4.3
        million
                   records.
                   Loading them with import() used about 1GB of RAM.
        My real
                   world BED
                   files are never less than 50 million reads, 150
        more like.

                   Conclusion: I DO care about an expert's opinion of
        BED-> WIG
                   conversion within R.

               Maybe nobody does this with R. That is also acceptable.
        What
               tool do
you use then?


The bottleneck seems to be loading and representing
        large BED
               files, since
               after the reduction to coverage the data size is usually
               manageable.
               rtracklayer was never designed to load BED files with
        millions
               of records.
               In my experience, short reads are more commonly
        represented in
               more
               efficient formats like those output by MAQ and bowtie,
        which
               are imported
               with ShortRead.

               There is much room for optimization in rtracklayer, but for
               now one idea is
               to calculate coverage on subsets of the reads, and
        aggregate
               the results.
               The BED file can be loaded with simple low-level calls like
               read.table and
               scan.

               Michael



Thank you,

                   Ivan


                   Ivan Gregoretti, PhD
                   National Institute of Diabetes and Digestive and Kidney
                   Diseases
                   National Institutes of Health
                   5 Memorial Dr, Building 5, Room 205.
                   Bethesda, MD 20892. USA.
                   Phone: 1-301-496-1592
                   Fax: 1-301-496-9878



                   On Thu, Sep 17, 2009 at 4:45 PM, Sean Davis
                   <[email protected] <mailto:[email protected]>
        <mailto:[email protected] <mailto:[email protected]>>> wrote:
On Thu, Sep 17, 2009 at 4:43 PM, Michael Lawrence
                       <[email protected] <mailto:[email protected]>
        <mailto:[email protected] <mailto:[email protected]>>>

                       wrote:
On Thu, Sep 17, 2009 at 8:14 AM, Ivan
        Gregoretti
                           <[email protected]
        <mailto:[email protected]> <mailto:[email protected]
        <mailto:[email protected]>>>

                           wrote:

Hello everybody

                               How do you convert BED formatted files
        to WIG
                               files?


                               I tried to do that with rtracklayer but it
                               didn't quite succeed.

                               This is the session's transcript:

                               First, you can download
http://dl.getdropbox.com/u/2051155/myRegions.bed,
                               which looks like
                               this

                               chr1    3002444 3002479                 +
                               chr1    3002989 3003024                 -
                               chr1    3017603 3017638                 +
                               chr1    3017879 3017914                 -
                               chr1    3018173 3018208                 +
                               chr1    3018183 3018218                 -
                               chr1    3018183 3018218                 -
                               chr1    3019065 3019100                 +
                               chr1    3019761 3019796                 -
                               chr1    3020044 3020079                 -
                               ...

                               Now to R

                               suppressMessages(library(rtracklayer))
                               myRegions <- import('myRegions.bed')

                               So far so good. Now I try:
                               export(myRegions, 'myRegions.wig',
        format = 'wig')

                               but I get:
                               Error in export.ucsc(object, con,
        subformat,
                               ...) :
                                Track not compatible with WIG format:
                               Overlapping features must be
                               on separate strands and every feature width
                               must be positive

                               I seems that the error message is a feature
                               rather than a bug. My
                               interpretation is that export() does
        not like
                               records like lines 6 and
                               7:
                               chr1    3018183 3018218                 -
                               chr1    3018183 3018218                 -

                               So, how do you convert BED to WIG in your
                               everyday work?


Well, as rtracklayer said, this track is not
                           representable by WIG, which
                           is
                           meant to communicate a single data value for a
                           given genomic region (to
                           generate the bar plots in the UCSC browser, for
                           instance). There aren't
                           any
                           scores in your file, so why do you want to
        use WIG?

Just guessing, here, but you may want to calculate
                       coverage() first,
Ivan? Sean

Thank you,

                               Ivan

sessionInfo() R version 2.9.2 (2009-08-24)
                               x86_64-redhat-linux-gnu

                               locale:



LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C attached base packages:
                               [1] stats     graphics  grDevices utils
                                   datasets  methods   base

                               other attached packages:
                               [1] rtracklayer_1.4.0 RCurl_0.94-1

                               loaded via a namespace (and not attached):
                               [1] Biobase_2.4.0     Biostrings_2.12.0
                               BSgenome_1.12.0
IRanges_1.2.0 [5] tools_2.9.2 XML_2.3-0

                               Ivan Gregoretti, PhD
                               National Institute of Diabetes and
        Digestive
                               and Kidney Diseases
                               National Institutes of Health
                               5 Memorial Dr, Building 5, Room 205.
                               Bethesda, MD 20892. USA.
                               Phone: 1-301-496-1592
                               Fax: 1-301-496-9878

_______________________________________________
                               Bioc-sig-sequencing mailing list
                               [email protected]
        <mailto:[email protected]>
<mailto:[email protected]
        <mailto:[email protected]>>

https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

[[alternative HTML version deleted]]

                           _______________________________________________
                           Bioc-sig-sequencing mailing list
                           [email protected]
        <mailto:[email protected]>
                           <mailto:[email protected]
        <mailto:[email protected]>>

https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing _______________________________________________
                   Bioc-sig-sequencing mailing list
                   [email protected]
        <mailto:[email protected]>
                   <mailto:[email protected]
        <mailto:[email protected]>>

https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

                      [[alternative HTML version deleted]]

               _______________________________________________
               Bioc-sig-sequencing mailing list
               [email protected]
        <mailto:[email protected]>
               <mailto:[email protected]
        <mailto:[email protected]>>

               https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing




_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to