Michael,
We could add that level of flexibility, but I worry that may open the door too widely because the aggregator will need to take the old RangedData object and the new block RangedData object as arguments and produce a new RangedData object for the next iteration and users may find it difficult to construct a valid aggregator.

I think there may also be a need for filters to allow the user to keep only those rows that satisfy their criteria. We could supply chromosome, strand, score, etc. filters to meet this need.


Patrick


Michael Lawrence wrote:
I don't know if we want to add explicit support for scoring, when there are many reasons why one might want to incrementally load a track. One solution would be a design based on callbacks. The user specifies the block size and a callback to be invoked for each block. The result could be a list, with each element holding the result for a block. The user could then aggregate that. To be even more efficient, we could support a second callback that aggregates at each step.

On Thu, Sep 17, 2009 at 3:47 PM, Patrick Aboyoun <[email protected] <mailto:[email protected]>> wrote:

    Michael,
    Before people start developing their own bed file parsers, it
    looks pretty straight-forward to incorporate this operation in
    rtracklayer now. You or I could add a scoreRanges argument, which
    takes either a logical or perhaps a character string if we
    envision there being more than one way to score a set of ranges,
    to the import* functions and when a appropriate value is supplied,
    it will internally block process the data file and output a
    RangedData object with a score column.


    Patrick



    Michael Lawrence wrote:

        On Thu, Sep 17, 2009 at 2:17 PM, Ivan Gregoretti
        <[email protected] <mailto:[email protected]>> wrote:

            Hi Michael,

            True; there are no scores in this BED file. That brings us
            to Sean's
            remark.

            You may have noticed that all features in the BED file are
            36 bases
            long. Each feature is the position and orientation of a
            Solexa read.

            I want the value in each bin in the WIG file to represent
            the number
            of tags contained in that bin.

            So, coverage() is close to what I am looking for but I
            think that
            pileup() is perhaps better. Then comes the problem of how
            to place
            those values in an object exportable as WIG. Lets not talk
            about
            efficiency for now.


        Ignoring efficiency, it's pretty straight-forward. In the
        development
        version of IRanges, one can convert an Rle (or RleList for
        multiple
        chromosomes) as output to coverage() to a RangedData. In the
        development
        version of rtracklayer, this is done implicitly:

            export(cov, "coverage.wig")


            I think that I can solve this problem myself but given that:

            1) there is a package that can read BED and write WIG files
            and
            2) BED to WIG conversion is such a common task

            I thought that there must have been an existing tool or
            perhaps an
            established efficient way of doing this with BioC tools.
            Why is this
            important? Well, this myRegions.bed contains 4.3 million
            records.
            Loading them with import() used about 1GB of RAM. My real
            world BED
            files are never less than 50 million reads, 150 more like.

            Conclusion: I DO care about an expert's opinion of BED-> WIG
            conversion within R.


        Maybe nobody does this with R. That is also acceptable. What
        tool do
            you use then?


        The bottleneck seems to be loading and representing large BED
        files, since
        after the reduction to coverage the data size is usually
        manageable.
        rtracklayer was never designed to load BED files with millions
        of records.
        In my experience, short reads are more commonly represented in
        more
        efficient formats like those output by MAQ and bowtie, which
        are imported
        with ShortRead.

        There is much room for optimization in rtracklayer, but for
        now one idea is
        to calculate coverage on subsets of the reads, and aggregate
        the results.
        The BED file can be loaded with simple low-level calls like
        read.table and
        scan.

        Michael



            Thank you,

            Ivan


            Ivan Gregoretti, PhD
            National Institute of Diabetes and Digestive and Kidney
            Diseases
            National Institutes of Health
            5 Memorial Dr, Building 5, Room 205.
            Bethesda, MD 20892. USA.
            Phone: 1-301-496-1592
            Fax: 1-301-496-9878



            On Thu, Sep 17, 2009 at 4:45 PM, Sean Davis
            <[email protected] <mailto:[email protected]>> wrote:
                On Thu, Sep 17, 2009 at 4:43 PM, Michael Lawrence
                <[email protected] <mailto:[email protected]>>
                wrote:
                    On Thu, Sep 17, 2009 at 8:14 AM, Ivan Gregoretti
                    <[email protected] <mailto:[email protected]>>
                    wrote:

                        Hello everybody

                        How do you convert BED formatted files to WIG
                        files?


                        I tried to do that with rtracklayer but it
                        didn't quite succeed.

                        This is the session's transcript:

                        First, you can download
                        http://dl.getdropbox.com/u/2051155/myRegions.bed,
                        which looks like
                        this

                        chr1    3002444 3002479                 +
                        chr1    3002989 3003024                 -
                        chr1    3017603 3017638                 +
                        chr1    3017879 3017914                 -
                        chr1    3018173 3018208                 +
                        chr1    3018183 3018218                 -
                        chr1    3018183 3018218                 -
                        chr1    3019065 3019100                 +
                        chr1    3019761 3019796                 -
                        chr1    3020044 3020079                 -
                        ...

                        Now to R

                        suppressMessages(library(rtracklayer))
                        myRegions <- import('myRegions.bed')

                        So far so good. Now I try:
                        export(myRegions, 'myRegions.wig', format = 'wig')

                        but I get:
                        Error in export.ucsc(object, con, subformat,
                        ...) :
                         Track not compatible with WIG format:
                        Overlapping features must be
                        on separate strands and every feature width
                        must be positive

                        I seems that the error message is a feature
                        rather than a bug. My
                        interpretation is that export() does not like
                        records like lines 6 and
                        7:
                        chr1    3018183 3018218                 -
                        chr1    3018183 3018218                 -

                        So, how do you convert BED to WIG in your
                        everyday work?


                    Well, as rtracklayer said, this track is not
                    representable by WIG, which
                    is
                    meant to communicate a single data value for a
                    given genomic region (to
                    generate the bar plots in the UCSC browser, for
                    instance). There aren't
                    any
                    scores in your file, so why do you want to use WIG?

                Just guessing, here, but you may want to calculate
                coverage() first,
            Ivan?
                Sean

                        Thank you,

                        Ivan

                            sessionInfo()
                        R version 2.9.2 (2009-08-24)
                        x86_64-redhat-linux-gnu

                        locale:



            
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
                        attached base packages:
[1] stats graphics grDevices utils datasets methods base

                        other attached packages:
                        [1] rtracklayer_1.4.0 RCurl_0.94-1

                        loaded via a namespace (and not attached):
                        [1] Biobase_2.4.0     Biostrings_2.12.0
                        BSgenome_1.12.0
            IRanges_1.2.0
                        [5] tools_2.9.2       XML_2.3-0

                        Ivan Gregoretti, PhD
                        National Institute of Diabetes and Digestive
                        and Kidney Diseases
                        National Institutes of Health
                        5 Memorial Dr, Building 5, Room 205.
                        Bethesda, MD 20892. USA.
                        Phone: 1-301-496-1592
                        Fax: 1-301-496-9878

                        _______________________________________________
                        Bioc-sig-sequencing mailing list
                        [email protected]
                        <mailto:[email protected]>
                        
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

                          [[alternative HTML version deleted]]

                    _______________________________________________
                    Bioc-sig-sequencing mailing list
                    [email protected]
                    <mailto:[email protected]>
                    https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
            _______________________________________________
            Bioc-sig-sequencing mailing list
            [email protected]
            <mailto:[email protected]>
            https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


               [[alternative HTML version deleted]]

        _______________________________________________
        Bioc-sig-sequencing mailing list
        [email protected]
        <mailto:[email protected]>
        https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to