On Thu, Sep 17, 2009 at 4:35 PM, Patrick Aboyoun <[email protected]> wrote:
> Michael,
> We could add that level of flexibility, but I worry that may open the door
> too widely because the aggregator will need to take the old RangedData
> object and the new block RangedData object as arguments and produce a new
> RangedData object for the next iteration and users may find it difficult to
> construct a valid aggregator.
>
>
Well just to clarify, I thought the aggregator would take the old and new
objects as output by the block processor. So for coverage, one function
would take a RangedData and return a RleList, and the aggregator would take
two RleLists and add them.
The call for calculating the coverage would be like:
cov <- import("reads.bed", blockSize = 500000, blockFun = coverage,
aggregateFun = `+`)
Of course, high-level functionality would be nice for routine operations.
That will help prevent the user from making mistakes. The blockFun and
aggregateFun could be bound into a single class BlockHandler.
cov <- import("reads.bed", blockSize = 500000, blockHandler =
CoverageHandler())
> I think there may also be a need for filters to allow the user to keep only
> those rows that satisfy their criteria. We could supply chromosome, strand,
> score, etc. filters to meet this need.
>
>
Yes, and the Filter class could inherit from BlockHandler, so that it works
incrementally. import() could be smart enough to incorporate both a filter
and a user block handler into a single "meta" handler.
So to obtain an RleList of coverage, but only over chromosome 1, processing
half a million reads at a time:
cov <- import("reads.bed", blockSize = 500000, blockHandler =
CoverageHandler(), filter = ChromosomeFilter("chr1"))
Michael
> Patrick
>
>
> Michael Lawrence wrote:
>
>> I don't know if we want to add explicit support for scoring, when there
>> are many reasons why one might want to incrementally load a track. One
>> solution would be a design based on callbacks. The user specifies the block
>> size and a callback to be invoked for each block. The result could be a
>> list, with each element holding the result for a block. The user could then
>> aggregate that. To be even more efficient, we could support a second
>> callback that aggregates at each step.
>>
>> On Thu, Sep 17, 2009 at 3:47 PM, Patrick Aboyoun <[email protected]<mailto:
>> [email protected]>> wrote:
>>
>> Michael,
>> Before people start developing their own bed file parsers, it
>> looks pretty straight-forward to incorporate this operation in
>> rtracklayer now. You or I could add a scoreRanges argument, which
>> takes either a logical or perhaps a character string if we
>> envision there being more than one way to score a set of ranges,
>> to the import* functions and when a appropriate value is supplied,
>> it will internally block process the data file and output a
>> RangedData object with a score column.
>>
>>
>> Patrick
>>
>>
>>
>> Michael Lawrence wrote:
>>
>> On Thu, Sep 17, 2009 at 2:17 PM, Ivan Gregoretti
>> <[email protected] <mailto:[email protected]>> wrote:
>>
>>
>> Hi Michael,
>>
>> True; there are no scores in this BED file. That brings us
>> to Sean's
>> remark.
>>
>> You may have noticed that all features in the BED file are
>> 36 bases
>> long. Each feature is the position and orientation of a
>> Solexa read.
>>
>> I want the value in each bin in the WIG file to represent
>> the number
>> of tags contained in that bin.
>>
>> So, coverage() is close to what I am looking for but I
>> think that
>> pileup() is perhaps better. Then comes the problem of how
>> to place
>> those values in an object exportable as WIG. Lets not talk
>> about
>> efficiency for now.
>>
>>
>>
>> Ignoring efficiency, it's pretty straight-forward. In the
>> development
>> version of IRanges, one can convert an Rle (or RleList for
>> multiple
>> chromosomes) as output to coverage() to a RangedData. In the
>> development
>> version of rtracklayer, this is done implicitly:
>>
>>
>> export(cov, "coverage.wig")
>>
>>
>>
>>
>> I think that I can solve this problem myself but given that:
>>
>> 1) there is a package that can read BED and write WIG files
>> and
>> 2) BED to WIG conversion is such a common task
>>
>> I thought that there must have been an existing tool or
>> perhaps an
>> established efficient way of doing this with BioC tools.
>> Why is this
>> important? Well, this myRegions.bed contains 4.3 million
>> records.
>> Loading them with import() used about 1GB of RAM. My real
>> world BED
>> files are never less than 50 million reads, 150 more like.
>>
>> Conclusion: I DO care about an expert's opinion of BED-> WIG
>> conversion within R.
>>
>>
>>
>> Maybe nobody does this with R. That is also acceptable. What
>> tool do
>>
>> you use then?
>>
>>
>>
>> The bottleneck seems to be loading and representing large BED
>> files, since
>> after the reduction to coverage the data size is usually
>> manageable.
>> rtracklayer was never designed to load BED files with millions
>> of records.
>> In my experience, short reads are more commonly represented in
>> more
>> efficient formats like those output by MAQ and bowtie, which
>> are imported
>> with ShortRead.
>>
>> There is much room for optimization in rtracklayer, but for
>> now one idea is
>> to calculate coverage on subsets of the reads, and aggregate
>> the results.
>> The BED file can be loaded with simple low-level calls like
>> read.table and
>> scan.
>>
>> Michael
>>
>>
>>
>>
>> Thank you,
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney
>> Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Thu, Sep 17, 2009 at 4:45 PM, Sean Davis
>> <[email protected] <mailto:[email protected]>> wrote:
>>
>> On Thu, Sep 17, 2009 at 4:43 PM, Michael Lawrence
>> <[email protected] <mailto:[email protected]>>
>> wrote:
>>
>> On Thu, Sep 17, 2009 at 8:14 AM, Ivan Gregoretti
>> <[email protected] <mailto:[email protected]>>
>>
>> wrote:
>>
>>
>> Hello everybody
>>
>> How do you convert BED formatted files to WIG
>> files?
>>
>>
>> I tried to do that with rtracklayer but it
>> didn't quite succeed.
>>
>> This is the session's transcript:
>>
>> First, you can download
>> http://dl.getdropbox.com/u/2051155/myRegions.bed,
>> which looks like
>> this
>>
>> chr1 3002444 3002479 +
>> chr1 3002989 3003024 -
>> chr1 3017603 3017638 +
>> chr1 3017879 3017914 -
>> chr1 3018173 3018208 +
>> chr1 3018183 3018218 -
>> chr1 3018183 3018218 -
>> chr1 3019065 3019100 +
>> chr1 3019761 3019796 -
>> chr1 3020044 3020079 -
>> ...
>>
>> Now to R
>>
>> suppressMessages(library(rtracklayer))
>> myRegions <- import('myRegions.bed')
>>
>> So far so good. Now I try:
>> export(myRegions, 'myRegions.wig', format = 'wig')
>>
>> but I get:
>> Error in export.ucsc(object, con, subformat,
>> ...) :
>> Track not compatible with WIG format:
>> Overlapping features must be
>> on separate strands and every feature width
>> must be positive
>>
>> I seems that the error message is a feature
>> rather than a bug. My
>> interpretation is that export() does not like
>> records like lines 6 and
>> 7:
>> chr1 3018183 3018218 -
>> chr1 3018183 3018218 -
>>
>> So, how do you convert BED to WIG in your
>> everyday work?
>>
>>
>>
>> Well, as rtracklayer said, this track is not
>> representable by WIG, which
>> is
>> meant to communicate a single data value for a
>> given genomic region (to
>> generate the bar plots in the UCSC browser, for
>> instance). There aren't
>> any
>> scores in your file, so why do you want to use WIG?
>>
>>
>> Just guessing, here, but you may want to calculate
>> coverage() first,
>>
>> Ivan?
>>
>> Sean
>>
>>
>> Thank you,
>>
>> Ivan
>>
>>
>> sessionInfo()
>>
>> R version 2.9.2 (2009-08-24)
>> x86_64-redhat-linux-gnu
>>
>> locale:
>>
>>
>>
>>
>>
>>
>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils
>> datasets methods base
>>
>> other attached packages:
>> [1] rtracklayer_1.4.0 RCurl_0.94-1
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.4.0 Biostrings_2.12.0
>> BSgenome_1.12.0
>>
>> IRanges_1.2.0
>>
>> [5] tools_2.9.2 XML_2.3-0
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive
>> and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> <mailto:[email protected]>
>>
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> <mailto:[email protected]>
>>
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> <mailto:[email protected]>
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> <mailto:[email protected]>
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>>
>>
>
[[alternative HTML version deleted]]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing