On Thu, Sep 17, 2009 at 4:35 PM, Patrick Aboyoun <[email protected]
<mailto:[email protected]>> wrote:
Michael,
We could add that level of flexibility, but I worry that may open
the door too widely because the aggregator will need to take the
old RangedData object and the new block RangedData object as
arguments and produce a new RangedData object for the next
iteration and users may find it difficult to construct a valid
aggregator.
Well just to clarify, I thought the aggregator would take the old and
new objects as output by the block processor. So for coverage, one
function would take a RangedData and return a RleList, and the
aggregator would take two RleLists and add them.
The call for calculating the coverage would be like:
cov <- import("reads.bed", blockSize = 500000, blockFun = coverage,
aggregateFun = `+`)
Of course, high-level functionality would be nice for routine
operations. That will help prevent the user from making mistakes. The
blockFun and aggregateFun could be bound into a single class
BlockHandler.
cov <- import("reads.bed", blockSize = 500000, blockHandler =
CoverageHandler())
I think there may also be a need for filters to allow the user to
keep only those rows that satisfy their criteria. We could supply
chromosome, strand, score, etc. filters to meet this need.
Yes, and the Filter class could inherit from BlockHandler, so that it
works incrementally. import() could be smart enough to incorporate
both a filter and a user block handler into a single "meta" handler.
So to obtain an RleList of coverage, but only over chromosome 1,
processing half a million reads at a time:
cov <- import("reads.bed", blockSize = 500000, blockHandler =
CoverageHandler(), filter = ChromosomeFilter("chr1"))
Michael
Patrick
Michael Lawrence wrote:
I don't know if we want to add explicit support for scoring,
when there are many reasons why one might want to
incrementally load a track. One solution would be a design
based on callbacks. The user specifies the block size and a
callback to be invoked for each block. The result could be a
list, with each element holding the result for a block. The
user could then aggregate that. To be even more efficient, we
could support a second callback that aggregates at each step.
On Thu, Sep 17, 2009 at 3:47 PM, Patrick Aboyoun
<[email protected] <mailto:[email protected]>
<mailto:[email protected] <mailto:[email protected]>>> wrote:
Michael,
Before people start developing their own bed file parsers, it
looks pretty straight-forward to incorporate this operation in
rtracklayer now. You or I could add a scoreRanges argument,
which
takes either a logical or perhaps a character string if we
envision there being more than one way to score a set of
ranges,
to the import* functions and when a appropriate value is
supplied,
it will internally block process the data file and output a
RangedData object with a score column.
Patrick
Michael Lawrence wrote:
On Thu, Sep 17, 2009 at 2:17 PM, Ivan Gregoretti
<[email protected] <mailto:[email protected]>
<mailto:[email protected] <mailto:[email protected]>>> wrote:
Hi Michael,
True; there are no scores in this BED file. That
brings us
to Sean's
remark.
You may have noticed that all features in the BED
file are
36 bases
long. Each feature is the position and orientation of a
Solexa read.
I want the value in each bin in the WIG file to
represent
the number
of tags contained in that bin.
So, coverage() is close to what I am looking for but I
think that
pileup() is perhaps better. Then comes the problem
of how
to place
those values in an object exportable as WIG. Lets
not talk
about
efficiency for now.
Ignoring efficiency, it's pretty straight-forward. In the
development
version of IRanges, one can convert an Rle (or RleList for
multiple
chromosomes) as output to coverage() to a RangedData.
In the
development
version of rtracklayer, this is done implicitly:
export(cov, "coverage.wig")
I think that I can solve this problem myself but
given that:
1) there is a package that can read BED and write
WIG files
and
2) BED to WIG conversion is such a common task
I thought that there must have been an existing tool or
perhaps an
established efficient way of doing this with BioC
tools.
Why is this
important? Well, this myRegions.bed contains 4.3
million
records.
Loading them with import() used about 1GB of RAM.
My real
world BED
files are never less than 50 million reads, 150
more like.
Conclusion: I DO care about an expert's opinion of
BED-> WIG
conversion within R.
Maybe nobody does this with R. That is also acceptable.
What
tool do
you use then?
The bottleneck seems to be loading and representing
large BED
files, since
after the reduction to coverage the data size is usually
manageable.
rtracklayer was never designed to load BED files with
millions
of records.
In my experience, short reads are more commonly
represented in
more
efficient formats like those output by MAQ and bowtie,
which
are imported
with ShortRead.
There is much room for optimization in rtracklayer, but for
now one idea is
to calculate coverage on subsets of the reads, and
aggregate
the results.
The BED file can be loaded with simple low-level calls like
read.table and
scan.
Michael
Thank you,
Ivan
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney
Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878
On Thu, Sep 17, 2009 at 4:45 PM, Sean Davis
<[email protected] <mailto:[email protected]>
<mailto:[email protected] <mailto:[email protected]>>> wrote:
On Thu, Sep 17, 2009 at 4:43 PM, Michael Lawrence
<[email protected] <mailto:[email protected]>
<mailto:[email protected] <mailto:[email protected]>>>
wrote:
On Thu, Sep 17, 2009 at 8:14 AM, Ivan
Gregoretti
<[email protected]
<mailto:[email protected]> <mailto:[email protected]
<mailto:[email protected]>>>
wrote:
Hello everybody
How do you convert BED formatted files
to WIG
files?
I tried to do that with rtracklayer but it
didn't quite succeed.
This is the session's transcript:
First, you can download
http://dl.getdropbox.com/u/2051155/myRegions.bed,
which looks like
this
chr1 3002444 3002479 +
chr1 3002989 3003024 -
chr1 3017603 3017638 +
chr1 3017879 3017914 -
chr1 3018173 3018208 +
chr1 3018183 3018218 -
chr1 3018183 3018218 -
chr1 3019065 3019100 +
chr1 3019761 3019796 -
chr1 3020044 3020079 -
...
Now to R
suppressMessages(library(rtracklayer))
myRegions <- import('myRegions.bed')
So far so good. Now I try:
export(myRegions, 'myRegions.wig',
format = 'wig')
but I get:
Error in export.ucsc(object, con,
subformat,
...) :
Track not compatible with WIG format:
Overlapping features must be
on separate strands and every feature width
must be positive
I seems that the error message is a feature
rather than a bug. My
interpretation is that export() does
not like
records like lines 6 and
7:
chr1 3018183 3018218 -
chr1 3018183 3018218 -
So, how do you convert BED to WIG in your
everyday work?
Well, as rtracklayer said, this track is not
representable by WIG, which
is
meant to communicate a single data value for a
given genomic region (to
generate the bar plots in the UCSC browser, for
instance). There aren't
any
scores in your file, so why do you want to
use WIG?
Just guessing, here, but you may want to calculate
coverage() first,
Ivan?
Sean
Thank you,
Ivan
sessionInfo()
R version 2.9.2 (2009-08-24)
x86_64-redhat-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils
datasets methods base
other attached packages:
[1] rtracklayer_1.4.0 RCurl_0.94-1
loaded via a namespace (and not attached):
[1] Biobase_2.4.0 Biostrings_2.12.0
BSgenome_1.12.0
IRanges_1.2.0
[5] tools_2.9.2 XML_2.3-0
Ivan Gregoretti, PhD
National Institute of Diabetes and
Digestive
and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
<mailto:[email protected]>
<mailto:[email protected]
<mailto:[email protected]>>
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
[[alternative HTML version deleted]]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
<mailto:[email protected]>
<mailto:[email protected]
<mailto:[email protected]>>
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
<mailto:[email protected]>
<mailto:[email protected]
<mailto:[email protected]>>
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
[[alternative HTML version deleted]]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
<mailto:[email protected]>
<mailto:[email protected]
<mailto:[email protected]>>
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing