I just checked a patch into rtracklayer (version 1.9.6, BioC 2.7) that
adds an asRangedData argument to the GenomicData function and to all of
the import methods except import.bw. When asRangedData = FALSE, these
import methods produce GRanges objects. I haven't optimized any of the
code yet, so there may be another rev tomorrow with some tweaks to the
import code at the R level. Here are some examples:
> library(rtracklayer)
> bedfilepath <- system.file("tests", "test.bed", package = "rtracklayer")
> import(bedfilepath, asRangedData = FALSE)
GRanges with 9 ranges and 5 elementMetadata values
    seqnames                 ranges strand |        name     score thickStart
       <Rle>              <IRanges>  <Rle> | <character> <numeric>  <integer>
[1]     chr7 [127471197, 127472363]      + |        Pos1         0  127471196
[2]     chr7 [127472364, 127473530]      + |        Pos2         0  127472363
[3]     chr7 [127473531, 127474697]      + |        Pos3         0  127473530
[4]     chr7 [127474698, 127475864]      + |        Pos4         0  127474697
[5]     chr7 [127475865, 127477031]      - |        Neg1         0  127475864
[6]     chr7 [127477032, 127478198]      - |        Neg2         0  127477031
[7]     chr7 [127478199, 127479365]      - |        Neg3         0  127478198
[8]     chr7 [127479366, 127480532]      + |        Pos5         0  127479365
[9]     chr7 [127480533, 127481699]      - |        Neg4         0  127480532
     thickEnd     itemRgb
    <integer> <character>
[1] 127472363     #FF0000
[2] 127473530     #FF0000
[3] 127474697     #FF0000
[4] 127475864     #FF0000
[5] 127477031     #0000FF
[6] 127478198     #0000FF
[7] 127479365     #0000FF
[8] 127480532     #FF0000
[9] 127481699     #0000FF

seqlengths
 chr7
   NA
> bed15filepath <- system.file("tests", "test.bed15", package = "rtracklayer")
> import(bed15filepath, asRangedData = FALSE)
GRanges with 2 ranges and 13 elementMetadata values
    seqnames                 ranges strand |        name     score thickStart
       <Rle>              <IRanges>  <Rle> | <character> <numeric>  <integer>
[1]     chr1 [159639973, 159640031]      - |     2440848       500  159639972
[2]     chr1 [159640162, 159640190]      - |     2440849       500  159640161
     thickEnd     itemRgb blockCount  blockSizes blockStarts  breast_A
    <integer> <character>  <integer> <character> <character> <numeric>
[1] 159640031          NA          1         59,          0,     0.593
[2] 159640190          NA          1         29,          0,    -0.906
     breast_B  breast_C cerebellum_A cerebellum_B
    <numeric> <numeric>    <numeric>    <numeric>
[1]     1.196    -0.190       -1.088        0.093
[2]    -1.247     0.111       -0.515       -0.057

seqlengths
 chr1
   NA
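
For anyone wondering how the flag travels from import() down to the
constructor, here is a minimal sketch of the pattern in plain R. It is not
the rtracklayer source: ToyRangedData, ToyGRanges, toyGenomicData and
toyImportBed are hypothetical stand-ins invented for illustration, and the
real classes carry far more than a data.frame of ranges.

```r
library(methods)

# Hypothetical stand-ins for the two container classes; these are NOT the
# real RangedData/GRanges classes from IRanges and GenomicRanges.
setClass("ToyRangedData", representation(ranges = "data.frame"))
setClass("ToyGRanges", representation(ranges = "data.frame"))

# A single constructor decides the output class, mirroring the role that
# GenomicData plays in the description above.
toyGenomicData <- function(ranges, asRangedData = TRUE) {
  if (asRangedData) {
    new("ToyRangedData", ranges = ranges)
  } else {
    new("ToyGRanges", ranges = ranges)
  }
}

# An import-style wrapper that parses tab-separated BED lines and forwards
# the flag to the constructor unchanged.
toyImportBed <- function(lines, asRangedData = TRUE) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  ranges <- data.frame(
    chrom = vapply(fields, `[`, character(1), 1),
    # BED starts are 0-based, so add 1 to get 1-based inclusive starts
    start = as.integer(vapply(fields, `[`, character(1), 2)) + 1L,
    end   = as.integer(vapply(fields, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
  toyGenomicData(ranges, asRangedData = asRangedData)
}

bedLines <- c("chr7\t127471196\t127472363", "chr7\t127472363\t127473530")
oldStyle <- toyImportBed(bedLines)                        # default class
newStyle <- toyImportBed(bedLines, asRangedData = FALSE)  # opt-in class
```

Because the default stays TRUE in this sketch, existing callers keep
getting the old class and only callers who pass asRangedData = FALSE see
the new one.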
> sessionInfo()
R version 2.12.0 Under development (unstable) (2010-08-01 r52659)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rtracklayer_1.9.6 RCurl_1.4-3 bitops_1.0-4.1
loaded via a namespace (and not attached):
[1] Biobase_2.9.0 Biostrings_2.17.27 BSgenome_1.17.6
[4] GenomicRanges_1.1.20 IRanges_1.7.15 tools_2.12.0
[7] XML_3.1-0
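
This patch does not touch the filtering half of Ivan's request (quoted
below). Purely as an illustration of the chunk-processing idea mentioned
in the thread, here is a base R sketch that reads a tab-separated file a
block at a time and keeps only the rows a predicate accepts.
readBedFiltered, its keep and chunkSize arguments, and the four-column
layout are all assumptions made for the example, not an rtracklayer or
ShortRead API.

```r
# Hypothetical helper: read a four-column BED-like file `chunkSize` lines at
# a time and keep only the rows accepted by `keep`, so the unfiltered data
# is never held in memory all at once.
readBedFiltered <- function(path, keep, chunkSize = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    lines <- readLines(con, n = chunkSize)
    if (length(lines) == 0L) break  # end of file reached
    chunk <- read.table(text = lines, sep = "\t", header = FALSE,
                        quote = "", comment.char = "",
                        stringsAsFactors = FALSE,
                        col.names = c("chrom", "start", "end", "name"))
    # `keep` returns a logical vector with one entry per row of the chunk
    kept[[length(kept) + 1L]] <- chunk[keep(chunk), , drop = FALSE]
  }
  # Returns NULL when the file is empty
  do.call(rbind, kept)
}

# Small demonstration on a temporary file.
tmp <- tempfile(fileext = ".bed")
writeLines(c("chr7\t100\t200\tPos1",
             "chr1\t300\t400\tNeg1",
             "chr7\t500\t600\tPos2"), tmp)
onlyChr7 <- readBedFiltered(tmp, function(x) x$chrom == "chr7",
                            chunkSize = 2L)
```

Peak memory in this sketch is bounded by one chunk plus the rows kept so
far, which is the property that matters for the very large BED files
described in the original request.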
Patrick
On 8/5/10 1:11 PM, Michael Lawrence wrote:
>
>
> On Thu, Aug 5, 2010 at 10:45 AM, Patrick Aboyoun <[email protected]
> <mailto:[email protected]>> wrote:
>
> Michael,
> I just made a minor check-in to rtracklayer where I replaced use
> of Biobase::listLen with IRanges::elementLengths in an effort to
> minimize the impact of Biobase on the sequence package stack.
>
>
> Ok. It looks like elementLengths has been optimized since the last
> time I looked.
>
>
> Before I start the boulder rolling, how should I reconcile the
> UCSCData class with the GRanges class? Once I have that sorted I
> can make changes to import.bed and import.wig as well.
>
>
> Well, eventually we'll want to stick the track line information on to
> GRanges. Could be done via a subclass like with UCSCData. metadata()
> is another option. I do actually use the subclass for dispatch
> purposes, pretty printing, etc. For right now though, the extra
> information could just be dropped if the user requests a GRanges.
>
> I originally named the argument asRangedData in the BSgenome
> methods to reinforce that RangedData output is not intended to be
> the default and conceptually the user is making an extra effort to
> produce a RangedData object.
>
>
> Patrick
>
>
>
> On 8/5/10 4:32 AM, Michael Lawrence wrote:
>> Makes sense. But why not make it asGRanges, which is shorter?
>> Please go ahead and check in your work so far.
>>
>> Thanks a lot,
>> Michael
>>
>> On Thu, Aug 5, 2010 at 12:51 AM, Patrick Aboyoun
>> <[email protected] <mailto:[email protected]>> wrote:
>>
>> Michael,
>> Breaking this down to two issues:
>>
>> Filtering
>> Martin has been working on improving filtering in the
>> ShortRead package, moving from reading everything and then
>> filtering to a block-processing-based filtering methodology.
>> Lessons learned there can be brought to rtracklayer for large
>> BED files and the like.
>>
>> import() output class
>> Keeping the same API and just switching the import methods
>> from producing RangedData (or UCSCData) output to GRanges
>> output will break backward compatibility because the
>> RangedData API is not wholly applicable to GRanges objects. I
>> would not recommend this course since a number of packages in
>> BioC and scripts in the wild expect the import methods to
>> produce a RangedData (or UCSCData) object. An additional
>> argument is not that onerous and can be phased out over the
>> course of two or three releases (1 - 1.5 years). Another
>> alternative is to add a new import function (read.GRanges?)
>> to rtracklayer that shares the same infrastructure as the
>> existing import methods.
>>
>> I have a local copy of rtracklayer where I added a new
>> asRangedData flag to the GenomicData function and import.gff*
>> methods. I'll sit on this for now since these changes didn't
>> take a lot of work. This is one of the situations where
>> managing the life cycle of the function specs is trickier
>> than making the desired code changes.
>>
>>
>> Cheers,
>> Patrick
>>
>>
>>
>> On 8/4/10 8:24 PM, Michael Lawrence wrote:
>>> This might work, but it seems like an expensive optimization
>>> in that it changes a lot of the API. If someone cannot make
>>> a single copy of the data, it's unlikely that they're even
>>> going to be able to get to GenomicData() or manipulate it
>>> later. Perhaps the coercion function needs some simple
>>> tweaks? The filter support would definitely help. I'd rather
>>> keep things simple and return a single type, and GRanges
>>> sounds most appropriate.
>>>
>>> But I'm open to suggestions and further argument.
>>>
>>> Michael
>>>
>>> On Wed, Aug 4, 2010 at 2:05 PM, Patrick Aboyoun
>>> <[email protected] <mailto:[email protected]>> wrote:
>>>
>>> Michael,
>>> How integrated would you like to see the GRanges class
>>> in rtracklayer? The rtracklayer::GenomicData constructor
>>> is the master instantiator. I would like to add an
>>> asRangedData = TRUE (default) argument to the
>>> GenomicData function and push it all the way up through
>>> the import functions where when the user sets
>>> asRangedData = FALSE, the GenomicData function would
>>> create a GRanges object. This is what we did with the
>>> {matchPWM,vmatchPattern,vmatchPDict},BSgenome-methods in
>>> the BSgenome package and it is as good a solution as any.
>>> This is a straightforward change and wouldn't take too
>>> long to complete.
>>>
>>>
>>> Patrick
>>>
>>>
>>>
>>> On 8/4/10 12:56 PM, Michael Lawrence wrote:
>>>
>>> GRanges support is definitely on the TODO list.
>>> Filters are a good idea and
>>> also on the TODO list, possibly with a chunk size
>>> parameter to enable chunk
>>> processing.
>>>
>>> I'd love to have the GRanges stuff at least done by
>>> the next release.
>>> Patches welcome, of course :)
>>>
>>> Michael
>>>
>>> On Wed, Aug 4, 2010 at 8:08 AM, Ivan
>>> Gregoretti<[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>>
>>> Hello Michael and everyone,
>>>
>>> Would you please consider adding to import() the
>>> capacity to generate
>>> a GRanges object rather than the default
>>> RangedData object?
>>>
>>> Also,
>>>
>>> Wouldn't it be great to be able to import() with
>>> filters just like
>>> with readAligned()?
>>>
>>>
>>>
>>> Justification
>>>
>>> GRanges is a biology-aware container. When
>>> importing large BEDs into
>>> R, the current workflow involves creating
>>> RangedData first and then
>>> converting to GRanges.
>>>
>>> If the BEDs are really big, holding both objects
>>> in memory at any
>>> point in time is a hardware challenge.
>>>
>>> The capacity to filter the input would help in
>>> this case and in
>>> general it would provide an increase in efficiency.
>>>
>>>
>>> Thank you,
>>>
>>> Ivan
>>>
>>>
>>>
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and
>>> Kidney Diseases
>>> National Institutes of Health
>>> 5 Memorial Dr, Building 5, Room 205.
>>> Bethesda, MD 20892. USA.
>>> Phone: 1-301-496-1016 and 1-301-496-1592
>>> Fax: 1-301-496-9878
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> [email protected]
>>> <mailto:[email protected]>
>>>
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>