Re: [Bioc-sig-seq] rtracklayer and import()ing into GRanges

Patrick Aboyoun Thu, 05 Aug 2010 17:06:46 -0700

I just checked in a patch to rtracklayer version 1.9.6 into BioC 2.7 
that added asRangedData arguments to all of the import methods except 
import.bw as well as to the GenomicData function. When asRangedData = 
FALSE, these import methods will produce GRanges objects. I didn't 
optimize any of the code yet, so there may be another rev tomorrow with 
some tweaks to the import code at the R level. Here are some examples:


> library(rtracklayer)
> bedfilepath <- system.file("tests", "test.bed", package = "rtracklayer")
> import(bedfilepath, asRangedData = FALSE)
GRanges with 9 ranges and 5 elementMetadata values
    seqnames                 ranges strand |        name     score thickStart
       <Rle>              <IRanges>  <Rle> | <character> <numeric>  <integer>
[1]     chr7 [127471197, 127472363]      + |        Pos1         0  127471196
[2]     chr7 [127472364, 127473530]      + |        Pos2         0  127472363
[3]     chr7 [127473531, 127474697]      + |        Pos3         0  127473530
[4]     chr7 [127474698, 127475864]      + |        Pos4         0  127474697
[5]     chr7 [127475865, 127477031]      - |        Neg1         0  127475864
[6]     chr7 [127477032, 127478198]      - |        Neg2         0  127477031
[7]     chr7 [127478199, 127479365]      - |        Neg3         0  127478198
[8]     chr7 [127479366, 127480532]      + |        Pos5         0  127479365
[9]     chr7 [127480533, 127481699]      - |        Neg4         0  127480532
     thickEnd     itemRgb
    <integer> <character>
[1] 127472363     #FF0000
[2] 127473530     #FF0000
[3] 127474697     #FF0000
[4] 127475864     #FF0000
[5] 127477031     #0000FF
[6] 127478198     #0000FF
[7] 127479365     #0000FF
[8] 127480532     #FF0000
[9] 127481699     #0000FF

seqlengths
 chr7
   NA
> bed15filepath <- system.file("tests", "test.bed15", package = "rtracklayer")
> import(bed15filepath, asRangedData = FALSE)
GRanges with 2 ranges and 13 elementMetadata values
    seqnames                 ranges strand |        name     score thickStart
       <Rle>              <IRanges>  <Rle> | <character> <numeric>  <integer>
[1]     chr1 [159639973, 159640031]      - |     2440848       500  159639972
[2]     chr1 [159640162, 159640190]      - |     2440849       500  159640161
     thickEnd     itemRgb blockCount  blockSizes blockStarts  breast_A
    <integer> <character>  <integer> <character> <character> <numeric>
[1] 159640031          NA          1         59,          0,     0.593
[2] 159640190          NA          1         29,          0,    -0.906
     breast_B  breast_C cerebellum_A cerebellum_B
    <numeric> <numeric>    <numeric>    <numeric>
[1]     1.196    -0.190       -1.088        0.093
[2]    -1.247     0.111       -0.515       -0.057

seqlengths
 chr1
   NA
> sessionInfo()
R version 2.12.0 Under development (unstable) (2010-08-01 r52659)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rtracklayer_1.9.6 RCurl_1.4-3       bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] Biobase_2.9.0        Biostrings_2.17.27   BSgenome_1.17.6
[4] GenomicRanges_1.1.20 IRanges_1.7.15       tools_2.12.0
[7] XML_3.1-0


Patrick


On 8/5/10 1:11 PM, Michael Lawrence wrote:
>
>
> On Thu, Aug 5, 2010 at 10:45 AM, Patrick Aboyoun <[email protected]>
> wrote:
>
>     Michael,
>     I just made a minor check-in to rtracklayer where I replaced use
>     of Biobase::listLen with IRanges::elementLengths in an effort to
>     minimize the impact of Biobase on the sequence package stack.
>
>
> Ok. It looks like elementLengths has been optimized since the last 
> time I looked.
>
>
>     Before I start the boulder rolling, how should I reconcile the
>     UCSCData class with the GRanges class? Once I have that sorted I
>     can make changes to import.bed and import.wig as well.
>
>
> Well, eventually we'll want to stick the track line information on to 
> GRanges. Could be done via a subclass like with UCSCData. metadata() 
> is another option. I do actually use the subclass for dispatch 
> purposes, pretty printing, etc. For right now though, the extra 
> information could just be dropped if the user requests a GRanges.
>
>     I originally named the argument asRangedData in the BSgenome
>     methods to reinforce that RangedData output is not intended to be
>     the default and conceptually the user is making an extra effort to
>     produce a RangedData object.
>
>
>     Patrick
>
>
>
>     On 8/5/10 4:32 AM, Michael Lawrence wrote:
>>     Makes sense. But why not make it asGRanges, which is shorter?
>>     Please go ahead and check in your work so far.
>>
>>     Thanks a lot,
>>     Michael
>>
>>     On Thu, Aug 5, 2010 at 12:51 AM, Patrick Aboyoun
>>     <[email protected]> wrote:
>>
>>         Michael,
>>         Breaking this down to two issues:
>>
>>         Filtering
>>         Martin has been working on improving filtering in the
>>         ShortRead package, moving from reading everything and then
>>         filtering to filtering block by block as the data is
>>         processed. Lessons learned there can be brought to
>>         rtracklayer for large bed files and the like.
>>
>>         import() output class
>>         Keeping the same API and just switching the import methods
>>         from producing RangedData (or UCSCData) output to GRanges
>>         output will break backward compatibility because the
>>         RangedData API is not wholly applicable to GRanges objects. I
>>         would not recommend this course since a number of packages in
>>         BioC and scripts in the wild expect the import methods to
>>         produce a RangedData (or UCSCData) object. An additional
>>         argument is not that onerous and can be phased out over the
>>         course of two or three releases (1 - 1.5 years). Another
>>         alternative is to add a new import function (read.GRanges?)
>>         to rtracklayer that shares the same infrastructure as the
>>         existing import methods.
>>
>>         I have a local copy of rtracklayer where I added a new
>>         asRangedData flag to the GenomicData function and import.gff*
>>         methods. I'll sit on this for now since these changes didn't
>>         take a lot of work. This is one of those situations where
>>         managing the life cycle of the function specs is trickier
>>         than making the desired code changes.
>>
>>
>>         Cheers,
>>         Patrick
>>
>>
>>
>>         On 8/4/10 8:24 PM, Michael Lawrence wrote:
>>>         This might work, but it seems like an expensive optimization
>>>         in that it changes a lot of the API. If someone cannot make
>>>         a single copy of the data, it's unlikely that they're even
>>>         going to be able to get to GenomicData() or manipulate it
>>>         later. Perhaps the coercion function needs some simple
>>>         tweaks? The filter support would definitely help. I'd rather
>>>         keep things simple and return a single type, and GRanges
>>>         sounds most appropriate.
>>>
>>>         But I'm open to suggestions and further argument.
>>>
>>>         Michael
>>>
>>>         On Wed, Aug 4, 2010 at 2:05 PM, Patrick Aboyoun
>>>         <[email protected]> wrote:
>>>
>>>             Michael,
>>>             How integrated would you like to see the GRanges class
>>>             in rtracklayer? The rtracklayer::GenomicData constructor
>>>             is the master instantiator. I would like to add an
>>>             asRangedData = TRUE (default) argument to the
>>>             GenomicData function and push it all the way up through
>>>             the import functions, so that when the user sets
>>>             asRangedData = FALSE, the GenomicData function creates
>>>             a GRanges object. This is what we did with the
>>>             {matchPWM,vmatchPattern,vmatchPDict},BSgenome-methods in
>>>             the BSgenome package, and it is as good a solution as
>>>             any. This is a straightforward change and wouldn't take
>>>             too long to complete.
>>>
>>>
>>>             Patrick
>>>
>>>
>>>
>>>             On 8/4/10 12:56 PM, Michael Lawrence wrote:
>>>
>>>                 GRanges support is definitely on the TODO list.
>>>                 Filters are a good idea and
>>>                 also on the TODO list, possibly with a chunk size
>>>                 parameter to enable chunk
>>>                 processing.
>>>
>>>                 I'd love to have the GRanges stuff at least done by
>>>                 the next release.
>>>                 Patches welcome, of course :)
>>>
>>>                 Michael
>>>
>>>                 On Wed, Aug 4, 2010 at 8:08 AM, Ivan
>>>                 Gregoretti <[email protected]> wrote:
>>>
>>>
>>>                     Hello Michael and everyone,
>>>
>>>                     Would you please consider adding to import() the
>>>                     capacity to generate
>>>                     a GRanges object rather than the default
>>>                     RangedData object?
>>>
>>>                     Also,
>>>
>>>                     Wouldn't it be great to be able to import() with
>>>                     filters just like
>>>                     with readAligned()?
>>>
>>>
>>>
>>>                     Justification
>>>
>>>                     GRanges is a biology-aware container. When
>>>                     importing large BEDs into
>>>                     R, the current workflow involves creating
>>>                     RangedData first and then
>>>                     converting to GRanges.
>>>
>>>                     If the BEDs are really big, holding both objects
>>>                     in memory at any
>>>                     point in time is a hardware challenge.
>>>
>>>                     The capacity to filter the input would help in
>>>                     this case and in
>>>                     general it would provide an increase in efficiency.
>>>
>>>
>>>                     Thank you,
>>>
>>>                     Ivan
>>>
>>>
>>>
>>>
>>>                     Ivan Gregoretti, PhD
>>>                     National Institute of Diabetes and Digestive and
>>>                     Kidney Diseases
>>>                     National Institutes of Health
>>>                     5 Memorial Dr, Building 5, Room 205.
>>>                     Bethesda, MD 20892. USA.
>>>                     Phone: 1-301-496-1016 and 1-301-496-1592
>>>                     Fax: 1-301-496-9878
>>>
>>>                     _______________________________________________
>>>                     Bioc-sig-sequencing mailing list
>>>                     [email protected]
>>>                     https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>>                        [[alternative HTML version deleted]]
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>


