On Tue, Apr 27, 2010 at 10:33 AM, Ivan Gregoretti <[email protected]> wrote: > Patrick, > > In my opinion, there is no perfect solution, only compromise solution. > > There is a group of java-based programs called Picard. It is a package > designed to work with SAM/BAM tool. It is good and well established. > In my opinion, it is 'the' sam/bam toolbox. > > In that package, the problem of DNA circularity is also present. They > way Picard deals with this problem is by providing an option to > override the over-the-bound check. As long as the user knows that some > DNAs are circular, everything is OK. > > The BAM header contains the chromosome lengths and yet it does hold > reads that lie beyond the sequence boundaries.
Actually, the edge case is general as alignments, even on linear chromosomes, may extend beyond the end of the chromosome, I believe. In the best case, these alignments are clipped (in CIGAR terms), but I don't know that all aligners are doing that appropriately. Sean > I vote for an overriding option rather than an infrastructure overhaul. > > Thank you, > > Ivan > > > Ivan Gregoretti, PhD > National Institute of Diabetes and Digestive and Kidney Diseases > National Institutes of Health > > > > On Mon, Apr 26, 2010 at 6:51 PM, Patrick Aboyoun <[email protected]> wrote: >> Ivan, >> The chr4 and chr4_random type issue is easily solved by adding the length of >> chr?_random to chr? for each chromosome. Perhaps we can even add an argument >> to seqlengths,BSgenome-method to perform that collapsing. >> >> The case of circular DNA (e.g. mitochondrial) is a bit trickier since our >> infrastructure assumes linear DNA. To address that case we would first have >> to create a linearized DNA coordinate system and then remap all those >> alignments that fall into the extended region. That is unless people really >> want to work in circular coordinates, which would involve numerous changes >> to the infrastructure. >> >> I am open to any suggestions. >> >> >> Patrick >> >> >> >> On 4/26/10 2:17 PM, Ivan Gregoretti wrote: >>> >>> Hi Patrick. >>> >>> Sure. Here is the output. To the best of my understanding, most >>> aligners assume that, for instance, 'ch4_random' is just a linear >>> continuation of 'chr4'. >>> >>> Thank you, >>> >>> Ivan >>> >>> >>> >>>> >>>> range(A) >>>> >>> >>> GRanges with 67 ranges and 0 elementMetadata values >>> seqnames ranges strand | >>> <Rle> <IRanges> <Rle> | >>> [1] chr1 [3000311, 197195093] + | >>> [2] chr1 [2999944, 197195305] - | >>> [3] chr10 [3000180, 129992771] + | >>> [4] chr10 [3000092, 129993250] - | >>> [5] chr11 [3000006, 121843794] + | >>> [6] chr11 [3000010, 121843708] - | >>> [7] chr12 [3000095, 121257390] + | >>> [8] chr12 [3000144, 121257410] - | >>> [9] chr13 [3000434, 120284396] + | >>> ... ... ... ... ... >>> [59] chrUn_random [ -21, 5900319] - | >>> [60] chrX [3000042, 166627000] + | >>> [61] chrX [2999944, 166650266] - | >>> [62] chrX_random [ 326, 1051594] + | >>> [63] chrX_random [ 229, 1309438] - | >>> [64] chrY [ 150, 2902454] + | >>> [65] chrY [ -2, 2902181] - | >>> [66] chrY_random [ 7539, 58497416] + | >>> [67] chrY_random [ 1247, 58489658] - | >>> >>> seqlengths >>> chr1 chr10 chr11 ... chrY chrY_random >>> NA NA NA ... NA NA >>> >>> >>> >>>> >>>> range(A[seqnames(A)=='chrM']) >>>> >>> >>> GRanges with 2 ranges and 0 elementMetadata values >>> seqnames ranges strand | >>> <Rle> <IRanges> <Rle> | >>> [1] chrM [ 4, 16387] + | >>> [2] chrM [-82, 16298] - | >>> >>> >>> >>>> >>>> range(B) >>>> >>> >>> GRanges with 65 ranges and 0 elementMetadata values >>> seqnames ranges strand | >>> <Rle> <IRanges> <Rle> | >>> [1] chr1 [3000506, 197194916] + | >>> [2] chr1 [3000977, 197194382] - | >>> [3] chr10 [3000626, 129989507] + | >>> [4] chr10 [3000643, 129988346] - | >>> [5] chr11 [3000057, 121843783] + | >>> [6] chr11 [3000010, 121843699] - | >>> [7] chr12 [3001370, 121256864] + | >>> [8] chr12 [3000135, 121257132] - | >>> [9] chr13 [3000103, 120284379] + | >>> ... ... ... ... ... >>> [57] chrUn_random [ 125, 5900272] - | >>> [58] chrX [3001345, 166614519] + | >>> [59] chrX [3000031, 166628394] - | >>> [60] chrX_random [ 1177, 1051588] + | >>> [61] chrX_random [ 254, 1051506] - | >>> [62] chrY [ 141, 2902458] + | >>> [63] chrY [ 744, 2901128] - | >>> [64] chrY_random [ 39555, 58472198] + | >>> [65] chrY_random [ 71505, 58487987] - | >>> >>> seqlengths >>> chr1 chr10 chr11 ... chrY chrY_random >>> NA NA NA ... NA NA >>> >>> >>> >>>> >>>> range(B[seqnames(B)=='chrM']) >>>> >>> >>> GRanges with 2 ranges and 0 elementMetadata values >>> seqnames ranges strand | >>> <Rle> <IRanges> <Rle> | >>> [1] chrM [ 5, 16385] + | >>> [2] chrM [-77, 16299] - | >>> >>> >>> >>>> >>>> range(C) >>>> >>> >>> GRanges with 67 ranges and 0 elementMetadata values >>> seqnames ranges strand | >>> <Rle> <IRanges> <Rle> | >>> [1] chr1 [3000125, 197195029] + | >>> [2] chr1 [2999981, 197194997] - | >>> [3] chr10 [3000098, 129993293] + | >>> [4] chr10 [3000105, 129993219] - | >>> [5] chr11 [3000007, 121843789] + | >>> [6] chr11 [3000010, 121843707] - | >>> [7] chr12 [3000113, 121257571] + | >>> [8] chr12 [3000011, 121257521] - | >>> [9] chr13 [3000163, 120284384] + | >>> ... ... ... ... ... >>> [59] chrUn_random [ -75, 5900330] - | >>> [60] chrX [3000056, 166650160] + | >>> [61] chrX [2999951, 166650076] - | >>> [62] chrX_random [ 327, 1200747] + | >>> [63] chrX_random [ 236, 1508524] - | >>> [64] chrY [ 38, 2902563] + | >>> [65] chrY [ -80, 2902221] - | >>> [66] chrY_random [ 5699, 58492209] + | >>> [67] chrY_random [ 7154, 58486839] - | >>> >>> seqlengths >>> chr1 chr10 chr11 ... chrY chrY_random >>> NA NA NA ... NA NA >>> >>> >>> >>>> >>>> range(C[seqnames(C)=='chrM']) >>>> >>> >>> GRanges with 2 ranges and 0 elementMetadata values >>> seqnames ranges strand | >>> <Rle> <IRanges> <Rle> | >>> [1] chrM [ 1, 16387] + | >>> [2] chrM [-81, 16332] - | >>> >>> >>> library(BSgenome.Mmusculus.UCSC.mm9) >>> >>>> >>>> length(Mmusculus$chrM) >>>> >>> >>> [1] 16299 >>> >>> >>> >>> Ivan Gregoretti, PhD >>> National Institute of Diabetes and Digestive and Kidney Diseases >>> National Institutes of Health >>> >>> >>> >>> On Mon, Apr 26, 2010 at 4:35 PM, Patrick Aboyoun<[email protected]> >>> wrote: >>> >>>> >>>> Ivan, >>>> Could you provide me with the results of >>>> >>>> range(Z) >>>> >>>> for all three of the GRanges objects that seqlengths<- throws and error >>>> on? >>>> I would like to see if there is some adjustment we can make to the >>>> seqlengths<- function that will resolve your issue. >>>> >>>> Thanks, >>>> Patrick >>>> >>>> >>>> On 4/26/10 12:57 PM, Ivan Gregoretti wrote: >>>> >>>>> >>>>> Hi Patrick, >>>>> >>>>> You are correct. The validity check is rejecting the input. The >>>>> problem with that is that I have three set, all of them failing perhap >>>>> because one or two tags are out of bounds. >>>>> >>>>> This tags are coming straight from the Illumina sequencer. There is >>>>> no manipulation. Perhaps it is complaining because the validity check >>>>> does not know that Mitochondrial DNA is circular. >>>>> >>>>> Bottom line, anybody attempting to load the data as GRanges from a >>>>> high coverage tag set will find this seqlengths() rejection. >>>>> >>>>> Is there any way to override it or should I just refrain from >>>>> assigning chromosome lengths? The online vignette does not mention if >>>>> there is a switch to the validity check. >>>>> >>>>> Thank you, >>>>> >>>>> Ivan >>>>> >>>>> >>>>> >>>>> Ivan Gregoretti, PhD >>>>> National Institute of Diabetes and Digestive and Kidney Diseases >>>>> National Institutes of Health >>>>> 5 Memorial Dr, Building 5, Room 205. >>>>> Bethesda, MD 20892. USA. >>>>> Phone: 1-301-496-1592 >>>>> Fax: 1-301-496-9878 >>>>> >>>>> >>>>> >>>>> On Mon, Apr 26, 2010 at 3:42 PM, Patrick Aboyoun<[email protected]> >>>>> wrote: >>>>> >>>>> >>>>>> >>>>>> Ivan, >>>>>> As you probably already realized, the first error you encountered was >>>>>> do >>>>>> to >>>>>> a misuse of the seqlengths function since function objects (e.g. >>>>>> BSgenome) >>>>>> have no sequence lengths. >>>>>> >>>>>> # Here is the problem >>>>>> seqlengths(Z)<- seqlengths(BSgenome)[names(seqlengths(Z))] >>>>>> Error in function (classes, fdef, mtable) : >>>>>> unable to find an inherited method for function "seqlengths", for >>>>>> signature "function" >>>>>> >>>>>> The second error is the result of a failed validity check on the >>>>>> modified >>>>>> object. All ranges stored in a GRanges object must be between 1 and >>>>>> seqlengths(object) if the seqlengths information is non-NAs. >>>>>> >>>>>> # Here is a second attempt that also fails >>>>>> seqlengths(Z)<- seqlengths(Mmusculus)[names(seqlengths(Z))] >>>>>> Error in validObject(.Object) : >>>>>> invalid class "GRanges" object: slot 'ranges' contains values >>>>>> outside of sequence bounds >>>>>> >>>>>> Compare the results of range(Z), which returns the min start and max >>>>>> end >>>>>> for >>>>>> each of the seqnames in Z, and compare it with >>>>>> seqlengths(Mmusculus)[names(seqlengths(Z))]. This should provide you >>>>>> with >>>>>> some insight as to which ranges are out of bounds. Perhaps your >>>>>> intervals >>>>>> are 0-based instead of 1-based? >>>>>> >>>>>> >>>>>> Cheers, >>>>>> Patrick >>>>>> >>>>>> >>>>>> On 4/26/10 11:08 AM, Ivan Gregoretti wrote: >>>>>> >>>>>> >>>>>>> >>>>>>> Hello listers, >>>>>>> >>>>>>> Is anybody having trouble assigning seqlengths() to a GRanges instance >>>>>>> with the new GenomicRanges version? >>>>>>> >>>>>>> >>>>>>> This morning I upgraded my GenomicRanges from 0.0.9 to 0.1.17 and >>>>>>> since then I am unable to assign chromosome lengths to any of my tag >>>>>>> sets from my Illumina 36 nucleotide sequences. >>>>>>> >>>>>>> On Friday this worked. Let me show you how it complains now: >>>>>>> >>>>>>> Z<- import('millionsoftags.bed.gz', 'bed') >>>>>>> >>>>>>> Z<- as(Z, 'GRanges') >>>>>>> >>>>>>> class(Z) >>>>>>> [1] "GRanges" >>>>>>> attr(,"package") >>>>>>> [1] "GenomicRanges" >>>>>>> >>>>>>> Z >>>>>>> GRanges with 23293177 ranges and 2 elementMetadata values >>>>>>> seqnames ranges strand | >>>>>>> <Rle> <IRanges> <Rle> | >>>>>>> [1] chr1 [3000506, 3000541] + | >>>>>>> [2] chr1 [3001061, 3001096] - | >>>>>>> [3] chr1 [3001075, 3001110] - | >>>>>>> [4] chr1 [3001098, 3001133] + | >>>>>>> [5] chr1 [3001310, 3001345] + | >>>>>>> [6] chr1 [3001559, 3001594] + | >>>>>>> [7] chr1 [3001603, 3001638] + | >>>>>>> [8] chr1 [3001603, 3001638] + | >>>>>>> [9] chr1 [3001609, 3001644] - | >>>>>>> ... ... ... ... ... >>>>>>> [23293169] chrY_random [58402685, 58402720] + | >>>>>>> [23293170] chrY_random [58403358, 58403393] + | >>>>>>> [23293171] chrY_random [58406154, 58406189] + | >>>>>>> [23293172] chrY_random [58411077, 58411112] - | >>>>>>> [23293173] chrY_random [58430677, 58430712] + | >>>>>>> [23293174] chrY_random [58435117, 58435152] - | >>>>>>> [23293175] chrY_random [58472079, 58472114] + | >>>>>>> [23293176] chrY_random [58483725, 58483760] - | >>>>>>> [23293177] chrY_random [58487952, 58487987] - | >>>>>>> name score >>>>>>> <character> <numeric> >>>>>>> [1] HWI-EAS179_1:7:39:506:1302 96 >>>>>>> [2] HWI-EAS179_1:2:69:562:1539 119 >>>>>>> [3] HWI-EAS179_1:8:28:1327:394 119 >>>>>>> [4] HWI-EAS179_1:7:96:619:454 119 >>>>>>> [5] HWI-EAS179_49:3:4:1219:1729 119 >>>>>>> [6] HWI-EAS179_49:3:88:949:558 118 >>>>>>> [7] HWI-EAS179_1:7:60:1151:1790 119 >>>>>>> [8] HWI-EAS179_1:7:61:1586:147 114 >>>>>>> [9] HWI-EAS179_1:7:55:813:365 106 >>>>>>> ... ... ... >>>>>>> [23293169] HWI-EAS179_1:7:49:1416:1573 17 >>>>>>> [23293170] HWI-EAS179_1:8:25:405:1723 59 >>>>>>> [23293171] HWI-EAS179_1:7:75:1366:1224 25 >>>>>>> [23293172] HWI-EAS179_1:2:5:1338:80 5 >>>>>>> [23293173] HWI-EAS179_49:3:13:151:166 83 >>>>>>> [23293174] HWI-EAS179_49:3:29:1091:472 6 >>>>>>> [23293175] HWI-EAS179_1:2:69:1424:733 17 >>>>>>> [23293176] HWI-EAS179_1:7:16:945:1051 25 >>>>>>> [23293177] HWI-EAS179_1:7:74:1117:1801 14 >>>>>>> >>>>>>> seqlengths >>>>>>> chr1 chr10 chr11 ... chrY chrY_random >>>>>>> NA NA NA ... NA NA >>>>>>> >>>>>>> # Here is the problem >>>>>>> seqlengths(Z)<- seqlengths(BSgenome)[names(seqlengths(Z))] >>>>>>> Error in function (classes, fdef, mtable) : >>>>>>> unable to find an inherited method for function "seqlengths", for >>>>>>> signature "function" >>>>>>> >>>>>>> # Here is a second attempt that also fails >>>>>>> seqlengths(Z)<- seqlengths(Mmusculus)[names(seqlengths(Z))] >>>>>>> Error in validObject(.Object) : >>>>>>> invalid class "GRanges" object: slot 'ranges' contains values >>>>>>> outside of sequence bounds >>>>>>> >>>>>>> >>>>>>> As you can see, I haven't had the chance to mess the data. >>>>>>> Any idea how to circumvent this problem? >>>>>>> >>>>>>> Thank you, >>>>>>> >>>>>>> Ivan >>>>>>> >>>>>>> >>>>>>> sessionInfo() >>>>>>> R version 2.12.0 Under development (unstable) (2010-03-25 r51410) >>>>>>> x86_64-unknown-linux-gnu >>>>>>> >>>>>>> locale: >>>>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >>>>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] stats graphics grDevices utils datasets methods base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] BSgenome.Mmusculus.UCSC.mm9_1.3.16 BSgenome_1.15.21 >>>>>>> [3] Biostrings_2.15.27 GenomicRanges_0.1.17 >>>>>>> [5] IRanges_1.5.79 rtracklayer_1.7.12 >>>>>>> [7] RCurl_1.4-1 bitops_1.0-4.1 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] Biobase_2.7.6 tools_2.12.0 XML_2.8-1 >>>>>>> >>>>>>> >>>>>>> Ivan Gregoretti, PhD >>>>>>> National Institute of Diabetes and Digestive and Kidney Diseases >>>>>>> National Institutes of Health >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioc-sig-sequencing mailing list >>>>>>> [email protected] >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>> >>>> >> >> > > _______________________________________________ > Bioc-sig-sequencing mailing list > [email protected] > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
