Re: [Bioc-sig-seq] Biostrings: problem to access indel-details form pairwiseAlignment()

Patrick Aboyoun Fri, 24 Jul 2009 07:55:55 -0700

Wolfgang,

If I'm understanding the situation correctly, thestart(indel(pattern())) operation returns the positions in the patternwhere there is a gap and the length of those gaps, thestart(indel(subject())) function returns the positions in the subjectwhere there is a gap and the length of those gap and what you would liketo do is map this information back to the subject to see where thedeletions occurred. I don't recall adding a function to do that, but Ican see it being useful. Here is a somewhat kludgey method of gettingthe deletion locations via manipulation of output of the alignedfunction (there may be a cleaner way of doing this with existing tools,but this is the first idea that came to mind):


> aligned(align)
A DNAStringSet instance of length 3
  width seq
[1]    36 GGGATAGTGACTACAGGATCCAGCTCTTCGCCTGGC
[2]    36 ---ATAGTGACGT-AGGATCCAGCTCTTCGCC-GGC
[3]    36 GGGA--GTGACTTCAGGATCCAG-TCTTCGCCTGGC

> deleteRanges <- vmatchPattern("-", aligned(align))
> deleteRanges
MIndex object of length 3
> deleteRanges <- as(deleteRanges, "CompressedIRangesList")
> deleteRanges
CompressedIRangesList: 3 elements
> deleteRanges <- lapply(deleteRanges, asNormalIRanges)
> deleteRanges
[[1]]
NormalIRanges instance:
[1] start end   width
<0 rows> (or 0-length row.names)

[[2]]
NormalIRanges instance:
  start end width
[1]     1   3     3
[2]    14  14     1
[3]    33  33     1

[[3]]
NormalIRanges instance:
  start end width
[1]     5   6     2
[2]    24  24     1

> for(i in 1:3) print(Views(mySubject, deleteRanges[[i]]))
Views on a 36-letter DNAString subject
subject: GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC
views: NONE
Views on a 36-letter DNAString subject
subject: GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC
views:
  start end width
[1]     1   3     3 [GGG]
[2]    14  14     1 [C]
[3]    33  33     1 [T]
Views on a 36-letter DNAString subject
subject: GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC
views:
  start end width
[1]     5   6     2 [TA]
[2]    24  24     1 [C]


Wolfgang Raffelsberger wrote:

Patrick,

thank for your quick response !
For a while things went well but now I got across a case where I havedifficulty interpreting the number(s) given for start-positions ofdeletions. This is when there is a preceding insertion before adeletion occurs/ gets encountered.Of course I could correct this, ie shift the values myself accordingto the number preceding inserted nucletides, but I posted this sinceI'm not sure if you're aware of this special case/problem.Or maybe you have some other command/way to access (ie extract) thesequence(s) deleted ?
(Note that the equivalent approach for extracting insertions workswithout any problems :))
Wolfgang

> suppressMessages(library(Biostrings))
> mySubject <- ref1 <- DNAString("GGGATAGTGACTTCAGGATCCAGCTCTTCGCCTGGC")
> myPattern <- samp1 <-DNAStringSet(c("GGGATAGTGACTACAGGATCCAGCTCTTCGCCTGGC","ATAGTGACGTAGGATCACAGCTCTTCGCCGGC","TTGGGAGTGACTTGCAGGATCCAGTCTTCGCCTGGCAT"))> ## 1st has a mutation; 2nd: {5'del+ mut(g)+ del(C)+ del(T)+ ins(A)}; 3rd: {5'ins + del(TA)+ ins(G)+ del(C)+ 3'ins}
> ##  multiple alignment of test-case (gapOpening= -4, gapExtension= -2)
> ##     GGGATAGTGACTT-CAGGATC-CAGCTCTTCGCCTGGC    .. ref = subject
> ## GGGATAGTGACTa-CAGGATC-CAGCTCTTCGCCTGGC .. (1) mutation (T-> A)> ## ATAGTGACgT--AGGATCACAGCTCTTCGCC-GGC .. (2) 5'del +mut(g) + del(C)+ ins(A)+ del(T)> ## ttGGGA--GTGACTTGCAGGATC-CAG-TCTTCGCCTGGCat .. (3) 5'ins +del(TA)+ ins(G)+ del(C)+ 3'ins
>
> align <- pairwiseAlignment(myPattern,mySubject,gapOpening= -4,gapExtension= -2)
>
> nindel(align)                             # no of indels per seq
An object of class "InDel"
Slot "insertion":
   Length WidthSum
[1,]      0        0
[2,]      1        1
[3,]      1        1

Slot "deletion":
   Length WidthSum
[1,]      0        0
[2,]      2        2
[3,]      2        3

> (delStart <- start(indel(pattern(align))))
[1] 11 30  5 23
> (delEnd <- end(indel(pattern(align))))
[1] 11 30  6 23
>
> (seqNo_wDel <- rep.int(1:length(align),elementLengths(indel(pattern(align))))) # sequence-number for eachinstance of deletion
[1] 2 2 3 3
>
> for(i in 1:length(seqNo_wDel))print(substr(as.character(unaligned(subject(align[seqNo_wDel[i]]))),delStart[i],delEnd[i]))
[1] "C"
[1] "G"
[1] "TA"
[1] "G"
## however the "true" deleted ones are : C, T, TA, C => problem withdeletions after preceding insertion (ie ech 2nd case)> for(i in 1:length(seqNo_wDel)) print(substr(as.character(aligned(pattern(align[seqNo_wDel[i]]))),delStart[i],delEnd[i]))
[1] "-"
[1] "C"
[1] "--"
[1] "A"
> ## ==> problem with pôsition of deletions after preceding insertion(ie ech 2nd case)
>
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-07-22 r48979)
x86_64-unknown-linux-gnu

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods baseother attached packages:[1] Biostrings_2.13.27 IRanges_1.3.42 loaded via a namespace (and notattached):
[1] Biobase_2.5.4
>




Patrick Aboyoun a écrit :
Wolfgang,
The IRanges infrastructure has evolved greatly from BioC 2.4/R-2.9 toBioC 2.5/R-2.10 and I highly recommend you use accessor functions,rather than direct slot access to obtain the information you arelooking for. For example, the IRangesList class definition haschanged greatly and the @elements slot is no longer present.
Question: how to transform an IRangesList object to a list of IRangesobject
Answer:
Short answer, use as.list as in as.list(indel(subject(align))). Longanswer, this is typically not necessary and not desired given thatIRangesList has many methods defined for them. See man pages forIRangesList, RangesList, and Sequence (if you are on BioC2.5/R-2.10). If you can tell me what operations you would like toperform, I can point you in the right direction.
Question: how to the the start, width, and end locations of apairwise alignment.
Answer: use the start, width and end methods.
> start(subject(align))
[1] 1 1 4
> width(subject(align))
[1] 24 24 19
> end(subject(align))
[1] 24 24 22


Patrick


Wolfgang Raffelsberger wrote:
Patrick,

thank you very much for your quick and helpful answer !

Yes, using  :
> align <- pairwiseAlignment(samp1,ref1)
> indel(subject(align))
I'm about to get what I'm looking for. Now my question is, whichcommands are (will be) availabel for mining an IRangesList-object.
Most of all I'm interested in what would correspond to getting :

> indel(subject(align))@elements
> subject(align)@ra...@start
> subject(align)@ra...@witdth # in fact, so far I can do withoutthis one
(unless you think the @elements, and @range won't change in thefuture ...)With these elements I manage now to extract the very nucleotidesthat were inserted/deleted.
Wolfgang


Patrick Aboyoun a écrit :
Wolfgang,
Below is code that retrieves the indel locations you are lookingfor. I like your attempts at using indel, insertion, and deletionfor PairwiseAlignment objects and I'll add the methods forPairwiseAlignment objects to BioC 2.5 (devel) shortly using theconventions that I specify below.
> suppressMessages(library(Biostrings))
> ref1 <- DNAString("GGGATACTTCACCAGCTCCCTGGC") # my pattern
> samp1 <-DNAStringSet(c("GGGATACTACACCAGCTCCCTGGC","GGGATACTTACACCAGCTCCCTGGC","ATACTTCACCAGCTCCCTG"))> # 1st has a mutation, 2nd has an insertion, the 3rd is simplyshorter ...
>
> align <- pairwiseAlignment(samp1,ref1)
>
> nindel(align)
An object of class “InDel”
Slot "insertion":
Length WidthSum
[1,] 0 0
[2,] 1 1
[3,] 0 0

Slot "deletion":
Length WidthSum
[1,] 0 0
[2,] 0 0
[3,] 0 0

> deletions <- indel(pattern(align))
> deletions
CompressedIRangesList: 3 elements
> insertions <- indel(subject(align))
> insertions
CompressedIRangesList: 3 elements
> insertions[[2]]
IRanges instance:
start end width
[1] 10 10 1
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Biostrings_2.13.26 IRanges_1.3.41

loaded via a namespace (and not attached):
[1] Biobase_2.5.4


Wolfgang Raffelsberger wrote:
Dear list,
previously I've been extracting indel-information from sequencesaligned by the Biostrings function pairwiseAlignment(), which isprobably not the best way since the class'PairwiseAlignedFixedSubject' has evoled & changed and my old codewon't work any more. Now trying to use the library-providedfunctions to access the information/details about indels (ie theirlocalization on the pattern and possibly the indel sequence ).However, I can't find a function to extract this information, thatis (to the best of my knowledge) part of the aligned object.
## here an example :
library(Biostrings)
ref1 <- DNAString("GGGATACTTCACCAGCTCCCTGGC") # my pattern
samp1 <-DNAStringSet(c("GGGATACTACACCAGCTCCCTGGC","GGGATACTTACACCAGCTCCCTGGC","ATACTTCACCAGCTCCCTG"))# 1st has a mutation, 2nd has an insertion, the 3rd is simplyshorter ...
align <- pairwiseAlignment(samp1,ref1)
nindel(align) # insertion was found properly but I can't see atwhich nt position the indel was found (neither if it's aninsertion or deletion)indel(align) # Error in function (classes, fdef, mtable) unable tofind an inherited method for function...insertion(align) # Error in function (classes, fdef, mtable)unable to find an inherited method for function ...
deletion(align) # neither ...
?AlignedXStringSet # says under 'Accessor methods' that indel()exists ..
## ideally I'd be looking for something like
mismatchTable(align) # but addressing indels ...


## for completeness :
> sessionInfo()
R version 2.9.1 (2009-06-26)
i386-pc-mingw32

locale:
LC_COLLATE=French_France.1252;LC_CTYPE=French_France.1252;LC_MONETARY=French_France.1252;LC_NUMERIC=C;LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3Biostrings_2.12.7 IRanges_1.2.3
loaded via a namespace (and not attached):
[1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1

Thank's in advance,
Wolfgang Raffelsberger
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Wolfgang Raffelsberger, PhD
Laboratoire de BioInformatique et Génomique Intégratives
CNRS UMR7104, IGBMC, 1 rue Laurent Fries, 67404 IllkirchStrasbourg, France
Tel (+33) 388 65 3300         Fax (+33) 388 65 3276
wolfgang.raffelsberger (at) igbmc.fr

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Biostrings: problem to access indel-details form pairwiseAlignment()

Reply via email to