Re: [R-sig-phylo] Removing columns containing "N" in DNA alignment

Vojtěch Zeisek Mon, 30 Oct 2017 03:27:08 -0700

Thank You, Emmanuel,
sorry for late reaction, I forge to copy the data to another computer before I 
left at Friday... It is a nice trick, although it returns unchanged alignment. 
:-( It is same regardless which threshold value I choose. I created it using 
vcfR::vcfR2DNAbin, but it looks like any other DNAbin object. With indels 
("-") this function worked well.


minua.dna
117 DNA sequences in binary format stored in a matrix.

All sequences of same length: 11946 

Labels:
M151_1_1
M151_2_4
M152_2_52
M153_1_109
M153_2_276
M153_2_294
...

Base composition:
    a     c     g     t 
0.213 0.280 0.295 0.212 

minua.dna.red <- del.colgapsonly.n(x=minua.dna, threshold=5, freq.only=FALSE)
minua.dna.red
117 DNA sequences in binary format stored in a matrix.

All sequences of same length: 11946 

Labels:
M151_1_1
M151_2_4
M152_2_52
M153_1_109
M153_2_276
M153_2_294
...

Base composition:
    a     c     g     t 
0.213 0.280 0.295 0.212

And yes, the alignment contains various bases...
base.freq(x=minua.dna, all=TRUE)
         a          c          g          t          r          m          w    
      
s          k          y 
0.14505732 0.19109712 0.20145212 0.14439408 0.06586477 0.01543627 0.01724498 
0.01034570 0.01635994 0.06292132 
         v          h          d          b          n          -          ? 
0.00000000 0.00000000 0.00000000 0.00000000 0.12982638 0.00000000 0.00000000

del.colgapsonly.n
function (x, threshold = 1, freq.only = FALSE) 
{
    if (!inherits(x, "DNAbin")) 
        x <- as.DNAbin(x)
    if (!is.matrix(x)) 
        stop("DNA sequences not in a matrix")
    foo <- function(x) sum(x == 240)
    g <- apply(x, 2, foo)
    if (freq.only) 
        return(g)
    i <- which(g/nrow(x) >= threshold)
    if (length(i)) 
        x <- x[, -i]
    x
}
<environment: namespace:ape>

BTW, what about generalize the function like this:

del.colgapsonly.X <- function (x, threshold=1, freq.only=FALSE, char="-") 
{
    if (!inherits(x, "DNAbin")) 
        x <- as.DNAbin(x)
    if (!is.matrix(x)) 
        stop("DNA sequences not in a matrix")
    value <- as.integer(as.DNAbin(char))
    foo <- function(x) sum(x == value)
    g <- apply(x, 2, foo)
    if (freq.only) 
        return(g)
    i <- which(g/nrow(x) >= threshold)
    if (length(i)) 
        x <- x[, -i]
    x
}
:-)
Sincerely,
V.

Dne pátek 27. října 2017 17:44:10 CET jste napsal(a):
> Hi Vojtěch,
> 
> Here's something you could do. First, make a copy of del.colgapsonly:
> 
> toto <- del.colgapsonly
> 
> Then, edit this copy (e.g., with fix(toto)), find this line:
> 
>      foo <- function(x) sum(x == 4)
> 
> and replace 4 by 240. Save and close. Now you can use toto() in the same
> way than del.colgapsonly(); for instance, to get the number of N's in
> each site of the woodmouse data:
> 
> R> toto(woodmouse, freq.only = TRUE)
>    [1]  3  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
> ....
> 
> If you wonder where the values 4 and 240 come from:
> 
> R> as.integer(as.DNAbin("-"))
> [1] 4
> R> as.integer(as.DNAbin("N"))
> [1] 240
> 
> This gives a lot of possibilities to hack the function. For instance, if
> you want to find the sites with R/Y, first, get the integer codes:
> 
> R> as.integer(as.DNAbin("R"))
> [1] 192
> R> as.integer(as.DNAbin("Y"))
> [1] 48
> 
> Then, change the above line for:
> 
>      foo <- function(x) sum(x == 192 | x == 48)
> 
> HTH
> 
> Best,
> 
> Emmanuel
> 
> Le 27/10/2017 à 17:02, Vojtěch Zeisek a écrit :
> > Hm, I tried a dirty hack: I exported the DNAbin object using
> > ape::write.dna
> > and replaced all occurrences of "n" in any sequence by "-" and imported
> > the
> > file back to R with ape::read.dna. Then I tried the mentioned functions.
> > They did nothing. When I exported the file to disk, the FASTA file did
> > not contain any "-", only "n". DO I do something wrong, or is there a bug
> > in ape as it seems to confuse "n" and "-"?
> > Sincerely,
> > V.
> > 
> > Dne pátek 27. října 2017 16:25:02 CEST jste napsal(a):
> >> Hello,
> >> I checked ape::del.colgapsonly, ips::deleteGaps and
> >> ips::deleteEmptyCells.
> >> They delete columns containing missing values, but I need also to delete
> >> columns containing base "N" (all columns with amount of Ns over certain
> >> threshold).
> >> Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it
> >> is
> >> suppose to remove columns/rows containing only the given characters, but
> >> if
> >> I use it and export data (ape::write.dna or ape::write.nexus.data), some
> >> samples consist only of N characters...
> >> The DNAbin object being processed was originally imported from VCF using
> >> vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
> >> consensus=TRUE, extract.haps=FALSE, unphased_as_NA=FALSE)).
> >> I checked source code of the above functions, but they seem to only count
> >> NAs and then drop respective columns. And as sequences in DNAbin are
> >> stored in binary format, I'm bit struggled here... :(
> >> Any idea how to remove columns with given portion of "N" in sequences?
> >> Sincerely,
> >> V.
-- 
Vojtěch Zeisek
https://trapa.cz/

signature.asc
Description: This is a digitally signed message part.

_______________________________________________
R-sig-phylo mailing list - [email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/[email protected]/

Re: [R-sig-phylo] Removing columns containing "N" in DNA alignment

Reply via email to