Dear Vojtech,
my code snippet works, and the other solutions were also correct. The problem is with your alignment. Your 3rd sequence only contains N's, therefore whatever approach you use, all coloumns are filtered out if you filter all coloumns containing N. Please open your alignment in a text editor or use image.DNAbin(youralignment) to visualise your alignment before you process it in any way.
Best wishes,
Andreas

Am 2017-10-30 11:26, schrieb Vojtěch Zeisek:
Thank You, Emmanuel,
sorry for late reaction, I forge to copy the data to another computer before I left at Friday... It is a nice trick, although it returns unchanged alignment. :-( It is same regardless which threshold value I choose. I created it using vcfR::vcfR2DNAbin, but it looks like any other DNAbin object. With indels
("-") this function worked well.

minua.dna
117 DNA sequences in binary format stored in a matrix.

All sequences of same length: 11946

Labels:
M151_1_1
M151_2_4
M152_2_52
M153_1_109
M153_2_276
M153_2_294
...

Base composition:
    a     c     g     t
0.213 0.280 0.295 0.212

minua.dna.red <- del.colgapsonly.n(x=minua.dna, threshold=5, freq.only=FALSE)
minua.dna.red
117 DNA sequences in binary format stored in a matrix.

All sequences of same length: 11946

Labels:
M151_1_1
M151_2_4
M152_2_52
M153_1_109
M153_2_276
M153_2_294
...

Base composition:
    a     c     g     t
0.213 0.280 0.295 0.212

And yes, the alignment contains various bases...
base.freq(x=minua.dna, all=TRUE)
         a          c          g          t          r          m
    w
s          k          y
0.14505732 0.19109712 0.20145212 0.14439408 0.06586477 0.01543627 0.01724498
0.01034570 0.01635994 0.06292132
v h d b n - ? 0.00000000 0.00000000 0.00000000 0.00000000 0.12982638 0.00000000 0.00000000

del.colgapsonly.n
function (x, threshold = 1, freq.only = FALSE)
{
    if (!inherits(x, "DNAbin"))
        x <- as.DNAbin(x)
    if (!is.matrix(x))
        stop("DNA sequences not in a matrix")
    foo <- function(x) sum(x == 240)
    g <- apply(x, 2, foo)
    if (freq.only)
        return(g)
    i <- which(g/nrow(x) >= threshold)
    if (length(i))
        x <- x[, -i]
    x
}
<environment: namespace:ape>

BTW, what about generalize the function like this:

del.colgapsonly.X <- function (x, threshold=1, freq.only=FALSE, char="-")
{
    if (!inherits(x, "DNAbin"))
        x <- as.DNAbin(x)
    if (!is.matrix(x))
        stop("DNA sequences not in a matrix")
    value <- as.integer(as.DNAbin(char))
    foo <- function(x) sum(x == value)
    g <- apply(x, 2, foo)
    if (freq.only)
        return(g)
    i <- which(g/nrow(x) >= threshold)
    if (length(i))
        x <- x[, -i]
    x
}
:-)
Sincerely,
V.

Dne pátek 27. října 2017 17:44:10 CET jste napsal(a):
Hi Vojtěch,

Here's something you could do. First, make a copy of del.colgapsonly:

toto <- del.colgapsonly

Then, edit this copy (e.g., with fix(toto)), find this line:

     foo <- function(x) sum(x == 4)

and replace 4 by 240. Save and close. Now you can use toto() in the same
way than del.colgapsonly(); for instance, to get the number of N's in
each site of the woodmouse data:

R> toto(woodmouse, freq.only = TRUE)
   [1]  3  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
....

If you wonder where the values 4 and 240 come from:

R> as.integer(as.DNAbin("-"))
[1] 4
R> as.integer(as.DNAbin("N"))
[1] 240

This gives a lot of possibilities to hack the function. For instance, if
you want to find the sites with R/Y, first, get the integer codes:

R> as.integer(as.DNAbin("R"))
[1] 192
R> as.integer(as.DNAbin("Y"))
[1] 48

Then, change the above line for:

     foo <- function(x) sum(x == 192 | x == 48)

HTH

Best,

Emmanuel

Le 27/10/2017 à 17:02, Vojtěch Zeisek a écrit :
> Hm, I tried a dirty hack: I exported the DNAbin object using
> ape::write.dna
> and replaced all occurrences of "n" in any sequence by "-" and imported
> the
> file back to R with ape::read.dna. Then I tried the mentioned functions.
> They did nothing. When I exported the file to disk, the FASTA file did
> not contain any "-", only "n". DO I do something wrong, or is there a bug
> in ape as it seems to confuse "n" and "-"?
> Sincerely,
> V.
>
> Dne pátek 27. října 2017 16:25:02 CEST jste napsal(a):
>> Hello,
>> I checked ape::del.colgapsonly, ips::deleteGaps and
>> ips::deleteEmptyCells.
>> They delete columns containing missing values, but I need also to delete
>> columns containing base "N" (all columns with amount of Ns over certain
>> threshold).
>> Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it
>> is
>> suppose to remove columns/rows containing only the given characters, but
>> if
>> I use it and export data (ape::write.dna or ape::write.nexus.data), some
>> samples consist only of N characters...
>> The DNAbin object being processed was originally imported from VCF using
>> vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
>> consensus=TRUE, extract.haps=FALSE, unphased_as_NA=FALSE)).
>> I checked source code of the above functions, but they seem to only count
>> NAs and then drop respective columns. And as sequences in DNAbin are
>> stored in binary format, I'm bit struggled here... :(
>> Any idea how to remove columns with given portion of "N" in sequences?
>> Sincerely,
>> V.
_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Reply via email to