Hi Vojtěch,

Here's something you could do. First, make a copy of del.colgapsonly:

toto <- del.colgapsonly

Then, edit this copy (e.g., with fix(toto)), find this line:

    foo <- function(x) sum(x == 4)

and replace 4 by 240. Save and close. Now you can use toto() in the same way than del.colgapsonly(); for instance, to get the number of N's in each site of the woodmouse data:

R> toto(woodmouse, freq.only = TRUE)
  [1]  3  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
....

If you wonder where the values 4 and 240 come from:

R> as.integer(as.DNAbin("-"))
[1] 4
R> as.integer(as.DNAbin("N"))
[1] 240

This gives a lot of possibilities to hack the function. For instance, if you want to find the sites with R/Y, first, get the integer codes:

R> as.integer(as.DNAbin("R"))
[1] 192
R> as.integer(as.DNAbin("Y"))
[1] 48

Then, change the above line for:

    foo <- function(x) sum(x == 192 | x == 48)

HTH

Best,

Emmanuel

Le 27/10/2017 à 17:02, Vojtěch Zeisek a écrit :
Hm, I tried a dirty hack: I exported the DNAbin object using ape::write.dna
and replaced all occurrences of "n" in any sequence by "-" and imported the
file back to R with ape::read.dna. Then I tried the mentioned functions. They
did nothing. When I exported the file to disk, the FASTA file did not contain
any "-", only "n". DO I do something wrong, or is there a bug in ape as it
seems to confuse "n" and "-"?
Sincerely,
V.

Dne pátek 27. října 2017 16:25:02 CEST jste napsal(a):
Hello,
I checked ape::del.colgapsonly, ips::deleteGaps and ips::deleteEmptyCells.
They delete columns containing missing values, but I need also to delete
columns containing base "N" (all columns with amount of Ns over certain
threshold).
Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so it is
suppose to remove columns/rows containing only the given characters, but if
I use it and export data (ape::write.dna or ape::write.nexus.data), some
samples consist only of N characters...
The DNAbin object being processed was originally imported from VCF using
vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
consensus=TRUE, extract.haps=FALSE, unphased_as_NA=FALSE)).
I checked source code of the above functions, but they seem to only count
NAs and then drop respective columns. And as sequences in DNAbin are stored
in binary format, I'm bit struggled here... :(
Any idea how to remove columns with given portion of "N" in sequences?
Sincerely,
V.


_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Reply via email to