Hi Vojtech,

Below is a function adapted from the deleteEmptyCells function in the ips package. It works on the proportion of "?" in each column of the alignment, but it can easily be modified along the lines Emmanuel suggested above.
Best,
Matt

DeleteFunkyColumns = function (DNAbin, cutoff=0.3, quiet = FALSE) {
    # TRUE wherever a cell of the alignment holds the "funky" character
    isfunky = function(DNAbin) {
        nsets = "?"
        # as.raw(2) is ape's internal byte code for "?"
        # (as.raw(240) is "n" and as.raw(4) is "-")
        nn = as.raw(2)
        names(nn) = "?"
        nsets = nn[nsets]
        len = dim(DNAbin)[2]
        DNAbin[,1:len] == nsets
    }
    size = dim(DNAbin)
    zztop = isfunky(DNAbin)
    # columns whose proportion of "?" is at or above the cutoff
    colids = which(apply(zztop,2, sum)/dim(DNAbin)[1] >= cutoff)
    if (length(colids)!=0) {
        DNAbin = DNAbin[, -colids]
    } else {
        print("The cutoff is too high, try lowering it, nothing will be done to the alignment.")
    }
    if (!quiet) {
        size = size - dim(DNAbin)
        cat("\n\t", size[2], "columns deleted from alignment, they are:", colids, "\n")
    }
    DNAbin
}

#### example of how to use below
library(ips)  # provides read.phy()
phylip = read.phy("uce-1289.nexus.phylip")
phylip
newphylip = DeleteFunkyColumns(phylip, cutoff=0.37)
newphylip
newphylip = DeleteFunkyColumns(phylip, cutoff=0.25)
newphylip

On Fri, Oct 27, 2017 at 8:52 AM, Vojtěch Zeisek <vo...@trapa.cz> wrote:
> Thank you, Andreas. Yes, I am trying to manipulate an alignment. This is a
> nice trick, although it returns an empty alignment regardless of the
> threshold value used (I do have some data in the alignment :-)...
> Have a nice weekend,
> V.
>
> On Friday, 27 October 2017 17:02:45 CEST you wrote:
> > Hello V.
> > Because you speak of columns I assume you are handling an alignment,
> > right? If you handle an alignment, all sequences have the same length and
> > you can use as.matrix.
> >
> > Like this?
> >
> > library(magrittr)
> > #maximum number of n's
> > thresh <- 0.005 #0.5%
> > seq <- as.matrix(seq)
> > temp <- seq %>% sapply(.,grep,pattern="n") %>% unlist(.,use.names=F) %>% table
> > seq[,-(names(temp)[which(temp/ncol(seq)>thresh)] %>% as.integer)]
> >
> > Greetings,
> > Andreas
> >
> > On 2017-10-27 16:25, Vojtěch Zeisek wrote:
> > > Hello,
> > > I checked ape::del.colgapsonly, ips::deleteGaps and ips::deleteEmptyCells.
> > > They delete columns containing missing values, but I also need to
> > > delete columns containing the base "N" (all columns with a proportion
> > > of Ns over a certain threshold).
> > > Actually, ips::deleteEmptyCells has the option nset=c("-", "n", "?"),
> > > so it is supposed to remove columns/rows containing only the given
> > > characters, but if I use it and export the data (ape::write.dna or
> > > ape::write.nexus.data), some samples consist only of N characters...
> > > The DNAbin object being processed was originally imported from VCF
> > > using vcfR (read.vcfR(file="my.vcf") and converted:
> > > vcfR2DNAbin(x=myvcf, consensus=TRUE, extract.haps=FALSE,
> > > unphased_as_NA=FALSE)).
> > > I checked the source code of the above functions, but they seem to
> > > only count NAs and then drop the respective columns. And as sequences
> > > in DNAbin are stored in binary format, I'm struggling a bit here... :(
> > > Any idea how to remove columns with a given proportion of "N" in the
> > > sequences?
> > > Sincerely,
> > > V.
> --
> Vojtěch Zeisek
> https://trapa.cz/en/
>
> Department of Botany, Faculty of Science
> Charles University, Prague, Czech Republic
> https://www.natur.cuni.cz/biology/botany/
>
> Institute of Botany, Czech Academy of Sciences
> Průhonice, Czech Republic
> http://www.ibot.cas.cz/en/
>
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at http://www.mail-archive.com/r-sig-ph...@r-project.org/
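
P.S. For the "N" case asked about in the quoted message, the same idea can be sketched without the helper function. This is only a minimal sketch, not a tested replacement for the function above: the toy alignment and the names aln, n_frac, cutoff, keep and trimmed are made up for illustration, and it assumes the alignment is a DNAbin matrix in which ape stores "n" as as.raw(240).

```r
library(ape)

# Hypothetical toy alignment: 4 sequences x 5 sites (not data from this thread).
aln <- as.DNAbin(rbind(
  s1 = c("a", "n", "c", "n", "t"),
  s2 = c("a", "n", "c", "g", "t"),
  s3 = c("a", "n", "c", "g", "n"),
  s4 = c("a", "c", "c", "g", "t")
))

# Proportion of "n" in each column.
# Assumption: as.raw(240) is ape's internal byte code for "n".
n_frac <- colMeans(unclass(aln) == as.raw(240))

# Keep only the columns whose proportion of "n" is below the cutoff,
# mirroring the ">= cutoff means delete" rule used in DeleteFunkyColumns.
cutoff <- 0.5
keep <- which(n_frac < cutoff)
trimmed <- aln[, keep]

dim(trimmed)
```

Working on unclass(aln) keeps the comparison on the plain raw matrix, so it does not depend on how comparison operators behave for the DNAbin class. An alignment stored as a list of sequences would first need as.matrix(), which only makes sense when all sequences have the same length.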