Hi Vojtěch,

Below is a function modified from the deleteEmptyCells function in the ips
package. It works on the proportion of "?" in each alignment column, but it
can easily be modified using the suggestions by Emmanuel above.

Best,
Matt


DeleteFunkyColumns = function(DNAbin, cutoff = 0.3, quiet = FALSE) {

  # Logical matrix: TRUE where a cell holds "?" (stored as raw 02 in DNAbin)
  isfunky = function(DNAbin) {
    nsets = "?"
    nn = as.raw(2)
    names(nn) = "?"
    nsets = nn[nsets]
    len = dim(DNAbin)[2]
    DNAbin[, 1:len] == nsets
  }

  size = dim(DNAbin)
  zztop = isfunky(DNAbin)

  # Columns whose proportion of "?" is at or above the cutoff
  colids = which(apply(zztop, 2, sum) / dim(DNAbin)[1] >= cutoff)

  if (length(colids) != 0) {
    DNAbin = DNAbin[, -colids]
  } else {
    print("The cutoff is too high, try lowering it, nothing will be done to the alignment.")
  }

  if (!quiet) {
    size = size - dim(DNAbin)
    cat("\n\t", size[2], "columns deleted from alignment, they are:", colids)
  }

  DNAbin
}


#### Example of how to use it

library(ips)  # provides read.phy()

phylip = read.phy("uce-1289.nexus.phylip")
phylip

newphylip = DeleteFunkyColumns(phylip, cutoff = 0.37)
newphylip

newphylip = DeleteFunkyColumns(phylip, cutoff = 0.25)
newphylip
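
If it helps, here is a minimal sketch of how the same idea might be generalised to other characters such as "n". It is only an illustration under some assumptions: the function name drop_columns_by_char and its arguments are hypothetical (not part of ape or ips), and it works on a plain character matrix rather than on the raw DNAbin coding. With ape you could presumably get such a matrix with as.character() and convert the result back with as.DNAbin().

```r
# Hypothetical helper, not part of ape or ips: drop alignment columns in
# which the proportion of the given characters reaches the cutoff.
# 'aln' is assumed to be a character matrix (rows = sequences, cols = sites).
drop_columns_by_char <- function(aln, chars = c("n", "?"), cutoff = 0.3) {
  stopifnot(is.matrix(aln), is.character(aln))
  # TRUE where a cell matches one of 'chars' (case-insensitive)
  hits <- matrix(tolower(aln) %in% tolower(chars), nrow = nrow(aln))
  # Proportion of matching cells in each column
  prop <- colMeans(hits)
  keep <- prop < cutoff
  # drop = FALSE keeps the matrix shape even if only one column remains
  aln[, keep, drop = FALSE]
}

# Toy alignment: 4 sequences, 5 sites
aln <- rbind(
  s1 = c("a", "n", "c", "?", "g"),
  s2 = c("a", "n", "c", "t", "g"),
  s3 = c("a", "c", "n", "t", "g"),
  s4 = c("a", "n", "c", "t", "n")
)

# Only site 2 has at least 50% of "n"/"?" (3 of 4), so it is removed
filtered <- drop_columns_by_char(aln, chars = c("n", "?"), cutoff = 0.5)
filtered
```

As in the function above, a column is dropped when the proportion is greater than or equal to the cutoff.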




On Fri, Oct 27, 2017 at 8:52 AM, Vojtěch Zeisek <vo...@trapa.cz> wrote:

> Thank you,
> Andreas, yes, I am trying to manipulate an alignment. This is a nice trick,
> although it returns an empty alignment regardless of the threshold value
> used (I do have some data in the alignment :-)...
> Have a nice weekend,
> V.
>
> On Friday, 27 October 2017 17:02:45 CEST, you wrote:
> > Hello V.
> > Because you speak of columns I assume you are handling an alignment,
> > right? If you handle an alignment all sequences have the same length and
> > you can do as.matrix
> >
> > Like this?
> >
> > library(magrittr)
> > #maximum number of n's
> > thresh <- 0.005  #0.5%
> > seq <- as.matrix(seq)
> > temp <- seq %>% sapply(.,grep,pattern="n") %>% unlist(.,use.names=F) %>%
> > table
> > seq[,-(names(temp)[which(temp/ncol(seq)>thresh)] %>% as.integer)]
> >
> > Greetings,
> > Andreas
> >
> > On 2017-10-27 16:25, Vojtěch Zeisek wrote:
> > > Hello,
> > > I checked ape::del.colgapsonly, ips::deleteGaps and
> > > ips::deleteEmptyCells.
> > > They delete columns containing missing values, but I also need to
> > > delete columns containing the base "N" (all columns where the
> > > proportion of Ns exceeds a certain threshold).
> > > Actually, ips::deleteEmptyCells has the option nset=c("-", "n", "?"),
> > > so it is supposed to remove columns/rows containing only the given
> > > characters, but if I use it and export the data (ape::write.dna or
> > > ape::write.nexus.data), some samples consist only of N characters...
> > > The DNAbin object being processed was originally imported from VCF
> > > using vcfR (read.vcfR(file="my.vcf")) and converted with
> > > vcfR2DNAbin(x=myvcf, consensus=TRUE, extract.haps=FALSE,
> > > unphased_as_NA=FALSE).
> > > I checked the source code of the above functions, but they seem to
> > > only count NAs and then drop the respective columns. And as sequences
> > > in DNAbin are stored in binary format, I'm struggling a bit here... :(
> > > Any idea how to remove columns with a given proportion of "N" in the
> > > sequences?
> > > Sincerely,
> > > V.
> --
> Vojtěch Zeisek
> https://trapa.cz/en/
>
> Department of Botany, Faculty of Science
> Charles University, Prague, Czech Republic
> https://www.natur.cuni.cz/biology/botany/
>
> Institute of Botany, Czech Academy of Sciences
> Průhonice, Czech Republic
> http://www.ibot.cas.cz/en/
>
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at http://www.mail-archive.com/r-
> sig-ph...@r-project.org/
>
