Thank You, dear Andreas,
yes, I should notice this myself, my fault. Indeed, the nearly empty sample 
was problematic, although, I wish to filter columns with N over certain 
proportion. After manual edition it works. I just still wonder if there was 
something related to the conversion from VCF...
Thank You all for Your help,
V.

Dne pondělí 30. října 2017 12:03:05 CET, Andreas Kolter napsal(a):
> Dear Vojtech,
> my code snippet works, and the other solutions were also correct. The
> problem is with your alignment. Your 3rd sequence only contains N's,
> therefore whatever approach you use, all coloumns are filtered out if
> you filter all coloumns containing N.
> Please open your alignment in a text editor or use
> image.DNAbin(youralignment) to visualise your alignment before you
> process it in any way.
> Best wishes,
> Andreas
> 
> Am 2017-10-30 11:26, schrieb Vojtěch Zeisek:
> > Thank You, Emmanuel,
> > sorry for late reaction, I forge to copy the data to another computer
> > before I
> > left at Friday... It is a nice trick, although it returns unchanged
> > alignment.
> > :-( It is same regardless which threshold value I choose. I created it
> > using vcfR::vcfR2DNAbin, but it looks like any other DNAbin object. With
> > indels ("-") this function worked well.
> > 
> > minua.dna
> > 117 DNA sequences in binary format stored in a matrix.
> > 
> > All sequences of same length: 11946
> > 
> > Labels:
> > M151_1_1
> > M151_2_4
> > M152_2_52
> > M153_1_109
> > M153_2_276
> > M153_2_294
> > ...
> > 
> > Base composition:
> >     a     c     g     t
> > 
> > 0.213 0.280 0.295 0.212
> > 
> > minua.dna.red <- del.colgapsonly.n(x=minua.dna, threshold=5,
> > freq.only=FALSE)
> > minua.dna.red
> > 117 DNA sequences in binary format stored in a matrix.
> > 
> > All sequences of same length: 11946
> > 
> > Labels:
> > M151_1_1
> > M151_2_4
> > M152_2_52
> > M153_1_109
> > M153_2_276
> > M153_2_294
> > ...
> > 
> > Base composition:
> >     a     c     g     t
> > 
> > 0.213 0.280 0.295 0.212
> > 
> > And yes, the alignment contains various bases...
> > base.freq(x=minua.dna, all=TRUE)
> > 
> >          a          c          g          t          r          m
> >     
> >     w
> > 
> > s          k          y
> > 0.14505732 0.19109712 0.20145212 0.14439408 0.06586477 0.01543627
> > 0.01724498
> > 0.01034570 0.01635994 0.06292132
> > 
> >          v          h          d          b          n          -
> >    
> >    ?
> > 
> > 0.00000000 0.00000000 0.00000000 0.00000000 0.12982638 0.00000000
> > 0.00000000
> > 
> > del.colgapsonly.n
> > function (x, threshold = 1, freq.only = FALSE)
> > {
> >     if (!inherits(x, "DNAbin"))
> >         x <- as.DNAbin(x)
> >     if (!is.matrix(x))
> >         stop("DNA sequences not in a matrix")
> >     foo <- function(x) sum(x == 240)
> >     g <- apply(x, 2, foo)
> >     if (freq.only)
> >         return(g)
> >     i <- which(g/nrow(x) >= threshold)
> >     if (length(i))
> >         x <- x[, -i]
> >     x
> > }
> > <environment: namespace:ape>
> > 
> > BTW, what about generalize the function like this:
> > 
> > del.colgapsonly.X <- function (x, threshold=1, freq.only=FALSE,
> > char="-")
> > {
> >     if (!inherits(x, "DNAbin"))
> >         x <- as.DNAbin(x)
> >     if (!is.matrix(x))
> >         stop("DNA sequences not in a matrix")
> >     value <- as.integer(as.DNAbin(char))
> >     foo <- function(x) sum(x == value)
> >     g <- apply(x, 2, foo)
> >     if (freq.only)
> >         return(g)
> >     i <- which(g/nrow(x) >= threshold)
> >     if (length(i))
> >         x <- x[, -i]
> >     x
> > }
> > :-)
> > Sincerely,
> > V.
> > 
> > Dne pátek 27. října 2017 17:44:10 CET jste napsal(a):
> >> Hi Vojtěch,
> >> 
> >> Here's something you could do. First, make a copy of del.colgapsonly:
> >> 
> >> toto <- del.colgapsonly
> >> 
> >> Then, edit this copy (e.g., with fix(toto)), find this line:
> >>      foo <- function(x) sum(x == 4)
> >> 
> >> and replace 4 by 240. Save and close. Now you can use toto() in the
> >> same
> >> way than del.colgapsonly(); for instance, to get the number of N's in
> >> each site of the woodmouse data:
> >> 
> >> R> toto(woodmouse, freq.only = TRUE)
> >> 
> >>    [1]  3  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
> >> 
> >> ....
> >> 
> >> If you wonder where the values 4 and 240 come from:
> >> 
> >> R> as.integer(as.DNAbin("-"))
> >> [1] 4
> >> R> as.integer(as.DNAbin("N"))
> >> [1] 240
> >> 
> >> This gives a lot of possibilities to hack the function. For instance,
> >> if you want to find the sites with R/Y, first, get the integer codes:
> >> 
> >> R> as.integer(as.DNAbin("R"))
> >> [1] 192
> >> R> as.integer(as.DNAbin("Y"))
> >> [1] 48
> >> 
> >> Then, change the above line for:
> >>      foo <- function(x) sum(x == 192 | x == 48)
> >> 
> >> HTH
> >> 
> >> Best,
> >> 
> >> Emmanuel
> >> 
> >> Le 27/10/2017 à 17:02, Vojtěch Zeisek a écrit :
> >> > Hm, I tried a dirty hack: I exported the DNAbin object using
> >> > ape::write.dna
> >> > and replaced all occurrences of "n" in any sequence by "-" and imported
> >> > the file back to R with ape::read.dna. Then I tried the mentioned
> >> > functions.
> >> > They did nothing. When I exported the file to disk, the FASTA file did
> >> > not contain any "-", only "n". DO I do something wrong, or is there a
> >> > bug in ape as it seems to confuse "n" and "-"?
> >> > Sincerely,
> >> > V.
> >> > 
> >> > Dne pátek 27. října 2017 16:25:02 CEST jste napsal(a):
> >> >> Hello,
> >> >> I checked ape::del.colgapsonly, ips::deleteGaps and
> >> >> ips::deleteEmptyCells.
> >> >> They delete columns containing missing values, but I need also to
> >> >> delete
> >> >> columns containing base "N" (all columns with amount of Ns over
> >> >> certain threshold).
> >> >> Actually, ips::deleteEmptyCells has option nset=c("-", "n", "?"), so
> >> >> it is
> >> >> suppose to remove columns/rows containing only the given characters,
> >> >> but if
> >> >> I use it and export data (ape::write.dna or ape::write.nexus.data),
> >> >> some samples consist only of N characters...
> >> >> The DNAbin object being processed was originally imported from VCF
> >> >> using
> >> >> vcfR (read.vcfR(file="my.vcf") and converted: vcfR2DNAbin(x=myvcf,
> >> >> consensus=TRUE, extract.haps=FALSE, unphased_as_NA=FALSE)).
> >> >> I checked source code of the above functions, but they seem to only
> >> >> count
> >> >> NAs and then drop respective columns. And as sequences in DNAbin are
> >> >> stored in binary format, I'm bit struggled here... :(
> >> >> Any idea how to remove columns with given portion of "N" in sequences?
> >> >> Sincerely,
> >> >> V.
-- 
Vojtěch Zeisek
https://trapa.cz/

Attachment: signature.asc
Description: This is a digitally signed message part.

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Reply via email to