[R-sig-phylo] Extracting only FASTA sequences from a FASTQ file
Hello, I have a FASTQ file from which I would like to extract only the FASTA sequences (and not the associated PHRED quality scores). For instance, using the file mentioned in the R documentation for read.fastq(): library(ape) a <- " https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/ " file <- "SRR062641.filt.fastq.gz" URL <- paste0(a, file) download.file(URL, file) fastq <- read.fastq(file) # import FASTQ file One can easily extract the quality scores via qual <- attr(fastq, "QUAL") but there doesn't appear to be an attribute associated with the DNA sequences themselves (without the scores appended); only a "class" attribute, which evaluates to "DNAbin", is available. I've examined the underlying code for read.fastq() and there is a variable called DNA that is created, but it's not what I need: function (file, offset = -33) { Z <- scan(file, "", sep = "\n", quiet = TRUE) tmp <- Z[c(TRUE, TRUE, FALSE, FALSE)] sel <- c(TRUE, FALSE) tmp[sel] <- gsub("^@", ">", tmp[sel]) fl <- tempfile() cat(tmp, file = fl, sep = "\n") DNA <- read.FASTA(fl) tmp <- Z[c(FALSE, FALSE, FALSE, TRUE)] QUAL <- lapply(tmp, function(x) as.integer(charToRaw(x))) if (offset) QUAL <- lapply(QUAL, "+", offset) names(QUAL) <- names(DNA) attr(DNA, "QUAL") <- QUAL DNA } I then tried the following: x <- as.character.DNAbin(fastq) # extract DNA sequences y <- as.alignment(x) seqs <- y$seq but 'seqs' can't be written to a file using write.FASTA(). Any assistance is greatly appreciated. Thanks! Cheers, Jarrett [[alternative HTML version deleted]] ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
Re: [R-sig-phylo] Generating a matrix of interspecfic genetic distances
Hi Wolfgang, Apologies for not previously linking to the mailing list. I believe the below code gets me what I am in need of. res <- outer(Species, Species, "==") # logical matrix where off-diagonal elements are all FALSE is.na(no_outliers_dist_matrix) <- res # set all intraspecific distances to NA, leaving only interspecific values As a result, no_outliers_dist_matrix now contains only interspecific distances. Note, no_outliers_dist_matrix is lower triangular since I first ran the code dists[upper.tri(dists, diag = FALSE)] <- NA # replace upper triangle with NA and then subset to get lower triangular matrix Thanks. Cheers, Jarrett On Sun, Feb 5, 2023 at 12:50 PM Viechtbauer, Wolfgang (NP) < wolfgang.viechtba...@maastrichtuniversity.nl> wrote: > Please also reply to the list, not just to me. > > Unfortunately, I still don't get it. Can you show what you want the > *final* matrix to look like for this specific example? > > Best, > Wolfgang > > >-Original Message- > >From: Jarrett Phillips [mailto:phillipsjarre...@gmail.com] > >Sent: Sunday, 05 February, 2023 18:30 > >To: Viechtbauer, Wolfgang (NP) > >Subject: Re: [R-sig-phylo] Generating a matrix of interspecfic genetic > distances > > > >Thanks Wolfgang, > > > >‘dists’ contains both intraspecific and interspecific comparisons. The > ‘inter’ > >variable extracts interspecific values from ‘dists’, but returns a vector > instead > >of a matrix. > > > >I’m wondering how (if it is possible) I can return ‘inter’ as a matrix. > The > >result likely will not be a square matrix, which is okay. > > > >As mentioned in my previous email, as.matrix(inter) simply returns a > matrix with > >a single column, so destroys the block structure of matrices, and also > drops > >relevant row and column names found in ‘dists’. > > > >‘res’ is used to return all elements of ‘dists’ except the main diagonal. > > > >If things are still unclear, please let me know. > > > >Cheers, > > > >Jarrett > > > >On Sun, Feb 5, 2023 at 6:10 AM Viechtbauer, Wolfgang (NP) > > wrote: > >Dear Jarrett, > > > >I am not sure what it is that you are trying to accomplish with the last > two > >lines. Isn't 'dists' already what you want? If you just want to set the > diagonal > >of 'dists' to NA, then 'diag(dists) <- NA' would do that. But maybe I am > missing > >the point here entirely. > > > >Best, > >Wolfgang > > > >>-Original Message- > >>From: R-sig-phylo [mailto:r-sig-phylo-boun...@r-project.org] On Behalf > Of > >Jarrett > >>Phillips > >>Sent: Sunday, 05 February, 2023 8:31 > >>To: r-sig-phylo@r-project.org > >>Subject: [R-sig-phylo] Generating a matrix of interspecfic genetic > distances > >> > >>Hi All, > >> > >>I am using the ape R package to extract interspecific distances from a > >>pairwise distance matrix. > >> > >>Consider the woodmouse example. > >> > >>library(ape) > >> > >>data(woodmouse) > >> > >>dists <- dist.dna(woodmouse, model = "raw", as.matrix = TRUE) > >>isSymmetric(dists) # TRUE > >>dists[upper.tri(dists, diag = FALSE)] <- NA # replace upper triangle > >>with NA and then subset to get lower triangular matrix > >>isSymmetric(dists) # FALSE > >>dists # this is a matrix > >> > >>labs <- labels(woodmouse) # get species names > >> > >>res <- outer(labs, labs, "==") # get cases where matrix elements are > >>equal > >>inter <- na.omit(dists[!res]) # get interspecific distances by taking > >>elements that are not equal > >> > >>The problem I am facing is that I need 'inter' to be a matrix instead of > a > >>vector, so that I can tell which distances belong to which pairs of > >>specimens. The 'inter' matrix should have the same row and column names > as > >>the 'dists' variable. Simply doing > >> > >>as.matrix(inter) > >> > >>will not suffice in this case since row names and column names are not > >>preserved. > >> > >>Once I have a matrix of only interspecific distances, I can then find > >>nearest neighbours for all species in my dataset and continue from there. > >> > >>I feel it is an easy solution, but something is just not clicking. > >> > >>Any ideas? > >> > >>Please let me know if anything is unclear. > >> > >>Thanks. > >> > >>Cheers, > >> > >>Jarrett > [[alternative HTML version deleted]] ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
[R-sig-phylo] Generating a matrix of interspecfic genetic distances
Hi All, I am using the ape R package to extract interspecific distances from a pairwise distance matrix. Consider the woodmouse example. library(ape) data(woodmouse) dists <- dist.dna(woodmouse, model = "raw", as.matrix = TRUE) isSymmetric(dists) # TRUE dists[upper.tri(dists, diag = FALSE)] <- NA # replace upper triangle with NA and then subset to get lower triangular matrix isSymmetric(dists) # FALSE dists # this is a matrix labs <- labels(woodmouse) # get species names res <- outer(labs, labs, "==") # get cases where matrix elements are equal inter <- na.omit(dists[!res]) # get interspecific distances by taking elements that are not equal The problem I am facing is that I need 'inter' to be a matrix instead of a vector, so that I can tell which distances belong to which pairs of specimens. The 'inter' matrix should have the same row and column names as the 'dists' variable. Simply doing as.matrix(inter) will not suffice in this case since row names and column names are not preserved. Once I have a matrix of only interspecific distances, I can then find nearest neighbours for all species in my dataset and continue from there. I feel it is an easy solution, but something is just not clicking. Any ideas? Please let me know if anything is unclear. Thanks. Cheers, Jarrett [[alternative HTML version deleted]] ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
[R-sig-phylo] Extracting interspecific distances as a matrix
Hello, I'm trying to extract interspecific distances from a large distance matrix using APE. However, my approach returns a vector. Consider the following illustrative example: library(ape) data(woodmouse) d <- dist.dna(woodmouse, model = "raw") # get p-distance matrix labs <- labels(woodmouse) # get species names o <- outer(labs, labs, "==") # this is a matrix inter <- d[!o] # this is a vector of interspecific distances What I need is for 'inter' to be a matrix so I can use split() to split the matrix into lists by species. Simply coercing to a matrix with as.matrix(inter) destroys the structure of the distance matrix. Any assistance is appreciated. Thanks! Cheers, Jarrett [[alternative HTML version deleted]] ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
[R-sig-phylo] Extracting intraspecific and interspecific distances from dist.dna() object
Hello, I'm wondering whether someone knows of an efficient way to extract intraspecific (within-species) and interspecific (between-species) genetic distances for each species returned by ape::dist.dna() to be able to calculate the DNA barcode gap. It seems extracting the entire object as a matrix is a step in the right direction. Consider the following example: library(ape) library(spider) data(anoteropsis) # sample dataset anoDist <- dist.dna(anoteropsis, model = "raw", pairwise.deletion = FALSE, as.matrix = TRUE) # compute p-distances Any ideas? [[alternative HTML version deleted]] ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
[R-sig-phylo] Extracting sequences from DNAbin according to a condition
Hello All, I have a COI FASTA alignment file with 11926 sequences spanning 668 species. According to my code, a total of 361 species are represented by 6 or more sequences. I need to extract these 361 species (along with all their associated sequences) from the alignment but am having issues. Here is my code: library(ape) library(pegas) library(spider) library(stringr) seqs <- read.dna(file = file.choose(), format = "fasta") # import data and convert to matrix seqs.mat <- as.matrix(seqs) spp <- substr(dimnames(seqs.mat)[[1]], 1, 50) # extract sequence labels res <- str_remove(spp, "^[^|]+\\|") # remove BOLD Process ID res <- table(res)[which(table(res) >= 6)] # species with 6 or more records names(res) # 361 species names The problem is that `names(res)' contains unique species names.rather than repeated species names (according to the BOLD process ID). I also need the sequences. There are between 6-421 sequences across the 361 species (11143 sequences total). Does anyone have any ideas on next steps here? Once this is done, I would then write the sequences to a file via `write.dna()` If clarification is needed, please let me know. Thanks. Cheers, Jarrett [[alternative HTML version deleted]] ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
[R-sig-phylo] Extract singleton haplotypes with pegas::haplotype()
Hi All, I'm looking for a way to extract singleton haplotypes using the haplotype function in Pegas. I use the below function to accomplish this: extract.singleton.haps <- function(freq) { ind <- which(freq == 1) # index of singleton haplotypes single <- seqs[ind, ] # extracted haplotypes (only singletons) } I have tried the following: library(pegas) seqs <- read.dna(file = file.choose(), format = "fasta") # select FASTA file freq <- haplotype(seqs) # haplotype distribution freq <- lengths(attr(freq, "index") # extract haplotypes singleton.seqs <- extract.singleton.haps(freq) write.dna(singleton.seqs, file = paste0(taxon, "_singletons.fas"), format = "fasta") However, upon viewing the sequences in MEGA, they appear to all be identical, instead of unique. Any ideas on what is happening here? Thanks. Cheers, Jarrett [[alternative HTML version deleted]] ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
[R-sig-phylo] Improves haplotype() function in pegas 0.13
Hello, Emmanuel Paradis has recently updated the haplotype() function in pegas 0.13 to account for base ambiguities, gaps and Ns. Thank you Emmanuel! The argument 'strict' simply considers or ignores all gaps and ambiguities, but does this also consider/ignore Ns? The 'trailingGapsAsN' simply treats leading and trailing gaps as Ns, ignoring internal gaps. This argument is set to TRUE by default. >From the above, it appears that 'strict' ignores Ns. If 'strict' is set to TRUE, does this mean that TRUE/FALSE assignment 'trailingGapsAsN' is ignored as well? The reason I ask is because I use haplotype() in one of my R packages to compute optimal sample sizes for genetic diversity assessment (HACSim). Currently in my package, R throws a warning to users if missing data or base ambiguities are present within DNA alignments. Given Emmanuel's changes, it seems the warning in my package will not be needed once I set 'strict = TRUE'. I am unsure however on how to properly set 'trailingGapsAsN' to ensure that gaps do not affect haplotype calculation if they are left in the alignment. Gaps, ambiguities and Ns will cause an overestimation of haplotypes, and therefore an inflation of standing genetic variation. Can someone weigh in on this? Thanks! Cheers, Jarrett [[alternative HTML version deleted]] ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
[R-sig-phylo] Iterating though multiple FASTA files via rbind.DNAbin
Hi All, I have a folder with multiple FASTA files which need to be read into R. To avoid file overwriting, I use ape::rbind.DNAbin() as follows: file.names <- list.files(path = envr$filepath, pattern = ".fas") tmp <- matrix() for (i in 1:length(file.names)) { seqs <- read.dna(file = file.names[i], format = "fasta") seqs <- rbind.DNAbin(tmp, seqs) } When run however, I get an error saying that the files do not have the same number of columns (i.e., alignments are all not of the same length). How can I avoid this error. I feel that it's a basic fix, but one that is not immediately obvious to me. Thanks! [[alternative HTML version deleted]] ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/