[R-sig-phylo] Extracting only FASTA sequences from a FASTQ file

2024-03-11 Thread Jarrett Phillips
Hello,

I have a FASTQ file from which I would like to extract only the FASTA
sequences (and not the associated PHRED quality scores).

For instance, using the file mentioned in the R documentation for
read.fastq():

library(ape)

a <- "
https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/sequence_read/
"

file <- "SRR062641.filt.fastq.gz"
URL <- paste0(a, file)
download.file(URL, file)

fastq <- read.fastq(file) # import FASTQ file

One can easily extract the quality scores via

qual <- attr(fastq, "QUAL")

but there doesn't appear to be an attribute associated with the DNA
sequences themselves (without the scores appended); only a "class"
attribute, which evaluates to "DNAbin", is available.

I've examined the underlying code for read.fastq() and there is a variable
called DNA that is created, but it's not what I need:

function (file, offset = -33)
{
Z <- scan(file, "", sep = "\n", quiet = TRUE)
tmp <- Z[c(TRUE, TRUE, FALSE, FALSE)]
sel <- c(TRUE, FALSE)
tmp[sel] <- gsub("^@", ">", tmp[sel])
fl <- tempfile()
cat(tmp, file = fl, sep = "\n")
DNA <- read.FASTA(fl)
tmp <- Z[c(FALSE, FALSE, FALSE, TRUE)]
QUAL <- lapply(tmp, function(x) as.integer(charToRaw(x)))
if (offset)
QUAL <- lapply(QUAL, "+", offset)
names(QUAL) <- names(DNA)
attr(DNA, "QUAL") <- QUAL
DNA
}

I then tried the following:

x <- as.character.DNAbin(fastq) # extract DNA sequences
y <- as.alignment(x)

seqs <- y$seq

but 'seqs' can't be written to a file using write.FASTA().


Any assistance is greatly appreciated.

Thanks!

Cheers,

Jarrett

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


Re: [R-sig-phylo] Generating a matrix of interspecfic genetic distances

2023-02-05 Thread Jarrett Phillips
Hi Wolfgang,

Apologies for not previously linking to the mailing list.

I believe the below code gets me what I am in need of.

res <- outer(Species, Species, "==") # logical matrix where
off-diagonal elements are all FALSE
is.na(no_outliers_dist_matrix) <- res # set all intraspecific distances
to NA, leaving only interspecific values

As a result, no_outliers_dist_matrix now contains only interspecific
distances.

Note, no_outliers_dist_matrix is lower triangular since I first ran the code

 dists[upper.tri(dists, diag = FALSE)] <- NA # replace upper triangle with
NA and then subset to get lower triangular matrix


Thanks.

Cheers,

Jarrett


On Sun, Feb 5, 2023 at 12:50 PM Viechtbauer, Wolfgang (NP) <
wolfgang.viechtba...@maastrichtuniversity.nl> wrote:

> Please also reply to the list, not just to me.
>
> Unfortunately, I still don't get it. Can you show what you want the
> *final* matrix to look like for this specific example?
>
> Best,
> Wolfgang
>
> >-Original Message-
> >From: Jarrett Phillips [mailto:phillipsjarre...@gmail.com]
> >Sent: Sunday, 05 February, 2023 18:30
> >To: Viechtbauer, Wolfgang (NP)
> >Subject: Re: [R-sig-phylo] Generating a matrix of interspecfic genetic
> distances
> >
> >Thanks Wolfgang,
> >
> >‘dists’ contains both intraspecific and interspecific comparisons. The
> ‘inter’
> >variable extracts interspecific values from ‘dists’, but returns a vector
> instead
> >of a matrix.
> >
> >I’m wondering how (if it is possible) I can return ‘inter’ as a matrix.
> The
> >result likely will not be a square matrix, which is okay.
> >
> >As mentioned in my previous email, as.matrix(inter) simply returns a
> matrix with
> >a single column, so destroys the block structure of matrices, and also
> drops
> >relevant row and column names found in ‘dists’.
> >
> >‘res’ is used to return all elements of ‘dists’ except the main diagonal.
> >
> >If things are still unclear, please let me know.
> >
> >Cheers,
> >
> >Jarrett
> >
> >On Sun, Feb 5, 2023 at 6:10 AM Viechtbauer, Wolfgang (NP)
> > wrote:
> >Dear Jarrett,
> >
> >I am not sure what it is that you are trying to accomplish with the last
> two
> >lines. Isn't 'dists' already what you want? If you just want to set the
> diagonal
> >of 'dists' to NA, then 'diag(dists) <- NA' would do that. But maybe I am
> missing
> >the point here entirely.
> >
> >Best,
> >Wolfgang
> >
> >>-Original Message-
> >>From: R-sig-phylo [mailto:r-sig-phylo-boun...@r-project.org] On Behalf
> Of
> >Jarrett
> >>Phillips
> >>Sent: Sunday, 05 February, 2023 8:31
> >>To: r-sig-phylo@r-project.org
> >>Subject: [R-sig-phylo] Generating a matrix of interspecfic genetic
> distances
> >>
> >>Hi All,
> >>
> >>I am using the ape R package to extract interspecific distances from a
> >>pairwise distance matrix.
> >>
> >>Consider the woodmouse example.
> >>
> >>library(ape)
> >>
> >>data(woodmouse)
> >>
> >>dists <- dist.dna(woodmouse, model = "raw", as.matrix = TRUE)
> >>isSymmetric(dists) # TRUE
> >>dists[upper.tri(dists, diag = FALSE)] <- NA # replace upper triangle
> >>with NA and then subset to get lower triangular matrix
> >>isSymmetric(dists) # FALSE
> >>dists # this is a matrix
> >>
> >>labs <- labels(woodmouse) # get species names
> >>
> >>res <- outer(labs, labs, "==") # get cases where matrix elements are
> >>equal
> >>inter <- na.omit(dists[!res]) # get interspecific distances by taking
> >>elements that are not equal
> >>
> >>The problem I am facing is that I need 'inter' to be a matrix instead of
> a
> >>vector, so that I can tell which distances belong to which pairs of
> >>specimens. The 'inter' matrix should have the same row and column names
> as
> >>the 'dists' variable. Simply doing
> >>
> >>as.matrix(inter)
> >>
> >>will not suffice in this case since row names and column names are not
> >>preserved.
> >>
> >>Once I have a matrix of only interspecific distances, I can then find
> >>nearest neighbours for all species in my dataset and continue from there.
> >>
> >>I feel it is an easy solution, but something is just not clicking.
> >>
> >>Any ideas?
> >>
> >>Please let me know if anything is unclear.
> >>
> >>Thanks.
> >>
> >>Cheers,
> >>
> >>Jarrett
>

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


[R-sig-phylo] Generating a matrix of interspecfic genetic distances

2023-02-04 Thread Jarrett Phillips
Hi All,

I am using the ape R package to extract interspecific distances from a
pairwise distance matrix.

Consider the woodmouse example.

library(ape)

data(woodmouse)

dists <- dist.dna(woodmouse, model = "raw", as.matrix = TRUE)
isSymmetric(dists) # TRUE
dists[upper.tri(dists, diag = FALSE)] <- NA # replace upper triangle
with NA and then subset to get lower triangular matrix
isSymmetric(dists) # FALSE
dists # this is a matrix

labs <- labels(woodmouse) # get species names

res <- outer(labs, labs, "==") # get cases where matrix elements are
equal
inter <- na.omit(dists[!res]) # get interspecific distances by taking
elements that are not equal


The problem I am facing is that I need 'inter' to be a matrix instead of a
vector, so that I can tell which distances belong to which pairs of
specimens. The 'inter' matrix should have the same row and column names as
the 'dists' variable. Simply doing

as.matrix(inter)

will not suffice in this case since row names and column names are not
preserved.

Once I have a matrix of only interspecific distances, I can then find
nearest neighbours for all species in my dataset and continue from there.

I feel it is an easy solution, but something is just not clicking.

Any ideas?

Please let me know if anything is unclear.

Thanks.

Cheers,

Jarrett

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


[R-sig-phylo] Extracting interspecific distances as a matrix

2023-01-16 Thread Jarrett Phillips
Hello,

I'm trying to extract interspecific distances from a large distance matrix
using APE. However, my approach returns a vector.

Consider the following illustrative example:

library(ape)
data(woodmouse)

d <- dist.dna(woodmouse, model = "raw") # get p-distance matrix
labs <- labels(woodmouse) # get species names
o <- outer(labs, labs, "==") # this is a matrix
inter <- d[!o] # this is a vector of interspecific distances

What I need is for 'inter' to be a matrix so I can use split() to split the
matrix into lists by species.

Simply coercing to a matrix with as.matrix(inter) destroys the structure of
the distance matrix.

Any assistance is appreciated.

Thanks!

Cheers,

Jarrett

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


[R-sig-phylo] Extracting intraspecific and interspecific distances from dist.dna() object

2022-10-18 Thread Jarrett Phillips
Hello,

I'm wondering whether someone knows of an efficient way to extract
intraspecific (within-species) and interspecific (between-species) genetic
distances for each species returned by ape::dist.dna() to be able to
calculate the DNA barcode gap.

It seems extracting the entire object as a matrix is a step in the right
direction.

Consider the following example:

library(ape)
library(spider)

data(anoteropsis) # sample dataset
anoDist <- dist.dna(anoteropsis, model = "raw", pairwise.deletion =
FALSE, as.matrix = TRUE) # compute p-distances

Any ideas?

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


[R-sig-phylo] Extracting sequences from DNAbin according to a condition

2021-06-27 Thread Jarrett Phillips
Hello All,

I have a COI FASTA alignment file with 11926 sequences spanning 668 species.

According to my code, a total of 361 species are represented by 6 or more
sequences.

I need to extract these 361 species (along with all their associated
sequences) from the alignment but am having issues.

Here is my code:

library(ape)
library(pegas)
library(spider)
library(stringr)

seqs <- read.dna(file = file.choose(), format = "fasta") # import data
and convert to matrix
seqs.mat <- as.matrix(seqs)

spp <- substr(dimnames(seqs.mat)[[1]], 1, 50) # extract sequence labels

res <- str_remove(spp, "^[^|]+\\|") # remove BOLD Process ID

res <- table(res)[which(table(res) >= 6)] # species with 6 or more
records
names(res) # 361 species names


The problem is that `names(res)' contains unique species names.rather than
repeated species names (according to the BOLD process ID).

I also need the sequences. There are between 6-421 sequences across the 361
species (11143 sequences total).

Does anyone have any ideas on next steps here?

Once this is done, I would then write the sequences to a file via
`write.dna()`

If clarification is needed, please let me know.


Thanks.

Cheers,

Jarrett

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


[R-sig-phylo] Extract singleton haplotypes with pegas::haplotype()

2021-04-26 Thread Jarrett Phillips
Hi All,

I'm looking for a way to extract singleton haplotypes using the haplotype
function in Pegas.

I use the below function to accomplish this:

extract.singleton.haps <- function(freq) {
  ind <- which(freq == 1) # index of singleton haplotypes
  single <- seqs[ind, ] # extracted haplotypes (only singletons)
}

I have tried the following:

library(pegas)
seqs <- read.dna(file = file.choose(), format = "fasta") # select FASTA
file
freq <- haplotype(seqs) # haplotype distribution
freq <- lengths(attr(freq, "index") # extract haplotypes
singleton.seqs <- extract.singleton.haps(freq)
write.dna(singleton.seqs, file = paste0(taxon, "_singletons.fas"),
format = "fasta")


However, upon viewing the sequences in MEGA, they appear to all be
identical, instead of unique.

Any ideas on what is happening here?

Thanks.

Cheers,

Jarrett

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


[R-sig-phylo] Improves haplotype() function in pegas 0.13

2020-05-13 Thread Jarrett Phillips
Hello,

Emmanuel Paradis has recently updated the haplotype() function in pegas
0.13 to account for base ambiguities, gaps and Ns. Thank you Emmanuel!

The argument  'strict' simply considers or ignores all gaps and
ambiguities, but does this also consider/ignore Ns?

The 'trailingGapsAsN' simply treats leading and trailing gaps as Ns,
ignoring internal gaps. This argument is set to TRUE by default.

>From the above, it appears that  'strict' ignores Ns. If 'strict' is set to
TRUE, does this mean that TRUE/FALSE assignment 'trailingGapsAsN' is ignored
as well?

The reason I ask is because I use haplotype() in one of my R packages to
compute optimal sample sizes for genetic diversity assessment (HACSim).
Currently in my package, R throws a warning to users if missing data or
base ambiguities are present within DNA alignments.

Given Emmanuel's changes, it seems the warning in my package will not be
needed once I set 'strict = TRUE'. I am unsure however on how to properly
set 'trailingGapsAsN' to ensure that gaps do not affect haplotype
calculation if they are left in the alignment. Gaps, ambiguities and Ns
will cause an overestimation of haplotypes, and therefore an inflation of
standing genetic variation.

Can someone weigh in on this?

Thanks!

Cheers,

Jarrett

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


[R-sig-phylo] Iterating though multiple FASTA files via rbind.DNAbin

2020-03-12 Thread Jarrett Phillips
Hi All,

I have a folder with multiple FASTA files which need to be read into R.

To avoid file overwriting, I use ape::rbind.DNAbin() as follows:

file.names <- list.files(path = envr$filepath, pattern = ".fas")
  tmp <- matrix()
  for (i in 1:length(file.names)) {
seqs <- read.dna(file = file.names[i], format = "fasta")
seqs <- rbind.DNAbin(tmp, seqs)
  }

When run however, I get an error saying that the files do not have the same
number of columns (i.e., alignments are all not of the same length).

How can I avoid this error. I feel that it's a basic fix, but one that is
not immediately obvious to me.

Thanks!

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/