Hi all, I had a short look in the code and found some bits to speed the read.nexus.data function up. I added Emmanuel on the list so he may can put it into the next ape release if it does work. Generally I agree with Johan that if speed matters fasta files are the way to go. Nexus files are ugly to parse and contain many inconsistencies, like parameters inside comments [].
Regards, Klaus On 4/9/13, Johan Nylander <johan.nylan...@abc.se> wrote: > Dear All, > > Just to avoid confusion, the readNexus function is in the phylobase > package. And as Ben pointed out, other packages have their own functions > for reading the data part from a nexus-formatted file, see e.g., read.nex > in phyloch. > > On a related note, I wrote read.nexus.data as a "temporary", crude parsing > function while waiting for the phylobase project to take off (phylobase > uses NCL by Lewis & Holder - _the_ nexus parser), so expect > read.nexus.data to have it's limitations. > > Furthermore, if speed is the concern, it would perhaps be preferable to > first convert the Nexus data to Fasta, and then use one of the many > fast(er) parsers implemented in numerous R packages. > > Cheers > Johan > > > On 04/07/2013 02:59 PM, Ben Bolker wrote:> On 13-04-05 01:29 PM, Jessica > Sabo wrote: >>> Hi All, >>> >>> I am wondering if there is anyway to increase the speed of the >>> read.nexus.data parser. Or if there is an alternative that is a >>> faster nexus file data parser. >>> >>> THanks, Jess >>> >> >> I don't know if it's faster or not, but there is ?readNexus in the >> 'ape' package. Also see library("sos"); findFn("read {nexus format}") >> >> _______________________________________________ >> R-sig-phylo mailing list - R-sig-phylo@r-project.org >> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo >> Searchable archive at > http://www.mail-archive.com/r-sig-phylo@r-project.org/ >> > > _______________________________________________ > R-sig-phylo mailing list - R-sig-phylo@r-project.org > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > Searchable archive at > http://www.mail-archive.com/r-sig-phylo@r-project.org/ > -- Klaus Schliep Phylogenomics Lab at the University of Vigo, Spain
read.nexus.data <- function (file) { "find.ntax" <- function(x) { for (i in 1:NROW(x)) { if (any(f <- grep("\\bntax", x[i], ignore.case = TRUE))) { ntax <- as.numeric(sub("(.+?)(ntax\\s*\\=\\s*)(\\d+)(.+)", "\\3", x[i], perl = TRUE, ignore.case = TRUE)) break } } ntax } "find.nchar" <- function(x) { for (i in 1:NROW(x)) { if (any(f <- grep("\\bnchar", x[i], ignore.case = TRUE))) { nchar <- as.numeric(sub("(.+?)(nchar\\s*\\=\\s*)(\\d+)(.+)", "\\3", x[i], perl = TRUE, ignore.case = TRUE)) break } } nchar } "find.matrix.line" <- function(x) { for (i in 1:NROW(x)) { if (any(f <- grep("\\bmatrix\\b", x[i], ignore.case = TRUE))) { matrix.line <- as.numeric(i) break } } matrix.line } "trim.whitespace" <- function(x) { gsub("\\s+", "", x) } "trim.semicolon" <- function(x) { gsub(";", "", x) } X <- scan(file = file, what = character(), sep = "\n", quiet = TRUE, comment.char = "[", strip.white = TRUE) ntax <- find.ntax(X) nchar <- find.nchar(X) matrix.line <- find.matrix.line(X) start.reading <- matrix.line + 1 Obj <- vector("list", ntax) for(i in 1:ntax)Obj[[i]] = rep(NA, nchar) i <- 1 pos <- 0 tot.nchar <- 0 tot.ntax <- 0 for (j in start.reading:NROW(X)) { Xj <- trim.semicolon(X[j]) if (Xj == "") { break } if (any(jtmp <- grep("\\bend\\b", X[j], perl = TRUE, ignore.case = TRUE))) { break } ts <- unlist(strsplit(Xj, "(?<=\\S)(\\s+)(?=\\S)", perl = TRUE)) # browser() if (length(ts) > 2) { stop("nexus parser does not handle spaces in sequences or taxon names (ts>2)") } if (length(ts) != 2) { stop("nexus parser failed to read the sequences (ts!=2)") } Seq <- trim.whitespace(ts[2]) Name <- trim.whitespace(ts[1]) nAME <- paste(c("\\b", Name, "\\b"), collapse = "") if (any(l <- grep(nAME, names(Obj)))) { tsp <- strsplit(Seq, NULL)[[1]] Obj[[l]][pos + c(1:length(tsp))] <- tsp chars.done <- length(tsp) } else { names(Obj)[i] <- Name tsp <- strsplit(Seq, NULL)[[1]] Obj[[i]][pos + c(1:length(tsp))] <- tsp chars.done <- length(tsp) } tot.ntax <- tot.ntax + 1 if (tot.ntax == ntax) { i <- 1 tot.ntax <- 0 tot.nchar <- tot.nchar + chars.done if (tot.nchar == nchar * ntax) { print("ntot was more than nchar*ntax") break } pos <- tot.nchar } else { i <- i + 1 } } if (tot.ntax != 0) { cat("ntax:", ntax, "differ from actual number of taxa in file?\n") stop("nexus parser did not read names correctly (tot.ntax!=0)") } for (i in 1:length(Obj)) { if (length(Obj[[i]]) != nchar) { cat(names(Obj[i]), "has", length(Obj[[i]]), "characters\n") stop("nchar differ from sequence length (length(Obj[[i]])!=nchar)") } } Obj <- lapply(Obj, tolower) Obj }
_______________________________________________ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/