Re: [R-sig-phylo] read.nexus.data parser

Klaus Schliep Tue, 09 Apr 2013 04:18:16 -0700

Hi all,

I had a short look at the code and found a few places where the
read.nexus.data function can be sped up. I have cc'd Emmanuel so that he
can put it into the next ape release if it works.
Generally I agree with Johan that if speed matters, FASTA files are the
way to go. Nexus files are ugly to parse and contain many
inconsistencies, like parameters inside comments [].
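
To illustrate why FASTA is so much easier to parse, here is a minimal
base-R sketch of a FASTA writer and reader. The helper names write_fasta
and read_fasta are hypothetical (they are not part of ape), and the reader
assumes every record has a ">" header followed by at least one sequence
line. The idea is to pay the Nexus parsing cost once, save the result as
FASTA, and read the FASTA file from then on.

```r
# Hypothetical helpers (not part of ape), base R only.

# seqs: named list of character vectors, one element per site,
# i.e. the structure returned by read.nexus.data.
write_fasta <- function(seqs, file) {
  n <- length(seqs)
  lines <- character(2 * n)
  lines[seq(1, by = 2, length.out = n)] <- paste0(">", names(seqs))
  lines[seq(2, by = 2, length.out = n)] <-
    vapply(seqs, paste, character(1), collapse = "")
  writeLines(lines, file)
}

# Assumes each record has a header line and at least one sequence line.
read_fasta <- function(file) {
  x <- readLines(file)
  x <- x[nzchar(x)]
  is_header <- startsWith(x, ">")
  # Every sequence line belongs to the most recent header line.
  group <- cumsum(is_header)
  chunks <- split(x[!is_header], group[!is_header])
  seqs <- lapply(chunks, function(s) {
    strsplit(tolower(paste(s, collapse = "")), "")[[1]]
  })
  names(seqs) <- sub("^>\\s*", "", x[is_header])
  seqs
}

# Round trip on a toy alignment.
seqs <- list(t1 = c("a", "c", "g", "t"), t2 = c("a", "c", "g", "-"))
f <- tempfile(fileext = ".fasta")
write_fasta(seqs, f)
back <- read_fasta(f)
stopifnot(identical(back, seqs))
```

In practice one would call write_fasta once on the output of
read.nexus.data and keep the FASTA file around for later sessions.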


Regards,
Klaus


On 4/9/13, Johan Nylander <johan.nylan...@abc.se> wrote:
> Dear All,
>
> Just to avoid confusion, the readNexus function is in the phylobase
> package. And as Ben pointed out, other packages have their own functions
> for reading the data part from a nexus-formatted file, see e.g., read.nex
> in phyloch.
>
> On a related note, I wrote read.nexus.data as a "temporary", crude parsing
> function while waiting for the phylobase project to take off (phylobase
> uses NCL by Lewis & Holder - _the_ nexus parser), so expect
> read.nexus.data to have its limitations.
>
> Furthermore, if speed is the concern, it would perhaps be preferable to
> first convert the Nexus data to Fasta, and then use one of the many
> fast(er) parsers implemented in numerous R packages.
>
> Cheers
> Johan
>
>
> On 04/07/2013 02:59 PM, Ben Bolker wrote:
>> On 13-04-05 01:29 PM, Jessica Sabo wrote:
>>> Hi All,
>>>
>>> I am wondering if there is any way to increase the speed of the
>>> read.nexus.data parser, or if there is an alternative, faster
>>> parser for nexus data files.
>>>
>>> Thanks, Jess
>>>
>>
>>    I don't know if it's faster or not, but there is ?readNexus in the
>> 'ape' package.  Also see library("sos"); findFn("read {nexus format}")
>>
>> _______________________________________________
>> R-sig-phylo mailing list - R-sig-phylo@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>> Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
>>
>


-- 
Klaus Schliep
Phylogenomics Lab at the University of Vigo, Spain

read.nexus.data <- function (file) 
{
  "find.ntax" <- function(x) {
    for (i in 1:NROW(x)) {
      if (any(f <- grep("\\bntax", x[i], ignore.case = TRUE))) {
        ntax <- as.numeric(sub("(.+?)(ntax\\s*\\=\\s*)(\\d+)(.+)", 
                               "\\3", x[i], perl = TRUE, ignore.case = TRUE))
        break
      }
    }
    ntax
  }
  "find.nchar" <- function(x) {
    for (i in 1:NROW(x)) {
      if (any(f <- grep("\\bnchar", x[i], ignore.case = TRUE))) {
        nchar <- as.numeric(sub("(.+?)(nchar\\s*\\=\\s*)(\\d+)(.+)", 
                                "\\3", x[i], perl = TRUE, ignore.case = TRUE))
        break
      }
    }
    nchar
  }
  "find.matrix.line" <- function(x) {
    for (i in 1:NROW(x)) {
      if (any(f <- grep("\\bmatrix\\b", x[i], ignore.case = TRUE))) {
        matrix.line <- as.numeric(i)
        break
      }
    }
    matrix.line
  }
  "trim.whitespace" <- function(x) {
    gsub("\\s+", "", x)
  }
  "trim.semicolon" <- function(x) {
    gsub(";", "", x)
  }
  X <- scan(file = file, what = character(), sep = "\n", quiet = TRUE, 
            comment.char = "[", strip.white = TRUE)
  ntax <- find.ntax(X)
  nchar <- find.nchar(X)
  matrix.line <- find.matrix.line(X)
  start.reading <- matrix.line + 1
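  # Preallocate one vector of length nchar per taxon so that sequences are
  # filled in place instead of being grown line by line.
  Obj <- vector("list", ntax)
  for (i in seq_len(ntax)) Obj[[i]] <- rep(NA, nchar)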
  
  i <- 1
  pos <- 0
  tot.nchar <- 0
  tot.ntax <- 0
  for (j in start.reading:NROW(X)) {
    Xj <- trim.semicolon(X[j])
    if (Xj == "") {
      break
    }
    if (any(jtmp <- grep("\\bend\\b", X[j], perl = TRUE, ignore.case = TRUE))) {
      break
    }
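    # Split the line at runs of whitespace; a valid line yields exactly two
    # pieces, the taxon name and the sequence.
    ts <- unlist(strsplit(Xj, "(?<=\\S)(\\s+)(?=\\S)", perl = TRUE))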
    if (length(ts) > 2) {
      stop("nexus parser does not handle spaces in sequences or taxon names (ts>2)")
    }
    if (length(ts) != 2) {
      stop("nexus parser failed to read the sequences (ts!=2)")
    }
    Seq <- trim.whitespace(ts[2])
    Name <- trim.whitespace(ts[1])
    # Exact lookup of the taxon name. Building a regex from the name would
    # break on names containing metacharacters and could match several taxa.
    l <- match(Name, names(Obj))
    
    if (!is.na(l)) {
      tsp <- strsplit(Seq, NULL)[[1]]
      
      Obj[[l]][pos + c(1:length(tsp))] <- tsp
      chars.done <- length(tsp)           
      
    }
    else {
      names(Obj)[i] <- Name
      tsp <- strsplit(Seq, NULL)[[1]]
      
      Obj[[i]][pos + c(1:length(tsp))] <- tsp
      chars.done <- length(tsp)  
      
    }
    tot.ntax <- tot.ntax + 1
    if (tot.ntax == ntax) {
      i <- 1
      tot.ntax <- 0
      tot.nchar <- tot.nchar + chars.done
      if (tot.nchar == nchar * ntax) {
        print("ntot was more than nchar*ntax")
        break
      }
      pos <- tot.nchar
    }
    else {
      i <- i + 1
    }
  }
  if (tot.ntax != 0) {
    cat("ntax:", ntax, "differs from actual number of taxa in file?\n")
    stop("nexus parser did not read names correctly (tot.ntax!=0)")
  }
  for (i in 1:length(Obj)) {
    if (length(Obj[[i]]) != nchar) {
      cat(names(Obj[i]), "has", length(Obj[[i]]), "characters\n")
      stop("nchar differs from sequence length (length(Obj[[i]])!=nchar)")
    }
  }
  Obj <- lapply(Obj, tolower)
  Obj
}
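
As a rough illustration of the kind of change that helps here: the
function above fills each preallocated vector with a single vectorised
assignment per line (Obj[[i]][pos + c(1:length(tsp))] <- tsp). The sketch
below compares that against assigning one character at a time. The helper
names fill_loop and fill_vectorised are hypothetical and exist only for
this comparison; timings will vary by machine, so none are quoted.

```r
# Hypothetical helpers for illustration only.

# One assignment per character.
fill_loop <- function(chars, n) {
  out <- rep(NA_character_, n)
  for (p in seq_along(chars)) out[p] <- chars[p]
  out
}

# One vectorised assignment for the whole line.
fill_vectorised <- function(chars, n) {
  out <- rep(NA_character_, n)
  out[seq_along(chars)] <- chars
  out
}

set.seed(1)
chars <- sample(c("a", "c", "g", "t"), 1e5, replace = TRUE)

# Both approaches give the same result; only the running time differs.
stopifnot(identical(fill_loop(chars, 1e5), fill_vectorised(chars, 1e5)))

system.time(fill_loop(chars, 1e5))
system.time(fill_vectorised(chars, 1e5))
```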

