This sounds great, Ben. Have you talked to Rod Page about his "Elsevier Grand Challenge" project (<http://precedings.nature.com/documents/2217/version/1>), which involves parsing PDFs from Molecular Phylogenetics and Evolution to extract trees and other data? It sounds like you two might encounter similar issues.

Brian


On Nov 4, 2008, at 12:47 PM, bbolker wrote:

[background for r-sig-phylo: some of us have been talking about
the problems of grabbing trees from the literature when they
are not available in TreeBase or as Nexus or Newick format
from the authors.  Reconstructing Newick format from a big
tree is a huge pain, as anyone who has tried it will know, and
even then one wants the branch lengths as well as the topology]

  The problem of reconstructing trees from a set of (x,y) points
turns out not to be all that hard -- even "trivial" from the
computational point of view. The R function below takes
a set of (x,y) points, number of tips, and tip labels, and
returns a tree in "phylo" format [it assumes that all
the tips are first in the list of points, otherwise I think
order shouldn't matter].  I haven't tried it on
ultrametric trees, and I know that polytomies will
be trouble.

  The examples below take the node (x,y) locations from
some of the ape examples (the tiny "owl tree" and the
bird.orders data set), which are retrievable using some
black magic, and reconstruct the trees.  **The trees do
not come back in the same order** (is this a problem?)
but they are equivalent.
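
One way to check that two trees are equivalent regardless of the order in which tips and edges are stored is to compare their tip-to-tip path lengths. This is a minimal sketch, assuming both trees have branch lengths and the same tip labels; `same.tree` is a made-up helper name, not part of ape, and the comparison does not look at where the root is placed.

```r
library(ape)

## hypothetical helper (not part of ape): compare two trees through their
## tip-to-tip path-length matrices, which do not depend on the order in
## which tips and edges are stored
same.tree <- function(t1, t2, tol = 1e-8) {
  if (!setequal(t1$tip.label, t2$tip.label)) return(FALSE)
  labs <- sort(t1$tip.label)
  d1 <- cophenetic(t1)[labs, labs]
  d2 <- cophenetic(t2)[labs, labs]
  isTRUE(all.equal(d1, d2, tolerance = tol))
}

## the owl tree written in two different orders
owls1 <- read.tree(text = "(((Strix_aluco:4.2,Asio_otus:4.2):3.1,Athene_noctua:7.3):6.3,Tyto_alba:13.5);")
owls2 <- read.tree(text = "(Tyto_alba:13.5,(Athene_noctua:7.3,(Asio_otus:4.2,Strix_aluco:4.2):3.1):6.3);")
same.tree(owls1, owls2)
```

ape's all.equal method for "phylo" objects is another option for this kind of comparison.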

   Getting the (x,y) points into R in the first place is also
a potential challenge.  Two possible solutions: (1) use g3data
(notes included below), a standalone, cross-platform
utility for retrieving point locations from image files;
(2) write a small R program that takes an image file,
plots it, and uses locator() to get the points (reading
the image with pixmap::read.pnm?).
I think I've written something like this
before, but would have to dig it up or redo it -- and
g3data has a nicer interface.
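
The locator() idea could look roughly like the sketch below. The function names `digitize.image` and `calibrate.axis` are made up for illustration. `img` is assumed to be a raster array already read into R (for example with png::readPNG, which is an assumption; pixmap would be another route), and the calibration assumes linear axes. The file name and reference values in the commented usage lines are placeholders.

```r
## hypothetical helper: map clicked coordinates onto the figure's own axis
## scale, given two reference positions with known values (linear axis)
calibrate.axis <- function(pix, pix.ref, val.ref) {
  val.ref[1] + (pix - pix.ref[1]) * diff(val.ref) / diff(pix.ref)
}

## hypothetical helper: show an image and collect clicked points.
## `img` is a raster array (height x width [x channels], values in [0,1]);
## `pick` is the function used to collect points, locator() by default
digitize.image <- function(img, n = 512, pick = locator) {
  d <- dim(img)
  op <- par(mar = rep(0, 4))
  on.exit(par(op))
  plot(c(0, d[2]), c(0, d[1]), type = "n", axes = FALSE,
       xlab = "", ylab = "", xaxs = "i", yaxs = "i", asp = 1)
  rasterImage(img, 0, 0, d[2], d[1])
  pts <- pick(n)
  data.frame(x = pts$x, y = pts$y)
}

## in an interactive session (placeholder file name and reference values):
## img <- png::readPNG("tree.png")
## pts <- digitize.image(img)   # click the points, then end locator()
## x   <- calibrate.axis(pts$x, pix.ref = c(40, 400), val.ref = c(0, 10))
```

The points would still have to be clicked tips first, since that is what the tree-building function expects.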

##
library(ape)

## from ?plot.tree:
cat("(((Strix_aluco:4.2,Asio_otus:4.2):3.1,",
    "Athene_noctua:7.3):6.3,Tyto_alba:13.5);",
    file = "ex.tre", sep = "\n")
tree.owls <- read.tree("ex.tre")
plot(tree.owls)
unlink("ex.tre") # delete the file "ex.tre"

plot(tree.owls)
xy <- get("last_plot.phylo",envir=.PlotPhyloEnv)
xx <- xy$xx
yy <- xy$yy
points(xx,yy,col="white",pch=16,cex=2)
text(xx,yy,col=2,1:length(xx))

## assumes left-to-right horizontal tree -- may need some logic for
##  different directions
## assumes first N points are tips.
##
## polytomies?? may need to be explicitly identified ...
## should?? work on non-ultrametric trees, but untested
build.tree <- function(xx,yy,tip.labels,ntips,
                       poly=numeric(0),
                       debug=FALSE) {
  if (!missing(tip.labels)) ntips <- length(tip.labels)
  nodes <- 1:length(xx)
  is.tip <- nodes<=ntips
  if (which.min(xx)!=ntips+1) {
    ## reorder internal nodes by increasing x so that the root
    ## (smallest x) becomes node ntips+1, as ape/phylo expects
    internal <- which(!is.tip)
    ord <- order(xx[internal])
    yy[internal] <- yy[internal][ord]
    xx[internal] <- xx[internal][ord]
  }
  edges <- matrix(nrow=0,ncol=2)
  edge.length <- numeric(0)
  nnode <- length(xx)-ntips
  while (length(xx)>1) {
    ## find next node to include
    nextnode <- which(!is.tip & xx==max(xx[!is.tip]))[1]
    ## find daughters
    dist <- abs(yy-yy[nextnode])
    daughters <- which(is.tip & dist==min(dist[is.tip]))
    ## be careful with numeric fuzz?
    edges <- rbind(edges,
                   nodes[c(nextnode,daughters[1])],
                   nodes[c(nextnode,daughters[2])])
    edge.length <- c(edge.length,xx[daughters]-xx[nextnode])
    xx <- xx[-daughters]
    yy <- yy[-daughters]
    is.tip[nextnode] <- TRUE
    is.tip <- is.tip[-daughters]
    nodes <- nodes[-daughters]
  }
  zz <- list(tip.label=tip.labels,  ## "phylo" objects use the name tip.label
             edge=edges,
             edge.length=edge.length,
             Nnode=nnode)
  class(zz) <- "phylo"
  zz <- reorder(zz)
  zz
}

newtree <- build.tree(xx,yy,tree.owls$tip.label)

data(bird.orders)
plot(bird.orders,show.node.label=TRUE)
xy <- get("last_plot.phylo",envir=.PlotPhyloEnv)
## update xx, yy before plotting, otherwise the owl-tree points are drawn
xx <- xy$xx
yy <- xy$yy
points(xx,yy,col="white",pch=16,cex=2)
text(xx,yy,col=2,1:length(xx))

newtree2 <- build.tree(xx,yy,bird.orders$tip.label)

===========
g3data notes:
============

INSTALLATION: install g3data and (for Windows) clip2png.jar

Ubuntu and other Debians:

  sudo apt-get install g3data

Windows:
   http://www.frantz.fi/software/Windows/g3data-1.5.1-win32.zip

Mac (OS X 10.4 or 10.5): available via fink
  http://www.finkproject.org/doc/users-guide/index.php
  fink install g3data (?) or
  fink -b install g3data


 get clip2png.jar :

 google "clip2png.jar", or go to ...
 http://sourceforge.net/project/showfiles.php?group_id=185579
 click on "download"
 scroll down and click on "clip2png.jar"
 save it somewhere (desktop?)

USAGE

  open the paper in your favorite PDF viewer
  select the desired figure, including axes but as little else as possible,
    and copy to the clipboard, then save the clipboard as a PNG or GIF

  OR adjust the PDF window so the figure fills it and take a snapshot
    of the window (on Ubuntu: alt-printscreen), save as PNG or GIF

  open g3data

  click on two points on the X and Y axis, fill in values

  click on points

  if you need to compress the display so that you can see the output
    actions, use the View menu or function keys to toggle display of
    zoom area (F5), axis settings (F6), or output properties (F7)

  for multiple series, either click on points in order (e.g. work
    left-to-right for each series), then edit your output to put tags
    on increasing series, or output each series to a separate data file

  note that by default g3data will save your data to a file named after
    your graphics file, e.g. "mydata.png.dat" -- which means that it will
    show up in Windows as a file called "mydata.png", with a DAT file
    type -- which may be confusing.

  reading into excel: use "Data" menu to separate into columns
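
To read the g3data output into R rather than Excel, something like the following may work. It assumes the .dat file holds two whitespace-separated numeric columns (x, then y), one clicked point per line, so check that against your own output first. `read.g3data` is a made-up helper name, and the temporary file only stands in for a real "mydata.png.dat".

```r
## hypothetical helper: read a g3data output file, assuming two
## whitespace-separated numeric columns (x, y), one point per line
read.g3data <- function(file) {
  read.table(file, header = FALSE, col.names = c("x", "y"))
}

## stand-in for a real "mydata.png.dat" file, written to a temporary file
tmp <- tempfile(fileext = ".dat")
writeLines(c("0.0 1.0", "4.2 2.5", "7.3 3.0"), tmp)
xy <- read.g3data(tmp)
unlink(tmp)
```

The columns xy$x and xy$y could then be handed to the tree-building function, provided the tips were clicked first.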

  Wish list for g3data:

csv format output?
series tagging?
keyboard shortcuts for Save (Ctrl-S), Save As (Ctrl-A)?
built-in documentation?


plot(newtree2)

_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


________________________________
Brian O'Meara
NESCent
Durham, NC
http://www.brianomeara.info
