Thanks DTH.
Fortunately the HTML I am parsing is clean, and I'm consistently
scraping the same pages.
Saxon seems to be most in line with what I'm looking for, since it
handles XPath 2.0, so I'll try it out. Otherwise I might have to use
a Java library.
-Dan
On Aug 23, 3:53 am, DTH <dth...@gmail.com> wrote:
There are a number of options, depending on your needs:
- The standard JRE libraries for XML parsing and XPath (javax.xml.*).
These have the benefit of having seen wide usage (outside of Clojure),
and would allow you to migrate existing XPaths over unchanged.
- clojure.xml - a more Clojure-like way of parsing and working with XML.
- clojure.zip - which can take the XML from above (in addition to many
other things) and provides a functional way of traversing and editing
the resulting tree of elements.
- clojure.contrib.zip-filter.xml - provides a means to extract data
from clojure.xml structures using a syntax loosely similar to XPath.
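To make the first option concrete, here is a minimal sketch of
evaluating an XPath with the javax.xml.* classes that ship with the
JRE. The class name ImageSrcExample, the helper evaluate, and the
sample markup are made up for illustration. It assumes the input is
already well-formed XML, and note that the built-in engine implements
XPath 1.0 only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of any library mentioned in this thread.
class ImageSrcExample {

    // Parses well-formed XML/XHTML text into a DOM tree and returns the
    // string value of the given XPath 1.0 expression (empty if no match).
    static String evaluate(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            // Malformed input or an invalid expression ends up here.
            throw new IllegalArgumentException("could not parse or evaluate", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample page; real-world HTML is often not well-formed.
        String page = "<html><body>"
                + "<img id=\"logo\" src=\"/images/logo.png\"/>"
                + "</body></html>";
        // prints /images/logo.png
        System.out.println(evaluate(page, "//img[@id='logo']/@src"));
    }
}
```

The same classes are reachable from Clojure through ordinary Java
interop, so nothing here is specific to Java as the calling language.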
For working with HTML, I've had good experiences with clojure.xml and
clojure.contrib.zip-filter.xml, using TagSoup
(http://home.ccil.org/~cowan/XML/tagsoup/) as the SAXParser in order
to deal with documents that are not well-formed XML.
If performance is your aim, you might want to investigate the
clojure/saxon library (http://github.com/pjt/saxon/tree/master),
possibly combined with TagSoup again to deal with dodgy HTML. Your
message implies that you mainly want to retrieve documents and extract
a set of data from each using relatively static expressions
(presumably the bulk of your business logic deals with processing this
data). If this is indeed the case, then you could use Saxon to load
the documents returned by your HTTP client and execute the XPaths,
which I would imagine will be faster than using zippers. You could
also, of course, simply use the javax.xml.* libraries above directly
to load the document and evaluate the XPath.
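As a rough illustration of the "relatively static expressions" case,
the sketch below compiles one XPath up front and reuses it across
several documents, using only javax.xml.*. The class
StaticXPathExample, the method extractAll, and the sample pages are
hypothetical. It makes no claim about how this compares with Saxon or
zippers for speed, and it assumes each page is already well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of any library mentioned in this thread.
class StaticXPathExample {

    // Compiles the expression once, then evaluates it against every page
    // and collects the string value found in each.
    static List<String> extractAll(String expression, List<String> pages) {
        try {
            XPathExpression compiled =
                    XPathFactory.newInstance().newXPath().compile(expression);
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            List<String> results = new ArrayList<>();
            for (String page : pages) {
                results.add(compiled.evaluate(
                        builder.parse(new InputSource(new StringReader(page)))));
            }
            return results;
        } catch (Exception e) {
            // Malformed input or an invalid expression ends up here.
            throw new IllegalArgumentException("could not parse or evaluate", e);
        }
    }

    public static void main(String[] args) {
        // Assumed stand-ins for bodies returned by an HTTP client.
        List<String> pages = List.of(
                "<html><body><img src=\"/a.png\"/></body></html>",
                "<html><body><img src=\"/b.png\"/></body></html>");
        // prints [/a.png, /b.png]
        System.out.println(extractAll("//img/@src", pages));
    }
}
```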
-DTH
On Aug 23, 2:02 am, dmix <liftedme...@gmail.com> wrote:
I am planning on migrating an app from Ruby to Clojure (for
performance and to learn Clojure), and before I proceed I wanted to
make sure a few libraries are available.
One crucial part of the app is fetching a URL to return the page's
HTML (<html><body>...etc). Then I need to grab a certain element off
the page using an XPath, for example a specific image's src attribute.
I found an HTTP client on GitHub but I haven't found any HTML parser.
Does anyone know if one exists?