Thanks for the suggestions everyone.
jsdom and cheerio look interesting -- and to think I passed over both of them.
:)
Anyways, I guess cheerio looks like it might be the better option due to the
parser it uses.
BTW, going a little bit off topic, I noticed that jsdom has a very nifty
feature: the ability to simulate a browser and run the actual javascript
inside the page. Libraries in other languages (like libxml2) can't do this,
but it seems a natural fit for node.js.
Going even further off-topic... I must say that it might be nice if cheerio
could also access javascript the way it can html/xml. Granted, maybe
not to the extent that jsdom does in simulating an entire browser, but maybe
the ability to turn the javascript into some type of "abstract syntax tree" or
"DOM"-like structure that doesn't actually run the code, but lets you access
parts of the javascript code that you might need to extract data from (just
like you can with the HTML DOM).
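To make that idea a bit more concrete, here is a rough sketch of pulling data
out of a page's inline javascript without running it. This is not a cheerio or
jsdom feature, and it is not a real syntax tree either: extractAssignedJson and
pageData are made-up names, and the sketch assumes the value is a strict JSON
literal assigned with var/let/const. A proper javascript parser would be needed
for anything less regular.

```javascript
// Hypothetical helper (not part of cheerio or jsdom): pull a JSON-compatible
// literal assigned to `varName` out of JavaScript source text, without
// executing the script.
function extractAssignedJson(source, varName) {
  // Locate "var NAME =", "let NAME =" or "const NAME =".
  // varName is assumed to be a plain identifier, so it is not regex-escaped.
  const decl = new RegExp('\\b(?:var|let|const)\\s+' + varName + '\\s*=\\s*');
  const match = decl.exec(source);
  if (!match) return null;

  const start = match.index + match[0].length;
  if (source[start] !== '{' && source[start] !== '[') return null;

  // Walk forward, tracking bracket depth and skipping over double-quoted
  // string contents, until the opening bracket is closed again.
  let depth = 0;
  let inString = false;
  for (let i = start; i < source.length; i++) {
    const ch = source[i];
    if (inString) {
      if (ch === '\\') i++;                 // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === '{' || ch === '[') {
      depth++;
    } else if (ch === '}' || ch === ']') {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(source.slice(start, i + 1));
        } catch (err) {
          return null;                      // literal was not strict JSON
        }
      }
    }
  }
  return null;                              // brackets never balanced
}

// Example input: script text as it might appear inside a <script> element.
const scriptText =
  'var unrelated = 1;\n' +
  'var pageData = {"title": "Hello", "tags": ["a", "b}"]};\n' +
  'init(pageData);';

const pageData = extractAssignedJson(scriptText, 'pageData');
console.log(pageData.title, pageData.tags.length);
```

Anything that is not strict JSON (unquoted keys, single quotes, function calls)
makes it return null rather than guess, which is about as far as this kind of
text scanning can safely go.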
Date: Sat, 4 May 2013 08:56:24 -0700
From: [email protected]
To: [email protected]
CC: [email protected]
Subject: [nodejs] Re: What libraries are available to parse web pages?
We started with jsdom and then switched to cheerio. Very happy with it.
On Friday, May 3, 2013 9:29:55 PM UTC-7, Kevin Ar18 wrote:
Not sure if this is an appropriate question to ask here, but I could not find
any other major node.js communities to ask on.
My goal is to extract data from pages on the internet.
Basically, I need to parse html pages (which might have errors, be
malformed, etc...) and turn each page into a usable DOM whose elements I can
then access using the DOM, xpath, or whatever.
In Python, I used tidylib + libxml2 to do this.
html5lib is also another library I was considering using in Python.
However, I would like to find a node.js library that can do what I need. One
that meets these requirements:
* can handle malformed html (basically the internet). This is NOT a strict
requirement, since I can pass the page through tidylib first before I send it
to the node.js library.
* presents an easy to use interface to access elements from the page (bonus if
it can access malformed parts of the page instead of just discarding them).
The basic DOM (getElementById, etc...) or an xpath command would be sufficient.
I can always build more features onto it if needed.
* BSD/MIT licensed end-to-end; no copyleft parts
* bonus for speed and being designed to work well with, or feel native to, node.js
Several questions:
1) Does anyone else have experience with using node.js in the way I just
described to extract various pieces of data from a site? If so, what
libraries do you recommend -- which do you use and which alternatives (that you
don't use) have you also considered?
2) One of the big problems is handling malformed pages. There is a
library/program called tidylib (http://tidy.sourceforge.net/libintro.html) that
I can use to generate nicely formed pages. So... if I use this library first
to clean up the page, could I use some of the stricter XML to JS libraries I
see on this page? https://github.com/joyent/node/wiki/modules#wiki-parsers-xml
I am still a little uncertain whether "XML" in the name means they will not be
good for handling HTML pages.
3) Not A Question.
Anyways, here are some libraries that caught my interest.
I found libxmljs and "HTML5 Parser for node.js" to be particularly interesting
projects.
However, I am still a little bit confused about the options available in
node.js:
libxmljs uses libxml2, which I am familiar with from Python:
https://github.com/polotek/libxmljs
html5lib site lists 2 options (http://code.google.com/p/html5lib/wiki/Ports):
"HTML5 Parser for node.js" (https://github.com/aredridel/html5) -- suggests
that it can handle any page on the internet (even malformed). Here's a quote:
"HTML5 parsing algorithm. If you find something this can't parse, I'll want
to know about it. It should make sense out of anything a browser can."
"dom.js" -- dead in experimental state?
node-o3-xml -- has GPL components, which I would like to avoid :(
node-o3-fastxml -- supposedly fast
a large list of XML parsers on this page:
https://github.com/joyent/node/wiki/modules#wiki-parsers
as far as I can tell, most of the items on that page specifically say XML and
probably expect perfect XML. Also, I'm not sure if they are even usable for
working with HTML pages (or if that even matters). As I mentioned before, I
can pass the page through tidylib first
(http://tidy.sourceforge.net/libintro.html) before sending it to one of these
parsers; but I'm still uncertain about how usable they really are for html (or
if that even matters).
Anyways,
hope it was ok to bring up this question here. Right now, I'm leaning towards
libxmljs or aredridel's html5, but I'm hoping there might be other options I am
not aware of.
Thanks,
Kevin
--
--
Job Board: http://jobs.nodejs.org/
Posting guidelines:
https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en?hl=en
---
You received this message because you are subscribed to the Google Groups
"nodejs" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.