RE: [nodejs] Re: What libraries are available to parse web pages?

Kevin Ar18 Sat, 04 May 2013 16:50:15 -0700

Thanks for the suggestions everyone.
jsdom and cheerio look interesting -- and to think I passed over both of them. 
:)
Anyways, I guess cheerio looks like it might be the better option due to the 
parser it uses.

BTW, going a little bit off topic, I noticed that jsdom has a very nifty 
feature: the ability to simulate a browser and run the actual javascript 
inside the page.  Libraries in other languages (like libxml2) can't do this, 
but it seems a natural fit for node.js.

Going even further off-topic... I must say that it might be nice if cheerio 
also had the ability to access javascript like it can html/xml.  Granted, maybe 
not to the extent that jsdom does in simulating an entire browser, but maybe 
the ability to turn the javascript into some type of "abstract syntax tree" or 
"DOM"-like structure that doesn't actually run the code, but lets you access 
parts of the javascript code that you might need to extract data from (just 
like you can with the HTML DOM).
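
For the simplest case of that idea, here is a rough sketch that pulls an object 
literal out of script source without running any of it.  It is not a real 
syntax tree: it only matches braces and hands the result to JSON.parse, so it 
assumes the literal is valid JSON.  The names extractJsonAssignment and 
pageData are made up for illustration.

```javascript
// Hypothetical helper: find `<name> = {...}` in script source and parse the
// literal as JSON, without executing the script.
// Assumption: the literal is plain JSON (double-quoted keys, no functions).
function extractJsonAssignment(source, name) {
  const marker = new RegExp('\\b' + name + '\\s*=\\s*');
  const match = marker.exec(source);
  if (!match) return null;
  const start = match.index + match[0].length;
  if (source[start] !== '{' && source[start] !== '[') return null;

  let depth = 0;
  let inString = false;
  for (let i = start; i < source.length; i++) {
    const ch = source[i];
    if (inString) {
      if (ch === '\\') i++;                 // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === '{' || ch === '[') {
      depth++;
    } else if (ch === '}' || ch === ']') {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(source.slice(start, i + 1));
        } catch (err) {
          return null;                      // not plain JSON; needs a real JS parser
        }
      }
    }
  }
  return null;                              // literal never closed
}

const script =
  'var pageData = {"title": "Hello {world}", "ids": [1, 2, 3]}; init(pageData);';
const data = extractJsonAssignment(script, 'pageData');
console.log(data.title, data.ids.length);   // prints: Hello {world} 3
```

Anything beyond plain JSON (single quotes, unquoted keys, computed values) 
would fall through to null here, which is where a proper JavaScript parser 
would have to take over.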



Date: Sat, 4 May 2013 08:56:24 -0700
From: [email protected]
To: [email protected]
CC: [email protected]
Subject: [nodejs] Re: What libraries are available to parse web pages?

We started with jsdom and then switched to cheerio.  Very happy with it. 

On Friday, May 3, 2013 9:29:55 PM UTC-7, Kevin Ar18 wrote:


Not sure if this is an appropriate question to ask here, but I could not find 
any other major node.js communities to ask on.


My goal is to extract data from pages on the internet.
Basically, I need to parse html pages (which might have errors, be 
malformed, etc...) and turn each page into a usable DOM whose elements I can 
then access using the DOM, xpath, or whatever.
In Python, I used tidylib + libxml2 to do this.
html5lib is another library I was considering using in Python.

However, I would like to find a node.js library that can do what I need.  One 
that meets these requirements:
* can handle malformed html (basically the internet).  This is NOT a strict 
requirement, since I can pass the page through tidylib first before I send it 
to the node.js library.
* presents an easy to use interface to access elements from the page (bonus if 
it can access malformed parts of the page instead of just discarding them).  
The basic DOM (getElementById, etc...) or an xpath command would be sufficient. 
 I can always build more features onto it if needed.
* BSD/MIT licensed end-to-end; no copyleft parts
* bonus for speed and being designed to work well or feel native to node.js


Several questions:
1) Does anyone else have experience with using node.js in the way I just 
described to extract various pieces of data from a site?  If so,  what 
libraries do you recommend -- which do you use and which alternatives (that you 
don't use) have you also considered?

2) One of the big problems is handling malformed pages.  There is a 
library/program called tidylib (http://tidy.sourceforge.net/libintro.html) that 
I can use to generate nicely formed pages.  So... if I use this library first 
to clean up the page, could I use some of the stricter XML to JS libraries I 
see on this page?  https://github.com/joyent/node/wiki/modules#wiki-parsers-xml

I am still a little uncertain whether "XML" in a library's description means 
it will not be good for handling HTML pages.  
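
To illustrate the concern, here is a toy nesting check of the kind a strict 
XML parser enforces.  It is only a sketch with a made-up function name 
(firstNestingError), not a real XML parser: it ignores comments, CDATA, 
doctype declarations, and attribute values that contain ">".

```javascript
// Hypothetical helper: report the first tag-nesting problem that would make
// the markup not well-formed as XML, or null if the nesting is balanced.
function firstNestingError(markup) {
  const tagPattern = /<(\/?)([A-Za-z][\w:-]*)[^>]*?(\/?)>/g;
  const open = [];                        // stack of currently open tag names
  let m;
  while ((m = tagPattern.exec(markup)) !== null) {
    const [, closing, name, selfClosing] = m;
    if (selfClosing) continue;            // <br/> is fine in XML
    if (!closing) {
      open.push(name);
    } else if (open.pop() !== name) {
      return 'unexpected </' + name + '>';
    }
  }
  return open.length ? 'unclosed <' + open[open.length - 1] + '>' : null;
}

console.log(firstNestingError('<p>one<br/></p>'));       // prints: null
console.log(firstNestingError('<p>one<br><p>two</p>'));  // prints: unclosed <br>
```

The second input is ordinary HTML that a browser handles without complaint, 
yet it fails this check because <br> and the first <p> are never closed.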


3) Not A Question.
Anyways, here are some libraries that caught my interest.
I found libxmljs and "HTML5 Parser for node.js" to be particularly interesting 
projects.

However, I am still a little bit confused about the options available in 
node.js:

libxmljs uses libxml2, which I am familiar with from Python: 
https://github.com/polotek/libxmljs

html5lib site lists 2 options (http://code.google.com/p/html5lib/wiki/Ports):
"HTML5 Parser for node.js" (https://github.com/aredridel/html5) -- suggests 
that it can handle any page on the internet (even malformed)   Here's a quote: 
"HTML5 parsing algorithm. If you find something this can't parse, I'll want
to know about it. It should make sense out of anything a browser can."
"dom.js" -- dead in experimental state?
node-o3-xml -- has GPL components, which I would like to avoid :(
node-o3-fastxml -- supposedly fast
a large list of XML parsers on this page: 
https://github.com/joyent/node/wiki/modules#wiki-parsers
as far as I can tell, most of the items on that page specifically say XML and 
probably expect perfect XML.  Also, I'm not sure if they are even usable for 
working with HTML pages (or if that even matters).  As I mentioned before, I 
can pass the page through tidylib first 
(http://tidy.sourceforge.net/libintro.html) before sending it to one of these 
parsers; but I'm still uncertain about how usable they really are for html (or 
if that even matters).


Anyways,
hope it was ok to bring up this question here.  Right now, I'm leaning towards 
libxmljs or aredridel's html5, but I'm hoping there might be other options I am 
not aware of.

Thanks,
Kevin
                                          





-- 

-- 

Job Board: http://jobs.nodejs.org/

Posting guidelines: 
https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines

You received this message because you are subscribed to the Google

Groups "nodejs" group.

To post to this group, send email to [email protected]

To unsubscribe from this group, send email to

[email protected]

For more options, visit this group at

http://groups.google.com/group/nodejs?hl=en?hl=en

 

--- 

You received this message because you are subscribed to the Google Groups 
"nodejs" group.

To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].

For more options, visit https://groups.google.com/groups/opt_out.

 

 

                                          
