Not sure if this is an appropriate question to ask here, but I could not find 
any other major node.js communities to ask on.


My goal is to extract data from pages on the internet.
Basically, I need to parse HTML pages (which might have errors, be 
malformed, etc.) and turn each page into a usable DOM whose elements I can 
then access using DOM methods, XPath, or whatever.
In Python, I used tidylib + libxml2 to do this.
html5lib is another library I was considering using in Python.

However, I would like to find a node.js library that can do what I need.  One 
that meets these requirements:
* can handle malformed html (basically the internet).  This is NOT a strict 
requirement, since I can pass the page through tidylib first before I send it 
to the node.js library.
* presents an easy-to-use interface to access elements from the page (bonus if 
it can access malformed parts of the page instead of just discarding them).  
The basic DOM (getElementById, etc.) or an XPath query would be sufficient.  
I can always build more features onto it if needed.
* BSD/MIT licensed end-to-end; no copyleft parts
* bonus for speed and for being designed to work well with, or feel native to, 
node.js


Several questions:
1) Does anyone else have experience with using node.js in the way I just 
described to extract various pieces of data from a site?  If so,  what 
libraries do you recommend -- which do you use and which alternatives (that you 
don't use) have you also considered?

2) One of the big problems is handling malformed pages.  There is a 
library/program called tidylib (http://tidy.sourceforge.net/libintro.html) that 
I can use to generate nicely formed pages.  So... if I use this library first 
to clean up the page, could I use some of the stricter XML to JS libraries I 
see on this page?  https://github.com/joyent/node/wiki/modules#wiki-parsers-xml

I am still a little uncertain whether a library described as "XML" will be 
unsuitable for handling HTML pages.


3) Not A Question.
Anyways, here are some libraries that caught my interest.
I found libxmljs and "HTML5 Parser for node.js" to be particularly interesting 
projects.

However, I am still a little bit confused about the options available in 
node.js:

libxmljs uses libxml2, which I am familiar with from Python: 
https://github.com/polotek/libxmljs

The html5lib site lists 2 options (http://code.google.com/p/html5lib/wiki/Ports):
* "HTML5 Parser for node.js" (https://github.com/aredridel/html5) -- suggests 
that it can handle any page on the internet (even malformed ones).  Here's a 
quote: 
"HTML5 parsing algorithm. If you find something this can't parse, I'll want
to know about it. It should make sense out of anything a browser can."
* "dom.js" -- dead in an experimental state?

Other libraries I came across:
* node-o3-xml -- has GPL components, which I would like to avoid :(
* node-o3-fastxml -- supposedly fast
* a large list of XML parsers on this page: 
https://github.com/joyent/node/wiki/modules#wiki-parsers

As far as I can tell, most of the items on that page specifically say XML and 
probably expect perfect XML.  As I mentioned before, I can pass the page 
through tidylib first (http://tidy.sourceforge.net/libintro.html) before 
sending it to one of these parsers, but I'm still uncertain about how usable 
they really are for HTML (or if that even matters).
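
To make the "perfect XML" concern concrete, here is a toy sketch of the kind 
of tag-balance check a strict XML parser effectively enforces.  It is NOT 
taken from any of the libraries above; isWellFormed is a hypothetical helper I 
made up for illustration, and it ignores comments, CDATA, doctypes and 
attribute quoting entirely.

```javascript
// Toy illustration only: a crude tag-balance check, not a real XML parser.
// isWellFormed is a hypothetical helper, not part of any library listed above.
function isWellFormed(markup) {
  // Matches opening, closing and self-closing tags; attributes are skipped.
  const tagPattern = /<(\/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(\/?)>/g;
  const openTags = [];
  let match;

  while ((match = tagPattern.exec(markup)) !== null) {
    const isClosing = match[1] === '/';
    const name = match[2];
    const isSelfClosing = match[3] === '/';

    if (isSelfClosing) {
      continue; // <br/> opens and closes itself
    }
    if (!isClosing) {
      openTags.push(name); // remember the tag until its close shows up
      continue;
    }
    // A closing tag must match the most recently opened tag.
    if (openTags.pop() !== name) {
      return false;
    }
  }
  // Anything left open means the markup is not well formed.
  return openTags.length === 0;
}

// Typical "tag soup" that browsers accept but a strict check rejects:
console.log(isWellFormed('<ul><li>one<li>two</ul>'));
// The same content after a tidy-style cleanup pass:
console.log(isWellFormed('<ul><li>one</li><li>two</li></ul>'));
```

If the XML libraries only need this kind of strictness, then running tidylib 
first might be enough; whether they then expose HTML pages in a useful way is 
the part I can't tell from their descriptions.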


Anyways,
I hope it was OK to bring up this question here.  Right now, I'm leaning 
towards libxmljs or aredridel's html5, but I'm hoping there might be other 
options I am not aware of.

Thanks,
Kevin
                                          
