We started with jsdom and then switched to cheerio.  Very happy with it. 
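
For anyone reading this later, here is a minimal sketch of what pulling links out of a page with cheerio can look like. It assumes cheerio is installed from npm; `extractLinks` and `toAbsolute` are hypothetical helper names made up for illustration, not part of cheerio, and this is not code from our project.

```javascript
// Sketch only: assumes `npm install cheerio`.
// extractLinks and toAbsolute are hypothetical helper names, not cheerio APIs.

// Resolve a possibly-relative href against the page URL using Node's
// built-in URL class. Returns null if the href cannot be parsed.
function toAbsolute(href, pageUrl) {
  try {
    return new URL(href, pageUrl).href;
  } catch (err) {
    return null;
  }
}

// Parse (possibly sloppy) HTML and return { text, url } for every <a href>.
function extractLinks(html, pageUrl) {
  // Required here rather than at the top so toAbsolute can be used
  // on its own without the dependency installed.
  const cheerio = require('cheerio');
  const $ = cheerio.load(html);
  return $('a[href]')
    .map((i, el) => ({
      text: $(el).text().trim(),
      url: toAbsolute($(el).attr('href'), pageUrl),
    }))
    .get(); // convert the cheerio collection into a plain array
}
```

A call would look like `extractLinks(html, 'http://example.com/page.html')`, where the URL is a placeholder. cheerio is tolerant of sloppy markup, but how closely the recovered tree matches what a browser builds depends on the version and parser it is configured with, so it is worth checking against real pages.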

On Friday, May 3, 2013 9:29:55 PM UTC-7, Kevin Ar18 wrote:
>
> Not sure if this is an appropriate question to ask here, but I could not 
> find any other major node.js communities to ask on.
>
>
> My goal is to extract data from pages on the internet.
> Basically, I need to parse html pages (which might have errors, be 
> malformed, etc...) and turn each page into a usable DOM whose elements I 
> can then access using the DOM API, xpath, or whatever.
> In Python, I used tidylib + libxml2 to do this.
> html5lib is also another library I was considering using in Python.
>
> However, I would like to find a node.js library that can do what I need.  
> One that meets these requirements:
> * can handle malformed html (basically the internet).  This is NOT a 
> strict requirement, since I can pass the page through tidylib first before 
> I send it to the node.js library.
> * presents an easy to use interface to access elements from the page 
> (bonus if it can access malformed parts of the page instead of just 
> discarding them).  The basic DOM (getElementById, etc...) or an xpath 
> command would be sufficient.  I can always build more features onto it if 
> needed.
> * BSD/MIT licensed end-to-end; no copyleft parts
> * bonus for speed and being designed to work well or feel native to node.js
>
>
> Several questions:
> 1) Does anyone else have experience with using node.js in the way I just 
> described to extract various pieces of data from a site?  If so,  what 
> libraries do you recommend -- which do you use and which alternatives (that 
> you don't use) have you also considered?
>
> 2) One of the big problems is handling malformed pages.  There is a 
> library/program called tidylib (http://tidy.sourceforge.net/libintro.html) 
> that I can use to generate nicely formed pages.  So... if I use this 
> library first to clean up the page, could I use some of the stricter XML to 
> JS libraries I see on this page?  
> https://github.com/joyent/node/wiki/modules#wiki-parsers-xml
>
> I am still a little uncertain whether a library that says "XML" will be 
> unsuitable for handling HTML pages.
>
>
> 3) Not A Question.
> Anyways, here are some libraries that caught my interest.
> I found *libxmljs* and "*HTML5 Parser for node.js*" to be particularly 
> interesting projects.
>
> However, I am still a little bit confused about the options available in 
> node.js:
>
> *libxmljs* uses libxml2, which I am familiar with from Python: 
> https://github.com/polotek/libxmljs
>
> html5lib site lists 2 options (
> http://code.google.com/p/html5lib/wiki/Ports):
> *"HTML5 Parser for node.js"* (https://github.com/aredridel/html5) -- 
> suggests that it can handle any page on the internet (even malformed)   
> Here's a quote: "HTML5 parsing algorithm. If you find something this can't 
> parse, I'll want to know about it. It should make sense out of anything a 
> browser can."
> "dom.js" -- dead in experimental state?
> node-o3-xml -- has GPL components, which I would like to avoid :(
> *node-o3-fastxml* -- supposedly fast
> There is also a large list of XML parsers on this page: 
> https://github.com/joyent/node/wiki/modules#wiki-parsers
> As far as I can tell, most of the items on that page specifically say XML 
> and probably expect perfect XML.  Also, I'm not sure if they are even 
> usable for working with HTML pages (or if that even matters).  As I 
> mentioned before, I can pass the page through tidylib first (
> http://tidy.sourceforge.net/libintro.html) before sending it to one of 
> these parsers; but I'm still uncertain about how usable they really are for 
> html (or if that even matters).
>
>
> Anyways,
> hope it was ok to bring up this question here.  Right now, I'm leaning 
> towards libxmljs or aredridel's html5, but I'm hoping there might be other 
> options I am not aware of.
>
> Thanks,
> Kevin
>  

-- 
Job Board: http://jobs.nodejs.org/
Posting guidelines: 
https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en?hl=en
