Another upvote for cheerio. It doesn't support all jQuery selectors, but it's a good enough subset.
On Sat, May 4, 2013 at 11:56 AM, George Snelling <[email protected]> wrote:
> We started with jsdom and then switched to cheerio. Very happy with it.
>
> On Friday, May 3, 2013 9:29:55 PM UTC-7, Kevin Ar18 wrote:
>>
>> Not sure if this is an appropriate question to ask here, but I could not
>> find any other major node.js communities to ask on.
>>
>> My goal is to extract data from pages on the internet.
>> Basically, I need to parse html pages (which might have errors, be
>> malformed, etc...) and turn each page into a usable DOM whose elements I
>> can then access using the DOM, xpath, or whatever.
>> In Python, I used tidylib + libxml2 to do this.
>> html5lib is another library I was considering using in Python.
>>
>> However, I would like to find a node.js library that can do what I need.
>> One that meets these requirements:
>> * can handle malformed html (basically the internet). This is NOT a
>>   strict requirement, since I can pass the page through tidylib first
>>   before I send it to the node.js library.
>> * presents an easy to use interface to access elements from the page
>>   (bonus if it can access malformed parts of the page instead of just
>>   discarding them). The basic DOM (getElementById, etc...) or an xpath
>>   command would be sufficient. I can always build more features onto it
>>   if needed.
>> * BSD/MIT licensed end-to-end; no copyleft parts
>> * bonus for speed and being designed to work well or feel native to
>>   node.js
>>
>> Several questions:
>> 1) Does anyone else have experience with using node.js in the way I just
>> described to extract various pieces of data from a site? If so, what
>> libraries do you recommend -- which do you use and which alternatives
>> (that you don't use) have you also considered?
>>
>> 2) One of the big problems is handling malformed pages.
>> There is a library/program called tidylib
>> (http://tidy.sourceforge.net/libintro.html) that I can use to generate
>> nicely formed pages. So... if I use this library first to clean up the
>> page, could I use some of the stricter XML to JS libraries I see on this
>> page? https://github.com/joyent/node/wiki/modules#wiki-parsers-xml
>>
>> I am still a little uncertain whether "XML" in the description means
>> they will not be good for handling HTML pages.
>>
>> 3) Not A Question.
>> Anyways, here are some libraries that caught my interest.
>> I found *libxmljs* and "*HTML5 Parser for node.js*" to be particularly
>> interesting projects.
>>
>> However, I am still a little bit confused about the options available in
>> node.js:
>>
>> *libxmljs* uses libxml2, which I am familiar with from Python:
>> https://github.com/polotek/libxmljs
>>
>> The html5lib site lists 2 options
>> (http://code.google.com/p/html5lib/wiki/Ports):
>> *"HTML5 Parser for node.js"* (https://github.com/aredridel/html5)
>> -- suggests that it can handle any page on the internet (even malformed).
>> Here's a quote: "HTML5 parsing algorithm. If you find something this
>> can't parse, I'll want to know about it. It should make sense out of
>> anything a browser can."
>> "dom.js" -- dead in experimental state?
>>
>> node-o3-xml -- has GPL components, which I would like to avoid :(
>> *node-o3-fastxml* -- supposedly fast
>> *a large list of XML parsers on this page:
>> https://github.com/joyent/node/wiki/modules#wiki-parsers*
>> As far as I can tell, most of the items on that page specifically say
>> XML and probably expect perfect XML. Also, I'm not sure if they are even
>> usable for working with HTML pages (or if that even matters).
>> As I mentioned before, I can pass the page through tidylib first
>> (http://tidy.sourceforge.net/libintro.html) before sending it to one of
>> these parsers; but I'm still uncertain about how usable they really are
>> for html (or if that even matters).
>>
>> Anyways, hope it was ok to bring up this question here. Right now, I'm
>> leaning towards libxmljs or aredridel's html5, but I'm hoping there might
>> be other options I am not aware of.
>>
>> Thanks,
>> Kevin
>>
> --
> --
> Job Board: http://jobs.nodejs.org/
> Posting guidelines:
> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
> You received this message because you are subscribed to the Google
> Groups "nodejs" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/nodejs?hl=en?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "nodejs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
