You might also want to consider stepping out of node.js land and taking a look at phantom.js for this: http://phantomjs.org/
Phantom.js is pretty powerful for just this, since it's a full-blown browser which can run as a server. At the moment phantom.js is built on Qt's WebKit, but if I'm not wrong, they will be moving to Chrome in the near future.

Trygve

On Sat, 2013-05-04 at 19:50 -0400, Kevin Ar18 wrote:
> Thanks for the suggestions everyone.
> jsdom and cheerio look interesting -- and to think I passed over both
> of them. :)
> Anyways, I guess cheerio looks like it might be the better option due
> to the parser it uses.
>
> BTW, going a little bit off topic, I noticed that jsdom has a very
> nifty feature: the ability to simulate a browser, and run the actual
> javascript inside the page. Libraries in other languages (like
> libxml2) can't do this, but it seems a natural fit for node.js.
>
> Going even further off-topic... I must say that it might be nice if
> cheerio also had the ability to access javascript like it can
> html/xml. Granted, maybe not to the extent that jsdom does in
> simulating an entire browser, but maybe the ability to turn the
> javascript into some type of "abstract syntax tree" or "DOM"-like
> structure that doesn't actually run the code, but lets you access
> parts of the javascript code that you might need to extract data from
> (just like you can with the HTML DOM).
>
> ______________________________________________________________________
> Date: Sat, 4 May 2013 08:56:24 -0700
> From: [email protected]
> To: [email protected]
> CC: [email protected]
> Subject: [nodejs] Re: What libraries are available to parse web pages?
>
> We started with jsdom and then switched to cheerio. Very happy with
> it.
>
> On Friday, May 3, 2013 9:29:55 PM UTC-7, Kevin Ar18 wrote:
> Not sure if this is an appropriate question to ask here, but I
> could not find any other major node.js communities to ask on.
>
> My goal is to extract data from pages on the internet.
> Basically, I need to parse html pages (which might have
> errors, be malformed, etc...), turn the page into a usable DOM
> whose elements I can then access using the DOM, xpath, or
> whatever.
> In Python, I used tidylib + libxml2 to do this.
> html5lib is another library I was considering using in
> Python.
>
> However, I would like to find a node.js library that can do
> what I need. One that meets these requirements:
> * can handle malformed html (basically the internet). This is
>   NOT a strict requirement, since I can pass the page through
>   tidylib first before I send it to the node.js library.
> * presents an easy to use interface to access elements from
>   the page (bonus if it can access malformed parts of the page
>   instead of just discarding them). The basic DOM
>   (getElementById, etc...) or an xpath command would be
>   sufficient. I can always build more features onto it if
>   needed.
> * BSD/MIT licensed end-to-end; no copyleft parts
> * bonus for speed and being designed to work well or feel native
>   to node.js
>
> Several questions:
> 1) Does anyone else have experience with using node.js in the
> way I just described to extract various pieces of data from a
> site? If so, what libraries do you recommend -- which do you
> use and which alternatives (that you don't use) have you also
> considered?
>
> 2) One of the big problems is handling malformed pages. There
> is a library/program called tidylib
> (http://tidy.sourceforge.net/libintro.html) that I can use to
> generate nicely formed pages. So... if I use this library
> first to clean up the page, could I use some of the stricter
> XML to JS libraries I see on this page?
> https://github.com/joyent/node/wiki/modules#wiki-parsers-xml
>
> I am still a little uncertain whether a library saying "XML" means
> it will not be good for handling HTML pages.
>
> 3) Not A Question.
> Anyways, here are some libraries that caught my interest.
> I found libxmljs and "HTML5 Parser for node.js" to be
> particularly interesting projects.
>
> However, I am still a little bit confused about the options
> available in node.js:
>
> libxmljs uses libxml2, which I am familiar with from Python:
> https://github.com/polotek/libxmljs
>
> html5lib site lists 2 options
> (http://code.google.com/p/html5lib/wiki/Ports):
>   "HTML5 Parser for node.js" (https://github.com/aredridel/html5)
>   -- suggests that it can handle any page on the internet (even
>   malformed)
>   Here's a quote: "HTML5 parsing algorithm. If you find
>   something this can't parse, I'll want to know about it. It
>   should make sense out of anything a browser can."
>   "dom.js" -- dead in experimental state?
> node-o3-xml -- has GPL components, which I would like to
> avoid :(
> node-03-fastxml -- supposedly fast
> a large list of XML parsers on this page:
> https://github.com/joyent/node/wiki/modules#wiki-parsers
> as far as I can tell, most of the items on that page
> specifically say XML and probably expect perfect XML. Also,
> I'm not sure if they are even usable for working with HTML
> pages (or if that even matters). As I mentioned before, I can
> pass the page through tidylib first
> (http://tidy.sourceforge.net/libintro.html) before sending it
> to one of these parsers; but I'm still uncertain about how
> usable they really are for html (or if that even matters).
>
> Anyways,
> hope it was ok to bring up this question here. Right now, I'm
> leaning towards libxmljs or aredridel's html5, but I'm hoping
> there might be other options I am not aware of.
>
> Thanks,
> Kevin

--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google Groups "nodejs" group.
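For the idea raised in the thread of reading data out of a page's JavaScript without running it, here is a rough sketch in plain node.js. It is not how cheerio or jsdom work. The helper name extractAssignedJson and the sample page are made up for illustration, and the sketch assumes the data is assigned to a simply named variable as a strict JSON object literal inside an inline <script>. It uses a regex plus brace counting as a much simpler stand-in for a real JavaScript parser, which would build an actual syntax tree.

```javascript
// Hypothetical helper (not part of any library): find `varName = { ... }`
// in the page's inline scripts and parse the literal as JSON.
// None of the page's code is executed.
function extractAssignedJson(html, varName) {
  // A regex is enough to pull out inline script bodies for a sketch;
  // a real HTML parser would be sturdier on malformed pages.
  const scriptRe = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  // Assumes varName is a plain identifier with no regex metacharacters.
  const assignRe = new RegExp("\\b" + varName + "\\s*=\\s*(?=\\{)");

  let script;
  while ((script = scriptRe.exec(html)) !== null) {
    const body = script[1];
    const assign = assignRe.exec(body);
    if (assign === null) continue;

    // Index of the opening "{" of the object literal.
    const start = assign.index + assign[0].length;
    let depth = 0;
    let inString = false;

    // Walk forward to the matching "}", ignoring braces inside
    // double-quoted strings.
    for (let i = start; i < body.length; i++) {
      const ch = body[i];
      if (inString) {
        if (ch === "\\") i++; // skip the escaped character
        else if (ch === '"') inString = false;
      } else if (ch === '"') {
        inString = true;
      } else if (ch === "{") {
        depth++;
      } else if (ch === "}") {
        depth--;
        if (depth === 0) {
          // Throws if the literal is not strict JSON
          // (single quotes, unquoted keys, trailing commas).
          return JSON.parse(body.slice(start, i + 1));
        }
      }
    }
  }
  return null; // no matching assignment found
}

// Made-up sample page for illustration.
const samplePage =
  '<html><body><script>var config = {"user": "kevin", "n": {"x": 1}};' +
  "</script></body></html>";
console.log(extractAssignedJson(samplePage, "config"));
```

Anything that is not strict JSON (function calls, computed values, single-quoted strings) falls outside this sketch and would need a proper JavaScript parser.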
