We started with jsdom and then switched to cheerio. Very happy with it.

On Friday, May 3, 2013 9:29:55 PM UTC-7, Kevin Ar18 wrote:
>
> Not sure if this is an appropriate question to ask here, but I could not
> find any other major node.js communities to ask on.
>
>
> My goal is to extract data from pages on the internet.
> Basically, I need to parse html pages (which might have errors, be
> malformed, etc...) and turn each page into a usable DOM whose elements I
> can then access using the DOM, xpath, or whatever.
> In Python, I used tidylib + libxml2 to do this.
> html5lib is another library I was considering using in Python.
>
> However, I would like to find a node.js library that can do what I need.
> One that meets these requirements:
> * can handle malformed html (basically the internet). This is NOT a
> strict requirement, since I can pass the page through tidylib first before
> I send it to the node.js library.
> * presents an easy to use interface to access elements from the page
> (bonus if it can access malformed parts of the page instead of just
> discarding them). The basic DOM (getElementById, etc...) or an xpath
> command would be sufficient. I can always build more features onto it if
> needed.
> * BSD/MIT licensed end-to-end; no copyleft parts
> * bonus for speed and being designed to work well or feel native to node.js
>
>
> Several questions:
> 1) Does anyone else have experience with using node.js in the way I just
> described to extract various pieces of data from a site? If so, what
> libraries do you recommend -- which do you use and which alternatives (that
> you don't use) have you also considered?
>
> 2) One of the big problems is handling malformed pages. There is a
> library/program called tidylib (http://tidy.sourceforge.net/libintro.html)
> that I can use to generate nicely formed pages. So... if I use this
> library first to clean up the page, could I use some of the stricter XML to
> JS libraries I see on this page?
> https://github.com/joyent/node/wiki/modules#wiki-parsers-xml
>
> I am still a little uncertain whether "XML" in the description means they
> will not be good for handling HTML pages.
>
>
> 3) Not A Question.
> Anyways, here are some libraries that caught my interest.
> I found *libxmljs* and "*HTML5 Parser for node.js*" to be particularly
> interesting projects.
>
> However, I am still a little bit confused about the options available in
> node.js:
>
> *libxmljs* uses libxml2, which I am familiar with from Python:
> https://github.com/polotek/libxmljs
>
> html5lib site lists 2 options
> (http://code.google.com/p/html5lib/wiki/Ports):
> *"HTML5 Parser for node.js"* (https://github.com/aredridel/html5) --
> suggests that it can handle any page on the internet (even malformed)
> Here's a quote: "HTML5 parsing algorithm. If you find something this can't
> parse, I'll want to know about it. It should make sense out of anything a
> browser can."
> "dom.js" -- dead in experimental state?
>
> node-o3-xml -- has GPL components, which I would like to avoid :(
> *node-o3-fastxml* -- supposedly fast
>
> *a large list of XML parsers on this page:
> https://github.com/joyent/node/wiki/modules#wiki-parsers*
> As far as I can tell, most of the items on that page specifically say XML
> and probably expect perfect XML. Also, I'm not sure if they are even
> usable for working with HTML pages (or if that even matters). As I
> mentioned before, I can pass the page through tidylib first
> (http://tidy.sourceforge.net/libintro.html) before sending it to one of
> these parsers; but I'm still uncertain about how usable they really are for
> html (or if that even matters).
>
>
> Anyways,
> hope it was ok to bring up this question here. Right now, I'm leaning
> towards libxmljs or aredridel's html5, but I'm hoping there might be other
> options I am not aware of.
>
> Thanks,
> Kevin
>
