I'm not sure whether this is an appropriate question to ask here, but I could not find any other major node.js community to ask in.
My goal is to extract data from pages on the internet. Basically, I need to parse HTML pages (which might have errors, be malformed, etc.) and turn each page into a usable DOM whose elements I can then access using the DOM, XPath, or whatever.

In Python, I used tidylib + libxml2 to do this. html5lib is another library I was considering using in Python. However, I would like to find a node.js library that can do what I need, one that meets these requirements:

* can handle malformed HTML (basically the internet). This is NOT a strict requirement, since I can pass the page through tidylib first before I send it to the node.js library.
* presents an easy-to-use interface for accessing elements from the page (bonus if it can access malformed parts of the page instead of just discarding them). The basic DOM (getElementById, etc.) or an XPath command would be sufficient. I can always build more features onto it if needed.
* BSD/MIT licensed end-to-end; no copyleft parts
* bonus for speed and for being designed to work well with, or feel native to, node.js

Several questions:

1) Does anyone else have experience with using node.js in the way I just described to extract various pieces of data from a site? If so, what libraries do you recommend? Which do you use, and which alternatives (that you don't use) have you also considered?

2) One of the big problems is handling malformed pages. There is a library/program called tidylib (http://tidy.sourceforge.net/libintro.html) that I can use to generate nicely formed pages. So, if I use this library first to clean up the page, could I use some of the stricter XML-to-JS libraries I see on this page? https://github.com/joyent/node/wiki/modules#wiki-parsers-xml I am still a little uncertain whether a library that says "XML" will be any good for handling HTML pages.

3) Not a question. Anyways, here are some libraries that caught my interest. I found libxmljs and "HTML5 Parser for node.js" to be particularly interesting projects.
However, I am still a little bit confused about the options available in node.js:

* libxmljs uses libxml2, which I am familiar with from Python: https://github.com/polotek/libxmljs
* The html5lib site lists 2 options (http://code.google.com/p/html5lib/wiki/Ports):
  * "HTML5 Parser for node.js" (https://github.com/aredridel/html5), which suggests that it can handle any page on the internet (even malformed). Here's a quote: "HTML5 parsing algorithm. If you find something this can't parse, I'll want to know about it. It should make sense out of anything a browser can."
  * "dom.js": dead in an experimental state?
* node-o3-xml: has GPL components, which I would like to avoid :(
* node-o3-fastxml: supposedly fast
* a large list of XML parsers on this page: https://github.com/joyent/node/wiki/modules#wiki-parsers

As far as I can tell, most of the items on that page specifically say XML and probably expect perfect XML. Also, I'm not sure if they are even usable for working with HTML pages (or if that even matters). As I mentioned before, I can pass the page through tidylib first (http://tidy.sourceforge.net/libintro.html) before sending it to one of these parsers, but I'm still uncertain about how usable they really are for HTML (or if that even matters).

Anyways, I hope it was OK to bring up this question here. Right now, I'm leaning towards libxmljs or aredridel's html5, but I'm hoping there might be other options I am not aware of.

Thanks,
Kevin

--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google Groups "nodejs" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/nodejs?hl=en?hl=en
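To make question 2 above a bit more concrete, here is a rough sketch of the kind of nesting check a strict XML parser effectively enforces, and why typical "tag soup" HTML would fail it unless it is cleaned up first (for example with tidylib). This is only an illustration under simplifying assumptions: `isWellFormed` is a hypothetical helper name, not part of libxmljs, html5, or any other library mentioned here, and it ignores comments, CDATA sections, and attribute values containing `>`.

```javascript
// Rough well-formedness check: every opening tag must be closed, in order,
// which is roughly what a strict XML parser demands before building a tree.
// Hypothetical helper for illustration only; NOT a real HTML or XML parser.
function isWellFormed(markup) {
  // Matches <name ...>, </name>, and <name ... />; skips <!DOCTYPE ...> and comments' "<!".
  const tagPattern = /<(\/?)([A-Za-z][\w:-]*)([^>]*)>/g;
  const openTags = [];
  let match;
  while ((match = tagPattern.exec(markup)) !== null) {
    const isClosing = match[1] === '/';
    const name = match[2];
    const isSelfClosing = match[3].trim().endsWith('/');
    if (isClosing) {
      // A closing tag must match the most recently opened tag.
      if (openTags.pop() !== name) return false;
    } else if (!isSelfClosing) {
      openTags.push(name);
    }
  }
  // Anything still open at the end was never closed.
  return openTags.length === 0;
}

// Typical browser-tolerated HTML: <li> and <br> are never closed.
console.log(isWellFormed('<ul><li>one<li>two</ul>'));            // false
console.log(isWellFormed('<p>line<br>break</p>'));               // false

// The same content after the kind of cleanup a tidy step is meant to perform.
console.log(isWellFormed('<ul><li>one</li><li>two</li></ul>'));  // true
console.log(isWellFormed('<p>line<br />break</p>'));             // true
```

If a page fails a check like this, a strict XML library would presumably reject it or stop at the first error, so it would need the tidy pass first; a parser that implements the HTML5 parsing algorithm should be able to take the first two inputs as they are.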
