You might also want to consider stepping out of node.js land and taking a
look at PhantomJS for this: http://phantomjs.org/

PhantomJS is pretty powerful for exactly this, since it's a full-blown
browser which can run as a server. At the moment PhantomJS is built on
Qt's WebKit, but if I'm not wrong, they will be moving to Chrome in the
near future.
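
To make the suggestion a bit more concrete, here is a rough sketch of
driving PhantomJS from node.js through a child process. It assumes a
phantomjs binary is installed and on the PATH; buildPhantomScript and
scrape are made-up helper names, not part of any library. The generated
script only relies on PhantomJS's webpage module (create, open, evaluate)
and phantom.exit.

```javascript
// Sketch only: assumes a `phantomjs` executable is available on the PATH.
const { execFile } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: builds the source of a PhantomJS script that loads
// `url`, runs `selector` inside the loaded page and prints the matched
// text as one line of JSON on stdout.
function buildPhantomScript(url, selector) {
  return [
    "var page = require('webpage').create();",
    'page.open(' + JSON.stringify(url) + ', function (status) {',
    "  if (status !== 'success') {",
    "    console.log(JSON.stringify({ error: 'load failed' }));",
    '    phantom.exit(1);',
    '    return;',
    '  }',
    '  // This function runs inside the page, in the real browser engine.',
    '  var texts = page.evaluate(function (sel) {',
    '    var nodes = document.querySelectorAll(sel);',
    '    return Array.prototype.map.call(nodes, function (n) {',
    '      return n.textContent;',
    '    });',
    '  }, ' + JSON.stringify(selector) + ');',
    '  console.log(JSON.stringify({ texts: texts }));',
    '  phantom.exit(0);',
    '});'
  ].join('\n');
}

// Hypothetical helper: writes the script to a temp file, runs it with
// PhantomJS and passes the parsed JSON result to `callback(err, result)`.
function scrape(url, selector, callback) {
  const file = path.join(
    os.tmpdir(),
    'phantom-scrape-' + process.pid + '-' + Date.now() + '.js'
  );
  fs.writeFileSync(file, buildPhantomScript(url, selector));
  execFile('phantomjs', [file], function (err, stdout) {
    fs.unlinkSync(file); // clean up the temporary script
    if (err) return callback(err);
    let result;
    try {
      result = JSON.parse(stdout);
    } catch (parseError) {
      return callback(parseError);
    }
    callback(null, result);
  });
}
```

Something like scrape('http://example.com/', 'h1', function (err, result)
{ ... }) would then hand back the text of every matching element once the
page has loaded in a real browser engine. Going through a temp file and
JSON on stdout keeps the node.js side free of any extra modules.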

Trygve



On Sat, 2013-05-04 at 19:50 -0400, Kevin Ar18 wrote:
> Thanks for the suggestions everyone.
> jsdom and cheerio look interesting -- and to think I passed over both
> of them. :)
> Anyways, I guess cheerio looks like it might be the better option due
> to the parser it uses.
> 
> BTW, going a little bit off topic, I noticed that jsdom has a very
> nifty feature: the ability to simulate a browser and run the actual
> javascript inside the page.  Libraries in other languages (like
> libxml2) can't do this, but it seems a natural fit for node.js.
> 
> Going even further off-topic... I must say, that it might be nice if
> cheerio also had the ability to access javascript like it can
> html/xml.  Granted, maybe not to the extent that jsdom does in
> simulating an entire browser, but maybe the ability to turn the
> javascript into some type of "abstract syntax tree" or "DOM"-like
> structure that doesn't actually run the code, but lets you access
> parts of the javascript code that you might need to extract data from
> (just like you can with the HTML DOM).
> 
> 
> 
> 
> 
> ______________________________________________________________________
> Date: Sat, 4 May 2013 08:56:24 -0700
> From: [email protected]
> To: [email protected]
> CC: [email protected]
> Subject: [nodejs] Re: What libraries are available to parse web pages?
> 
> We started with jsdom and then switched to cheerio.  Very happy with
> it. 
> 
> On Friday, May 3, 2013 9:29:55 PM UTC-7, Kevin Ar18 wrote:
>         Not sure if this is an appropriate question to ask here, but I
>         could not find any other major node.js communities to ask on.
>         
>         
>         My goal is to extract data from pages on the internet.
>         Basically, I need to parse html pages (which might have
>         errors, be malformed, etc...) and turn each page into a usable
>         DOM whose elements I can then access using the DOM API, xpath,
>         or whatever.
>         In Python, I used tidylib + libxml2 to do this.
>         html5lib is also another library I was considering using in
>         Python.
>         
>         However, I would like to find a node.js library that can do
>         what I need.  One that meets these requirements:
>         * can handle malformed html (basically the internet).  This is
>         NOT a strict requirement, since I can pass the page through
>         tidylib first before I send it to the node.js library.
>         * presents an easy to use interface to access elements from
>         the page (bonus if it can access malformed parts of the page
>         instead of just discarding them).  The basic DOM
>         (getElementById, etc...) or an xpath command would be
>         sufficient.  I can always build more features onto it if
>         needed.
>         * BSD/MIT licensed end-to-end; no copyleft parts
>         * bonus for speed and being designed to work well or feel native
>         to node.js
>         
>         
>         Several questions:
>         1) Does anyone else have experience with using node.js in the
>         way I just described to extract various pieces of data from a
>         site?  If so,  what libraries do you recommend -- which do you
>         use and which alternatives (that you don't use) have you also
>         considered?
>         
>         2) One of the big problems is handling malformed pages.  There
>         is a library/program called tidylib
>         (http://tidy.sourceforge.net/libintro.html) that I can use to
>         generate nicely formed pages.  So... if I use this library
>         first to clean up the page, could I use some of the stricter
>         XML to JS libraries I see on this page?
>         https://github.com/joyent/node/wiki/modules#wiki-parsers-xml
>         
>         I am still a little uncertain whether "XML" in the name means
>         they will not be good for handling HTML pages.
>         
>         
>         3) Not A Question.
>         Anyways, here are some libraries that caught my interest.
>         I found libxmljs and "HTML5 Parser for node.js" to be
>         particularly interesting projects.
>         
>         However, I am still a little bit confused about the options
>         available in node.js:
>         
>         libxmljs uses libxml2, which I am familiar with from Python:
>         https://github.com/polotek/libxmljs
>         
>         html5lib site lists 2 options
>         (http://code.google.com/p/html5lib/wiki/Ports):
>         "HTML5 Parser for
>         node.js" (https://github.com/aredridel/html5) -- suggests that
>         it can handle any page on the internet (even malformed)
>         Here's a quote: "HTML5 parsing algorithm. If you find
>         something this can't parse, I'll want to know about it. It
>         should make sense out of anything a browser can."
>         "dom.js" -- dead in experimental state?
>         node-o3-xml -- has GPL components, which I would like to
>         avoid :(
>         node-o3-fastxml -- supposedly fast
>         a large list of XML parsers on this page:
>         https://github.com/joyent/node/wiki/modules#wiki-parsers
>         as far as I can tell, most of the items on that page
>         specifically say XML and probably expect perfect XML.  Also,
>         I'm not sure if they are even usable for working with HTML
>         pages (or if that even matters).  As I mentioned before, I can
>         pass the page through tidylib first
>         (http://tidy.sourceforge.net/libintro.html) before sending it
>         to one of these parsers; but I'm still uncertain about how
>         usable they really are for html (or if that even matters).
>         
>         
>         Anyways,
>         hope it was ok to bring up this question here.  Right now, I'm
>         leaning towards libxmljs or aredridel's html5, but I'm hoping
>         there might be other options I am not aware of.
>         
>         Thanks,
>         Kevin
>         
> 
> -- 
> 


-- 
-- 
Job Board: http://jobs.nodejs.org/
Posting guidelines: 
https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en?hl=en


