Re: [nodejs] Re: What libraries are available to parse web pages?

Matt Sat, 04 May 2013 10:04:12 -0700

Another upvote for cheerio. It doesn't support all jQuery selectors, but
it's a good enough subset.



On Sat, May 4, 2013 at 11:56 AM, George Snelling <[email protected]> wrote:

> We started with jsdom and then switched to cheerio.  Very happy with it.
>
>
> On Friday, May 3, 2013 9:29:55 PM UTC-7, Kevin Ar18 wrote:
>>
>> Not sure if this is an appropriate question to ask here, but I could not
>> find any other major node.js communities to ask on.
>>
>>
>> My goal is to extract data from pages on the internet.
>> Basically, I need to parse HTML pages (which might have errors, be
>> malformed, etc.) and turn each page into a usable DOM whose elements I
>> can then access through DOM methods, XPath, or something similar.
>> In Python, I used tidylib + libxml2 to do this.
>> html5lib is another library I was considering using in Python.
>>
>> However, I would like to find a node.js library that can do what I need.
>> One that meets these requirements:
>> * can handle malformed html (basically the internet).  This is NOT a
>> strict requirement, since I can pass the page through tidylib first before
>> I send it to the node.js library.
>> * presents an easy to use interface to access elements from the page
>> (bonus if it can access malformed parts of the page instead of just
>> discarding them).  The basic DOM (getElementById, etc...) or an xpath
>> command would be sufficient.  I can always build more features onto it if
>> needed.
>> * BSD/MIT licensed end-to-end; no copyleft parts
>> * bonus for speed and for being designed to work well with, or feel
>> native to, node.js
>>
>>
>> Several questions:
>> 1) Does anyone else have experience with using node.js in the way I just
>> described to extract various pieces of data from a site?  If so,  what
>> libraries do you recommend -- which do you use and which alternatives (that
>> you don't use) have you also considered?
>>
>> 2) One of the big problems is handling malformed pages.  There is a
>> library/program called tidylib
>> (http://tidy.sourceforge.net/libintro.html) that I can
>> use to generate nicely formed pages.  So... if I use this library first to
>> clean up the page, could I use some of the stricter XML to JS libraries I
>> see on this page?
>> https://github.com/joyent/node/wiki/modules#wiki-parsers-xml
>>
>> I am still a little uncertain whether a library that says "XML" is
>> therefore no good for handling HTML pages.
>>
>>
>> 3) Not A Question.
>> Anyways, here are some libraries that caught my interest.
>> I found *libxmljs* and "*HTML5 Parser for node.js*" to be particularly
>> interesting projects.
>>
>> However, I am still a little bit confused about the options available in
>> node.js:
>>
>> *libxmljs* uses libxml2, which I am familiar with from Python:
>> https://github.com/polotek/libxmljs
>>
>> The html5lib site lists 2 options
>> (http://code.google.com/p/html5lib/wiki/Ports):
>> *"HTML5 Parser for node.js"*
>> (https://github.com/aredridel/html5)
>> -- suggests that it can handle any page on the internet (even malformed)
>> Here's a quote: "HTML5 parsing algorithm. If you find something this can't
>> parse, I'll want to know about it. It should make sense out of anything a
>> browser can."
>> "dom.js" -- dead in experimental state?
>> node-o3-xml -- has GPL components, which I would like to avoid :(
>> *node-o3-fastxml* -- supposedly fast
>> *a large list of XML parsers on this page:*
>> https://github.com/joyent/node/wiki/modules#wiki-parsers
>> as far as I can tell, most of the items on that page specifically say XML
>> and probably expect perfect XML.  Also, I'm not sure if they are even
>> usable for working with HTML pages (or if that even matters).  As I
>> mentioned before, I can pass the page through tidylib first
>> (http://tidy.sourceforge.net/libintro.html)
>> before sending it to one of these parsers; but I'm still uncertain about
>> how usable they really are for html (or if that even matters).
>>
>>
>> Anyways,
>> hope it was ok to bring up this question here.  Right now, I'm leaning
>> towards libxmljs or aredridel's html5, but I'm hoping there might be other
>> options I am not aware of.
>>
>> Thanks,
>> Kevin
>>
>  --
> --
> Job Board: http://jobs.nodejs.org/
> Posting guidelines:
> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
> You received this message because you are subscribed to the Google
> Groups "nodejs" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/nodejs?hl=en?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "nodejs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>

