There may not be anything wrong with existing parsers. It took me several months before I discovered certain problematic HTML pages with tautologistics's parser. I just figured an HTML parser from an actual web browser would likely be more compliant than parsers written as standalone libraries. libhubbub is used in an actual web browser, so it has already been tested on some pretty ugly HTML pages. Since libhubbub is written in C, I can make use of libuv's thread pool to schedule workers to do the parsing asynchronously, which will probably improve performance on multi-core machines.
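To make the difference between parsing inline and handing the work off concrete, here is a minimal sketch in plain JavaScript. It is not the hubbub or node-htmlparser API: `countOpenTags`, `parseBlocking` and `parseNonBlocking` are hypothetical names, the "parser" is a toy tag counter, and `setImmediate` only stands in for work that a native module would queue on libuv's thread pool.

```javascript
// Toy stand-in for a real parser: counts opening tags in a string.
// Hypothetical helper; nothing here is the hubbub or node-htmlparser API.
function countOpenTags(html) {
  const matches = html.match(/<[a-zA-Z][^>]*>/g);
  return matches ? matches.length : 0;
}

// Blocking style: the callback runs before parseBlocking returns.
function parseBlocking(html, callback) {
  callback(null, countOpenTags(html));
}

// Non-blocking style: the caller gets control back first and the
// callback fires on a later turn of the event loop. setImmediate is an
// assumption standing in for a worker scheduled on a thread pool.
function parseNonBlocking(html, callback) {
  setImmediate(() => callback(null, countOpenTags(html)));
}

// Record the order in which things happen.
const events = [];

parseBlocking('<p>hi <b>there</b></p>', (err, count) => {
  events.push('blocking:' + count);
});
events.push('after blocking call');

parseNonBlocking('<p>hi</p>', (err, count) => {
  events.push('non-blocking:' + count);
});
events.push('after non-blocking call');
```

With the blocking call, the result is recorded before the line after the call runs. With the non-blocking call, the line after the call runs first and the result arrives later, which is what leaves the event loop free while a worker does the parsing.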
However, I didn't want people to have to learn a new API, so it was a matter of wrapping libhubbub in a library that is API-compatible with node-htmlparser. If you use it like tautologistics's parser, it will operate as a blocking call, since the original API does not support non-blocking semantics. I have yet to add documentation on how to use it in non-blocking mode.

On Fri, Oct 26, 2012 at 8:58 AM, Jérémy Lal <[email protected]> wrote:
> What's wrong with the htmlparser2 module used by cheerio ?
>
> On 26/10/2012 17:57, Domenic Denicola wrote:
> > Very nice. As maintainer of jsdom, I've been looking for a replacement
> > default HTML parser that could solve many of the parsing issues we've
> > encountered. I'll put you on the shortlist. Thanks for announcing.
> >
> > On Friday, October 26, 2012 6:07:48 AM UTC-4, Dean Mao wrote:
> >
> > Hi All,
> >
> > I created a native html parser based on libhubbub, a parser library
> > used by the netsurf browser project. There were quite a few html pages
> > that didn't parse correctly on tautologistics's html parser so I thought it
> > might be easier pulling in a parser from an existing web browser. I
> > considered using webkit & firefox, but those browsers had too many external
> > dependencies. The parser can operate in blocking or non-blocking mode, and
> > streamed (chunked) data. The wonderful jsdom library uses
> > tautologistics/node-htmlparser by default, but one can choose this parser
> > as the overriding default. The readme shows an example of how this is done.
> >
> > Github:
> > https://github.com/deanmao/node-hubbub
> >
> > To install:
> > npm install hubbub
> >
> > --
> > Job Board: http://jobs.nodejs.org/
> > Posting guidelines:
> > https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
> > You received this message because you are subscribed to the Google
> > Groups "nodejs" group.
> > To post to this group, send email to [email protected]
> > To unsubscribe from this group, send email to
> > [email protected]
> > For more options, visit this group at
> > http://groups.google.com/group/nodejs?hl=en?hl=en
