I'm jumping into the middle of a discussion of HTML-Tree (whose existence I learned of only just now) compared to HTML::Parser and HTML::TreeBuilder. My apologies if I reveal that I've missed too much of the discussion to date.

On http://www.best.com/~pjl/software/html_tree/comparison.html, Paul J. Lucas states:

>HTML Tree is very similar to the HTML::Parser and HTML::TreeBuilder
>Perl modules by Gisle Aas and Michael Chase,

Are you using one of the old old old versions of TreeBuilder that doesn't have my name on it? If so, FIE UPON THEE!

Also, be aware that "HTML-Tree" is the name of the CPAN dist that contains HTML::TreeBuilder, HTML::Element, and two other modules. Your coming out with another module suite called, homophonically, "HTML Tree" is just asking for confusion all around.

> [...] except that it: [...]

[I'm taking your points out of order]

> 2. Isn't a strict DTD (Document Type Definition) parser. The goal is
>    to parse HTML files fast, not check for validity. (You should
>    check the validity of your HTML files with other tools before you
>    put them on your web site anyway.) HTML Tree couldn't care less
>    what attributes a given HTML element has just so long as the
>    syntax is correct. This is actually similar to browsers in that
>    both are very permissive in what they accept.

This clearly implies that TreeBuilder /is/ a strict DTD parser, and that it checks for validity. It isn't really, and doesn't really, and I really wish you would clarify this, lest anyone consider it a categorical misrepresentation of how TreeBuilder works.

TreeBuilder "knows" a few things about HTML, but only the bare minimum required to be able to produce correct parse trees. (For example, if it sees "<p>foo<p>bar" (which is perfectly /valid/ code), it uses the fact that a p element can't be a child of another p element, and so closes the first p before opening the second. How you can do this any other way is beyond me.)
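
To make the "<p>foo<p>bar" case concrete, here is a minimal sketch of that kind of context rule. It is written in Python on top of the standard html.parser module purely for illustration; it is not TreeBuilder's code, and the names CLOSES, Node, TinyTreeBuilder, parse, and the implicit_tags switch are all hypothetical. The rule table holds only the one rule mentioned above (a p may not sit inside another p).

```python
from html.parser import HTMLParser

# Hypothetical rule table: tag -> set of open ancestors it may not be nested in.
# Only the <p> rule from the example is filled in.
CLOSES = {"p": {"p"}}


class Node:
    """A bare-bones tree node: a tag name plus child nodes and text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def to_list(self):
        """Render the subtree as nested lists, e.g. ['p', 'foo']."""
        return [self.tag] + [
            c.to_list() if isinstance(c, Node) else c for c in self.children
        ]


class TinyTreeBuilder(HTMLParser):
    def __init__(self, implicit_tags=True):
        super().__init__()
        self.implicit_tags = implicit_tags  # hypothetical switch for the rule
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if self.implicit_tags:
            # Walk up the open elements; if one of them may not contain
            # this tag, close everything up to and including it.
            node = self.current
            while node is not self.root:
                if node.tag in CLOSES.get(tag, ()):
                    self.current = node.parent
                    break
                node = node.parent
        child = Node(tag, self.current)
        self.current.children.append(child)
        self.current = child

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root:
            if node.tag == tag:
                self.current = node.parent
                return
            node = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse(html, implicit_tags=True):
    builder = TinyTreeBuilder(implicit_tags)
    builder.feed(html)
    builder.close()
    return builder.root.to_list()
```

With the rule switched on, parse("<p>foo<p>bar") gives two sibling paragraphs, ['root', ['p', 'foo'], ['p', 'bar']]. With implicit_tags=False the second p ends up nested inside the first, which is the incorrect tree you get when the parser knows nothing about HTML.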

Now, if you /can/ say that the HTML you have coming in is going to be valid AND has close-tags everywhere, then you don't need to know anything about HTML -- and, in fact, TreeBuilder has a mode, $tree->implicit_tags(0), that you can switch it into that bypasses (nearly?) all that slow and clunky context checking. (I've basically never used that mode, since it produces incorrect parses for anything but perfect code -- which is pretty hard to come by.)

But (while I'm off on this tangent) the REAL way to do this would be to use XHTML. If you have control over the quality of the HTML coming in, then you can just demand that it be XHTML (just run it thru Raggett's Tidy first), and use an XML parser. Fast fast fast, because you need none of the context checking that's necessary for all non-XML-like SGML dialects (like HTML).

> 3. Offers simple conditional and looping mechanisms assisting in the
>    generation of dynamic content.

I consider that well outside the scope of what HTML::TreeBuilder is for, so I think it's fine if you do that and I don't. If I did want that kind of thing, I'd do it on top of XHTML (with the conditional and loopy things in PIs); or I could do the same in HTML, and have a 'preprocess' method in TreeBuilder that would traverse the tree and run whatever the PIs say to run.

In fact, people can even now do this for themselves by just subclassing TreeBuilder and overriding whatever method it is that catches PIs, and having loop construct fun in there. And the new TreeBuilder will provide a better mechanism for that sort of thing, which will require no subclassing. (For various reasons, I'm thinking of deprecating use of TreeBuilder as a base class, and then explicitly making it unsubclassable; I REALLY don't like breaking things for other people -- but it's my experience that very few people have actually been subclassing it.)

> 1. Is several times faster.
>    HTML Tree owes its speed to two things:
>    using mmap(2) to read the HTML file bypassing conventional I/O and
>    buffering, and being written entirely in C++ as opposed to Perl

I haven't benchmarked anything here, but I'll note that my first priority in my rewrite of HTML::TreeBuilder was that it actually produce correct parse trees; compatibility with pre-XS versions of HTML::Parser has also been a priority; and speed, admittedly, is a distant third. (Somewhere in there is the new ability to have XML::Twig-like callbacks, useful in processing large HTML documents; I'm adding that feature now; mercifully, it doesn't really interfere with the other priorities of correctness, compatibility, and speed! I owe a million thanks to Michel Rodriguez (XML::Twig author) for the idea. Like all great ideas, it's obvious -- in retrospect!)

But speaking of speed: I'm about to put out a new version of TreeBuilder and Element that should exhibit improved speed (besides some fancy new features).

I am also considering some speed tweaks that should make TreeBuilder even faster for people using the XS versions of HTML::Parser -- by telling the Parser object not to use the derived-class interface, and instead specifying callbacks that do about the same thing.

(I think I could even optimize things a bit more by scrapping the whole existing interface and having the specified callbacks be closures, with the variables for the top of the tree and a few other temp variables closed over -- that'd avoid having to constantly allocate $self and stuff. But this would work only under XS versions of Parser. I may well just make a version of TreeBuilder that compiles itself one way for XS versions of Parser, and another way for pre-XS versions -- as the differences would be minor and systematic. I'm also considering commenting out all the print "..." if $Debug > 2 statements that are practically every other instruction in TreeBuilder.
Those are indispensable for any kind of development or debugging, but they do exact a minor performance hit for users.)

I'm currently banging out my next TPJ article (greatly involving TreeBuilder, by the way), and once that's done I'll try to release the new HTML-Tree (TreeBuilder, Element, et al) containing many of the features discussed above. Look for it on CPAN, hopefully in the next two weeks.

--
Sean M. Burke [EMAIL PROTECTED] http://www.spinn.net/~sburke/