On Mar 4, 2007, at 2:12 AM, Gregory John Casamento wrote:
Rogelio,
... [elided] ...
If html is so easy to do wrong and so hard to handle then we put a
bullet in the s*****'s head and move on.
It's not that easy... it's nice to say that we will make a parser that
will only handle "correct" HTML, but when you consider that this will
make the browser virtually useless for navigating almost half of the
web pages out there, the idea loses its appeal. If you write a
from-scratch implementation, you will need to handle such pages if you
want anyone to actually use it.
Later, GJC
... [elided] ...
I do not know if this helps or not, but I'll make the suggestion
anyway. Several years ago I needed a parser for a project at work that
could help extract all of the links and URL references in a set of
related HTML documents, then let me rewrite the documents. This had
two purposes -- rewriting a set of HTML pages as a multi-part related
MIME message including all images and directly related documents for
emailing, and 'retargeting' -- moving a set of related HTML pages into
an altered hierarchy simply by describing the relationships between two
hierarchies (from the one used in our application to the one used by an
arbitrary customer Intranet) and a starting point.

The real monkey wrench was that the HTML was often very sloppy,
containing fragments of HTML customers had entered themselves to
customize the output, as well as incorrect HTML produced by third-party
software modules (which we had source to, but no budgeted time to fix).
While the latter we could do something about, the former we could not.

My solution was to use HTML Tidy, a W3C project by Dave Raggett
( http://www.w3.org/People/Raggett/tidy/ ). There was a project
underway at the time to turn Tidy into a library, but it still had a
way to go -- so, instead, one of our developers took about three days
and turned it into a library suitable to our purpose that worked where
we needed it to -- AIX and Solaris. He gave it an interface that was
very much like SAX, on top of which we wrote our logic to rewrite
pages on the fly. The Tidy code was very clean and easy-to-understand
C, so this was a straightforward endeavor. We were then able to handle
broken pages, with the added advantage that pages externalized by the
application in this way were also "correct" HTML, regardless of
fragmentary or incorrect input. This has worked so well that we've not
had to touch it since (five or six years).
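To give a rough feel for the SAX-style, event-driven shape described
above -- this is only an illustrative sketch, not the actual Tidy-based
C library from that project -- Python's standard-library html.parser
works the same way: the parser invokes a callback per tag, and it
tolerates sloppy, unclosed markup rather than rejecting it:

```python
# Hypothetical sketch of SAX-style link extraction; NOT the original
# Tidy-based library described in this email. html.parser is
# event-driven like SAX: handle_starttag fires once per opening tag,
# even in fragmentary or incorrect HTML.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href/src URL references from (possibly broken) HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs,
        # with names already lowercased by the parser.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append((tag, value))

# Deliberately sloppy input: unclosed tags, uppercase attribute name,
# and an unquoted attribute value.
sloppy = '<p>See <a HREF=page2.html>next<img src="pic.png">'
collector = LinkCollector()
collector.feed(sloppy)
print(collector.links)  # [('a', 'page2.html'), ('img', 'pic.png')]
```

A rewriting pass, as described above, would additionally override the
other handler callbacks (handle_endtag, handle_data) and re-emit each
event with the URLs retargeted.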
Of course, there now exists the official TidyLib. I do not know a lot
about it, but it could be a useful tool in getting from the point of
having a renderer that works with correct HTML/XML to one that can
understand the bulk of the incorrect HTML that exists in the real
world.
--Robert
_______________________________________________
Discuss-gnustep mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/discuss-gnustep