On Mar 4, 2007, at 2:12 AM, Gregory John Casamento wrote:
Rogelio,
... [elided] ...
If html is so easy to do wrong and so hard to handle then we put a
bullet in the s*****'s head and move on.
It's not that easy... it's nice to say that we will make a parser that
will only handle "correct" HTML, but when you consider that this will
make the browser virtually useless for navigating almost half of the
web pages out there, the idea loses its appeal. If you write a
from-scratch implementation, you will need to handle such pages if you
want anyone to actually use it.
Later, GJC
... [elided] ...
I do not know if this helps or not, but I'll make the suggestion
anyway. Several years ago I needed a parser for a project at work that
could help extract all of the links and URL references in a set of
related HTML documents, then let me rewrite the documents. This had
two purposes -- rewriting a set of HTML pages as a multi-part related
MIME message including all images and directly related documents for
emailing, and 'retargeting' -- moving a set of related HTML pages into
an altered hierarchy simply by describing the relationships between two
hierarchies (from the one used in our application to the one used by an
arbitrary customer Intranet) and a starting point.

The real monkey wrench was that the HTML was often very sloppy,
containing fragments of HTML customers had entered themselves to
customize the output, as well as incorrect HTML produced by third-party
software modules (which we had source to, but no budgeted time to fix).
While the latter we could do something about, the former we could not.

My solution was to use HTML Tidy, a W3C project by Dave Raggett
( http://www.w3.org/People/Raggett/tidy/ ). There was a project
underway at the time to turn Tidy into a library, but it still had a
way to go -- so, instead, one of our developers took about three days
and turned it into a library suitable to our purpose that worked where
we needed it to -- AIX and Solaris. He gave it an interface that was
very much like SAX, on top of which we wrote our logic to rewrite
pages on the fly. The Tidy code was very clean and easy-to-understand
C, so this was a straightforward endeavor. We were then able to handle
broken pages, with the added advantage that pages externalized by the
application in this way were also "correct" HTML, regardless of
fragmentary or incorrect input. This has worked so well that we've not
had to touch it since (five or six years).
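To give a rough feel for the SAX-style, event-driven shape described
above -- this is only an illustrative sketch, not the actual Tidy-based
C library from that project -- Python's standard-library html.parser
works the same way: the parser invokes a callback per tag, and it
tolerates sloppy, unclosed markup rather than rejecting it:

```python
# Hypothetical sketch of SAX-style link extraction; NOT the original
# Tidy-based library described in this email. html.parser is
# event-driven like SAX: handle_starttag fires once per opening tag,
# even in fragmentary or incorrect HTML.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href/src URL references from (possibly broken) HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs,
        # with names already lowercased by the parser.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append((tag, value))

# Deliberately sloppy input: unclosed tags, uppercase attribute name,
# and an unquoted attribute value.
sloppy = '<p>See <a HREF=page2.html>next<img src="pic.png">'
collector = LinkCollector()
collector.feed(sloppy)
print(collector.links)  # [('a', 'page2.html'), ('img', 'pic.png')]
```

A rewriting pass, as described above, would additionally override the
other handler callbacks (handle_endtag, handle_data) and re-emit each
event with the URLs retargeted.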
Of course, there now exists the official TidyLib. I do not know a lot
about it, but it could be a useful tool in getting from the point of
having a renderer that works with correct HTML/XML to one that can
understand the bulk of the incorrect HTML that exists in the real
world.
--Robert
_______________________________________________
Discuss-gnustep mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/discuss-gnustep