Re: [racket-users] html parsing library does not handle 'article' tags -- any solutions?

Neil Van Dyke Thu, 07 Jan 2016 13:43:03 -0800


Matthew Butterick wrote on 01/07/2016 04:18 PM:

When we speak of "parsing HTML" we should distinguish between strict parsing (= explicit adherence to a given HTML spec) and permissive parsing (= converting an HTML-ish string into Racket data). Both have their place.

Alas, I think the W3C had to give up on trying to make people do strict parsing. Not enough people ran the W3C Validator in the earlier days of the Web, and the (since-abandoned) XML-based XHTML standard was started after the strict ship had long since sailed. The W3C has moved behind HTML5 for now.

The `html-parsing` parser was written 15 years ago for doing AI-ish software agent scraping of info from real-world Web pages, so it was necessarily permissive. In some ways, HTML was even worse back then, because Mosaic/Navigator/MSIE tended to accept invalid HTML -- like if the Racket compiler never raised an error or gave a warning message for an error, and simply generated whatever code it wanted to, and programmers worked by mindlessly poking at their source code until the generated code seemed to be doing what they wanted. :) Syntactically, real-world HTML is somewhat better now, because the development tools and the browsers are better. But a permissive parser still makes sense for most purposes, including the massive HTML5 of 15 years later.
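
To make "permissive" concrete, here is a minimal sketch of the idea. It is written in Python with the standard library's `html.parser`, and it is not the `html-parsing` implementation. The class `PermissiveTreeBuilder`, the function `parse_permissively`, the `"document"` root label, and the nested-list output shape are all assumptions made up for this illustration. The point is only that unknown tags such as `article`, unclosed elements, and stray end tags produce data instead of an error.

```python
from html.parser import HTMLParser

# Elements that never have content, so they are not pushed on the stack.
# (A deliberately short, illustrative list.)
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class PermissiveTreeBuilder(HTMLParser):
    """Build nested lists of the form [tag, child, ...] from HTML-ish text.

    Attributes are ignored to keep the sketch short.
    """

    def __init__(self):
        super().__init__()
        self.root = ["document"]   # hypothetical root label
        self.stack = [self.root]   # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        # Any tag name is accepted; nothing is checked against a spec.
        node = [tag]
        self.stack[-1].append(node)
        if tag not in VOID_ELEMENTS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, implicitly closing
        # anything opened inside it. A stray end tag matches nothing and is
        # silently ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.stack[-1].append(data)


def parse_permissively(text):
    builder = PermissiveTreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


if __name__ == "__main__":
    # <p> and <b> are never closed, and </i> was never opened.
    print(parse_permissively("<article><p>Hello <b>world</article></i>"))
    # ['document', ['article', ['p', 'Hello ', ['b', 'world']]]]
```

A strict parser would reject that input at the first unclosed element. The sketch instead recovers a usable tree, which is the behavior scraping real-world pages calls for.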


Neil V.

