Re: [Haskell-cafe] Is XHT a good tool for parsing web pages?
Uwe Schmidt u...@fh-wedel.de writes:

> The HTML parser in HXT is based on tagsoup. It's a lazy parser (it
> does not use parsec), and it tries to parse everything as HTML. But
> garbage in, garbage out: there is no attempt to repair illegal HTML
> as, e.g., the Tidy parsers do. The parser uses tagsoup as a scanner.

So what is parsec used for in HXT, then?

--
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Is XHT a good tool for parsing web pages?
Hi Ivan,

> Uwe Schmidt u...@fh-wedel.de writes:
>> The HTML parser in HXT is based on tagsoup. It's a lazy parser (it
>> does not use parsec), and it tries to parse everything as HTML. But
>> garbage in, garbage out: there is no attempt to repair illegal HTML
>> as, e.g., the Tidy parsers do. The parser uses tagsoup as a scanner.
>
> So what is parsec used for in HXT, then?

For the XML parser. This XML parser also deals with DTDs. It accepts
only well-formed XML; everything else gives an error (not just a
warning, as with the HTML parser). tagsoup and the HTML parser do not
deal with DTDs, so they cannot be used for a full (validating) XML
parser.

Regards,

Uwe
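The behavioural difference between the two front ends can be sketched in a few lines of plain Haskell. Note that everything below is illustrative only — these are not HXT's real functions (those live in Text.XML.HXT.Core); the sketch works over a flat list of tag names where a leading '/' marks a close tag:

```haskell
-- Illustrative only: NOT HXT's real API.  A strict, XML-like front
-- end makes ill-formed input a hard error, while a lenient,
-- tagsoup-like front end takes the tags as they come, garbage and all.

-- Strict front end: check that every open tag has a matching close
-- tag, in properly nested order, or report an error.
parseStrict :: [String] -> Either String [String]
parseStrict = go []
  where
    go []      []   = Right []
    go (t : _) []   = Left ("unclosed element: " ++ t)
    go stack (('/' : name) : rest) =
      case stack of
        (t : ts) | t == name -> (("/" ++ name) :) <$> go ts rest
        _                    -> Left ("unexpected close tag: " ++ name)
    go stack (tag : rest) = (tag :) <$> go (tag : stack) rest

-- Lenient front end: never fails, simply passes the tags through.
parseLenient :: [String] -> [String]
parseLenient = id
```

For example, `parseStrict ["html", "body", "/html"]` reports the mismatched close tag as an error, while `parseLenient` happily accepts the same input — which is exactly the trade-off between the validating XML parser and the tagsoup-based HTML parser described above.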
[Haskell-cafe] Is XHT a good tool for parsing web pages?
Subject: Is XHT a good tool for parsing web pages?

I have looked a little at XHT, and it seems very elegant for writing
concise definitions of parsers by forms. However, I have read that it
fails if the XML isn't strict, and I know a lot of web pages don't use
strict XHTML. I therefore wonder whether it is an appropriate tool for
parsing web pages.
Re: [Haskell-cafe] Is XHT a good tool for parsing web pages?
On 27 April 2010 16:22, John Creighton johns2...@gmail.com wrote:

> I looked a little bit at XHT and it seems very elegant for writing
> concise definitions of parsers by forms, but I read that it fails if
> the XML isn't strict, and I know a lot of web pages don't use strict
> XHTML. Therefore I wonder if it is an appropriate tool for web pages.

I don't know about XHT, but tagsoup [1] does a pretty good job of
parsing real-world web pages.

Peter

[1] http://hackage.haskell.org/package/tagsoup
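To give a flavour of the tagsoup approach without pulling in the package, here is a tiny tag scanner in plain Haskell. This is a simplified stand-in, not the real library: tagsoup's actual entry point is Text.HTML.TagSoup.parseTags, and it additionally handles attributes, entities, and comments. The essential property is the same, though — scanning never fails, however malformed the markup:

```haskell
-- Simplified stand-in for tagsoup's parseTags: split markup into tag
-- and text tokens, never failing on malformed input.
data Tag = TagOpen String | TagClose String | TagText String
  deriving (Eq, Show)

scanTags :: String -> [Tag]
scanTags [] = []
scanTags ('<' : '/' : rest) =
  let (name, rest') = break (== '>') rest
  in TagClose name : scanTags (drop 1 rest')
scanTags ('<' : rest) =
  -- In this sketch, attributes are simply kept as part of the name.
  let (name, rest') = break (== '>') rest
  in TagOpen name : scanTags (drop 1 rest')
scanTags s =
  let (text, rest) = break (== '<') s
  in TagText text : scanTags rest
```

For instance, `scanTags "<b>never closed"` yields `[TagOpen "b", TagText "never closed"]` — the unclosed element is no obstacle, which is why this style of scanner copes so well with real-world web pages.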
Re: [Haskell-cafe] Is XHT a good tool for parsing web pages?
> Is XHT a good tool for parsing web pages? I read that it fails if the
> XML isn't strict, and I know a lot of web pages don't use strict
> XHTML.

Do you mean HXT rather than XHT? I know that the HaXml library has a
separate error-correcting HTML parser that works around most of the
common non-well-formedness bugs in HTML:

    Text.XML.HaXml.Html.Parse

I believe HXT has a similar parser:

    Text.XML.HXT.Parser.HtmlParsec

Indeed, some of the similarities suggest this parser was originally
lifted directly out of HaXml (as permitted by HaXml's licence),
although the two modules have now diverged significantly.

Regards,
Malcolm
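For a flavour of the kind of repair such an error-correcting parser performs, here is a hypothetical sketch (emphatically not the actual HaXml or HXT code) of one common fix — auto-closing elements that were left open — over a flat list of tag names where a leading '/' marks a close tag:

```haskell
-- Hypothetical sketch of one error-correcting repair: append close
-- tags for elements left open, in reverse order of opening, and
-- ignore stray close tags.  NOT the actual HaXml/HXT implementation.
repairUnclosed :: [String] -> [String]
repairUnclosed tags = tags ++ map ('/' :) (leftOpen tags)
  where
    leftOpen = foldl step []
    step stack ('/' : name) = case stack of
      (t : ts) | t == name -> ts      -- proper close: pop it
      _                    -> stack   -- stray close: ignore it
    step stack tag = tag : stack      -- open tag: push it
```

So `repairUnclosed ["html", "body"]` produces `["html", "body", "/body", "/html"]`, turning non-well-formed input into something a strict XML pipeline can accept — the same spirit as the workarounds in Text.XML.HaXml.Html.Parse and Text.XML.HXT.Parser.HtmlParsec.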