Re: [Haskell-cafe] Is XHT a good tool for parsing web pages?

2010-04-28 Thread Ivan Lazar Miljenovic
Uwe Schmidt u...@fh-wedel.de writes:
> The HTML parser in HXT is based on tagsoup. It's a lazy parser
> (it does not use parsec) and it tries to parse everything as HTML.
> But garbage in, garbage out, there is no approach to repair illegal HTML
> as e.g. the Tidy parsers do. The parser uses tagsoup as a scanner.

So what is parsec used for in HXT then?

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Is XHT a good tool for parsing web pages?

2010-04-28 Thread Uwe Schmidt
Hi Ivan,

> Uwe Schmidt u...@fh-wedel.de writes:
>> The HTML parser in HXT is based on tagsoup. It's a lazy parser
>> (it does not use parsec) and it tries to parse everything as HTML.
>> But garbage in, garbage out, there is no approach to repair illegal HTML
>> as e.g. the Tidy parsers do. The parser uses tagsoup as a scanner.

> So what is parsec used for in HXT then?

It is used for the XML parser, which also deals with DTDs. That parser only
accepts well-formed XML; everything else gives an error (not just a warning,
as with the HTML parser). tagsoup and the HTML parser do not deal with DTDs,
so they can't be used for a full (validating) XML parser.
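
To make the distinction concrete, here is a toy sketch in plain Haskell (base
only). It is not HXT's code and uses neither tagsoup nor parsec; the names
Token, scan and wellFormed are made up for illustration. The scanner accepts
any input, while the separate check rejects input whose tags do not nest
properly, roughly as a strict XML parser would.

```haskell
-- A token produced by the scanner: an opening tag, a closing tag, or text.
data Token = Open String | Close String | Text String
  deriving (Eq, Show)

-- Lenient scanner: splits the input into tokens and never fails.
-- Toy rules only: attributes, comments and entities are not handled.
scan :: String -> [Token]
scan [] = []
scan ('<' : '/' : rest) =
  let (name, rest') = break (== '>') rest
  in Close name : scan (drop 1 rest')
scan ('<' : rest) =
  let (name, rest') = break (== '>') rest
  in Open name : scan (drop 1 rest')
scan s =
  let (txt, rest') = break (== '<') s
  in Text txt : scan rest'

-- Strict check: every opening tag must be closed, in the right order.
-- The first argument of go is the stack of currently open tags.
wellFormed :: [Token] -> Either String ()
wellFormed = go []
  where
    go []      []             = Right ()
    go (o : _) []             = Left ("unclosed tag: " ++ o)
    go stack   (Open n : ts)  = go (n : stack) ts
    go stack   (Text _ : ts)  = go stack ts
    go (o : s) (Close n : ts)
      | o == n                = go s ts
      | otherwise             = Left ("expected </" ++ o ++ ">, found </" ++ n ++ ">")
    go []      (Close n : _)  = Left ("unexpected </" ++ n ++ ">")

main :: IO ()
main = do
  let sloppy = "<p>one<p>two</p>"          -- first <p> is never closed
  print (scan sloppy)                       -- the scanner still yields tokens
  print (wellFormed (scan sloppy))          -- the strict check reports an error
  print (wellFormed (scan "<p>one</p><p>two</p>"))
```

The sloppy input still scans into a token list, but the strict check returns
a Left for it; the properly nested input passes.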

Regards,

   Uwe



[Haskell-cafe] Is XHT a good tool for parsing web pages?

2010-04-27 Thread John Creighton
I looked a little bit at XHT and it seems very elegant for writing
concise definitions of parsers by forms, but I read that it fails if
the XML isn't strict, and I know a lot of web pages don't use strict
XHTML. I therefore wonder whether it is an appropriate tool for web pages.




Re: [Haskell-cafe] Is XHT a good tool for parsing web pages?

2010-04-27 Thread Peter Robinson
On 27 April 2010 16:22, John Creighton johns2...@gmail.com wrote:
> Subject: Is XHT a good tool for parsing web pages?
> I looked a little bit at XHT and it seems very elegant for writing
> concise definitions of parsers by forms but I read that it fails if
> the XML isn't strict and I know a lot of web pages don't use strict
> XHTML. Therefore I wonder if it is an appropriate tool for web pages.

I don't know about XHT but tagsoup [1] does a pretty good job parsing web pages.
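
As a rough illustration, a minimal sketch of pulling link targets out of a
sloppy page with tagsoup. It assumes the tagsoup package is installed; the
helper name hrefs and the sample input are made up for this example.

```haskell
import Text.HTML.TagSoup (fromAttrib, isTagOpenName, parseTags)

-- Collect the href of every <a> tag, in document order.
-- parseTags does not reject malformed input; it returns the tags it finds.
-- An <a> tag without an href attribute contributes an empty string.
hrefs :: String -> [String]
hrefs html = [ fromAttrib "href" t | t <- parseTags html, isTagOpenName "a" t ]

main :: IO ()
main = mapM_ putStrLn (hrefs page)
  where
    -- Hypothetical input with unclosed <p>, <ul>, <li> and <a> tags.
    page = "<p>See <a href=\"http://hackage.haskell.org/package/tagsoup\">tagsoup</a>"
        ++ "<ul><li><a href=\"/other\">other"
```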

  Peter

[1] http://hackage.haskell.org/package/tagsoup


Re: [Haskell-cafe] Is XHT a good tool for parsing web pages?

2010-04-27 Thread Malcolm Wallace

> Is XHT a good tool for parsing web pages?
> I read that it fails if the XML isn't strict and I know a lot of web
> pages don't use strict XHTML.


Do you mean HXT rather than XHT?

I know that the HaXml library has a separate error-correcting HTML  
parser that works around most of the common non-well-formedness bugs  
in HTML:

Text.XML.HaXml.Html.Parse

I believe HXT has a similar parser:
Text.XML.HXT.Parser.HtmlParsec

Indeed, some of the similarities suggest this parser was originally  
lifted directly out of HaXml (as permitted by HaXml's licence),  
although the two modules have now diverged significantly.


Regards,
Malcolm
