Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)
On 19 November 2010 22:17, Brandon S Allbery KF8NH allb...@ece.cmu.edu wrote:
> If a Perl expert tells you that regexps are the way to parse HTML/XML, you can safely conclude they've never actually tried to do it.

From the original message it sounded like the Perl expert recommended regexps to scrape facts from HTML. That's quite a different scenario from parsing, and not unreasonable for regexps.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)
On 11/13/10 09:19, Brent Yorgey wrote:
> On Fri, Nov 12, 2010 at 03:56:26PM -0800, Michael Litchard wrote:
>> a Perl perspective. I let him in on what I was doing, and he opined I should be using pcre. So now I'm second guessing my choices. Why do people choose not to use regex for uri parsing?
>
> Never believe anything anyone coming from a Perl perspective says about regular expressions.

If a Perl expert tells you that regexps are the way to parse HTML/XML, you can safely conclude they've never actually tried to do it.

-- 
brandon s. allbery [linux,solaris,freebsd,perl] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)
On Fri, Nov 12, 2010 at 03:56:26PM -0800, Michael Litchard wrote:
> I've been working on a project that requires me to do screen scraping. When I first started this, I worked off of other people's examples. Not one used regex. By luck I found someone at work to help me along this project. His clues and hints don't use regex either. I was at a point where I had to make a decision concerning design, so I asked the guy sitting next to me at work. He's very experienced, and comes from a Perl perspective. I let him in on what I was doing, and he opined I should be using pcre. So now I'm second guessing my choices. Why do people choose not to use regex for uri parsing?

Never believe anything anyone coming from a Perl perspective says about regular expressions.

-Brent
Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)
> I've been working on a project that requires me to do screen scraping.

If you are screen scraping HTML I think tagsoup is a very good choice. The use of tagsoup means that you have a real HTML 5 compliant parser underneath, and then you can use whatever technique you wish to split up the page text - and regular expressions/parsec might be a reasonable choice. I've written lots of screen scraping stuff with tagsoup, and it's usually very easy - the manual even walks you through a couple of examples:

http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm

> He's very experienced, and comes from a Perl perspective. I let him in on what I was doing, and he opined I should be using pcre.

When all you have is a hammer, everything looks like a thumb. Structured manipulation of algebraic data types is trivial in Haskell, and much less natural in Perl, so they use different techniques in different places.

> So now I'm second guessing my choices. Why do people choose not to use regex for uri parsing?

If you mean HTML parsing, then it's because it's a nightmare to get right, and people on the web do all kinds of crazy stuff. A correct regular expression to match an HTML tag is lots of work. Given that it's a solved problem, why go to all that effort? It is possible to do with regular expressions, but not pleasant.

Thanks, Neil
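To make the remark about algebraic data types concrete, here is a minimal sketch. It assumes the page has already been turned into a flat list of tags by some HTML parser. The `Tag` type below is a simplified stand-in defined only for this example (it is not tagsoup's actual type), and `hrefs` and `example` are hypothetical names.

```haskell
-- Simplified stand-in for a parsed tag stream; NOT tagsoup's real type.
data Tag
  = TagOpen String [(String, String)]  -- tag name and its attributes
  | TagClose String                    -- closing tag name
  | TagText String                     -- text between tags
  deriving (Show, Eq)

-- Collect the href attribute of every <a> tag by plain pattern matching.
-- Tags that are not <a>, or that carry no href, are simply skipped.
hrefs :: [Tag] -> [String]
hrefs tags =
  [ url
  | TagOpen "a" attrs <- tags
  , Just url <- [lookup "href" attrs]
  ]

-- Hypothetical input, shaped as a parser might produce it.
example :: [Tag]
example =
  [ TagOpen "p" []
  , TagOpen "a" [("href", "http://example.org/")]
  , TagText "a link"
  , TagClose "a"
  , TagOpen "a" [("name", "anchor")]
  , TagClose "a"
  , TagClose "p"
  ]
```

Once the markup is a list of values like this, a regex (or anything else) only has to deal with the text inside individual tags, not with the HTML syntax itself.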
Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)
On 13 November 2010 16:46, Neil Mitchell ndmitch...@gmail.com wrote:
>> I've been working on a project that requires me to do screen scraping.
>
> If you are screen scraping HTML I think tagsoup is a very good choice. The use of tagsoup means that you have a real HTML 5 compliant parser underneath, and then you can use whatever technique you wish to split up the page text - and regular expressions/parsec might be a reasonable choice. I've written lots of screen scraping stuff with tagsoup, and it's usually very easy - the manual even walks you through a couple of examples:
>
> http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm

Agreed, the tagsoup library just works. I've used it plenty of times for my scraping needs. E.g. scraping from paste sites:

https://github.com/chrisdone/amelie/blob/master/src/Amelie/Import.hs#L84
https://github.com/chrisdone/hpaste-feed/blob/master/main.hs#L65

You can always regex match on what tagsoup gives you, too.
[Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)
I've been working on a project that requires me to do screen scraping. When I first started this, I worked off of other people's examples. Not one used regex. By luck I found someone at work to help me along this project. His clues and hints don't use regex either. I was at a point where I had to make a decision concerning design, so I asked the guy sitting next to me at work. He's very experienced, and comes from a Perl perspective. I let him in on what I was doing, and he opined I should be using pcre. So now I'm second guessing my choices. Why do people choose not to use regex for uri parsing?
Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)
On 11/12/10 6:56 PM, Michael Litchard wrote:
> I've been working on a project that requires me to do screen scraping. When I first started this, I worked off of other people's examples. Not one used regex. By luck I found someone at work to help me along this project. His clues and hints don't use regex either. I was at a point where I had to make a decision concerning design, so I asked the guy sitting next to me at work. He's very experienced, and comes from a Perl perspective. I let him in on what I was doing, and he opined I should be using pcre. So now I'm second guessing my choices. Why do people choose not to use regex for uri parsing?

As the grammar becomes more complex (i.e., as your patterns become more nuanced), using a real parser framework helps to improve code legibility since you can factor parts of the grammar out, give them names, etc. In addition to the documentation effects, this refactoring also allows you to make your grammars modular by using the same subgrammar in multiple places. While technically you can do the same factoring for constructing the regex that gets handed off to pcre, almost no one does that in practice.

Also, using a real parsing framework allows you to construct more powerful grammars than regular grammars, so if you need the power of unbounded recursion or of context sensitivity, then regular expressions are out. Technically Perl's regexen are Turing complete and aren't regular expressions at all; pcre has inherited some of that extra power, but the point still holds at large.

Even with more restricted regexen than Perl has, the modern idea of a regex isn't regular at all. Beginning of sentence and end of sentence anchors are not regular properties, which allows you to have the worst kind of fun :)

http://zmievski.org/2010/08/the-prime-that-wasnt

Even if you did decide to go for regular expressions, pcre chooses a specific implementation for handling choice (namely backtracking search). Depending on your grammars and the text they'll be applied to, this may not be the most efficient implementation since backtracking can lead to exponential behaviors that other regex implementations don't have. Also, regexes are apparently very difficult to implement *correctly*:

http://www.haskell.org/haskellwiki/Regex_Posix

-- 
Live well,
~wren
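The points about naming parts of a grammar and about unbounded recursion can be sketched with a small parser. This is only an illustration: it uses `Text.ParserCombinators.ReadP` from the base library rather than Parsec, and the nested-list input format and the names `Item`, `number`, `list`, `item` and `parseItem` are made up for the example.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- A nested list of numbers, e.g. "[1,[2,3],[]]" (hypothetical format).
data Item = Num Int | List [Item]
  deriving (Show, Eq)

-- Named sub-grammar: one or more digits.
number :: ReadP Int
number = read <$> munch1 isDigit

-- Named sub-grammar: comma-separated items between square brackets.
-- It refers back to `item`, so lists may nest to any depth.
list :: ReadP [Item]
list = between (char '[') (char ']') (item `sepBy` char ',')

-- An item is either a number or a (possibly nested) list.
item :: ReadP Item
item = (Num <$> number) +++ (List <$> list)

-- Accept the input only if exactly one parse consumes all of it.
parseItem :: String -> Maybe Item
parseItem s = case readP_to_S (item <* eof) s of
  [(x, _)] -> Just x
  _        -> Nothing
```

Here `list` and `item` are mutually recursive, which is what lets the parser follow brackets nested arbitrarily deep; a regular expression in the strict sense cannot express that. For instance, `parseItem "[1,[2,3],[]]"` yields `Just (List [Num 1,List [Num 2,Num 3],List []])`, while the unbalanced input `"[1,[2"` yields `Nothing`.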