Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

2010-11-20 Thread Stephen Tetley
On 19 November 2010 22:17, Brandon S Allbery KF8NH allb...@ece.cmu.edu wrote:

> If a Perl expert tells you that regexps are the way to parse HTML/XML, you
> can safely conclude they've never actually tried to do it.

From the original message it sounded like the Perl expert recommended
regexps to scrape facts from HTML. That's quite a different scenario
from parsing, and not unreasonable for regexps.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

2010-11-19 Thread Brandon S Allbery KF8NH
On 11/13/10 09:19, Brent Yorgey wrote:
> On Fri, Nov 12, 2010 at 03:56:26PM -0800, Michael Litchard wrote:
>> a Perl perspective. I let him in on what I was doing, and he opined I
>> should be using pcre. So now I'm second-guessing my choices. Why do
>> people choose not to use regex for uri parsing?
>
> Never believe anything anyone coming from a Perl perspective says
> about regular expressions.

If a Perl expert tells you that regexps are the way to parse HTML/XML, you
can safely conclude they've never actually tried to do it.

-- 
brandon s. allbery [linux,solaris,freebsd,perl]  allb...@kf8nh.com
system administrator  [openafs,heimdal,too many hats]  allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university  KF8NH


Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

2010-11-15 Thread Brent Yorgey
On Fri, Nov 12, 2010 at 03:56:26PM -0800, Michael Litchard wrote:
> I've been working on a project that requires me to do screen scraping.
> When I first started this, I worked off of other people's examples.
> Not one used regex. By luck I found someone at work to help me along
> this project. His clues and hints don't use regex either. I was at a
> point where I had to make a decision concerning design, so I asked the
> guy sitting next to me at work. He's very experienced, and comes from
> a Perl perspective. I let him in on what I was doing, and he opined I
> should be using pcre. So now I'm second-guessing my choices. Why do
> people choose not to use regex for uri parsing?

Never believe anything anyone coming from a Perl perspective says
about regular expressions.

-Brent


Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

2010-11-15 Thread Neil Mitchell
> I've been working on a project that requires me to do screen scraping.

If you are screen scraping HTML I think tagsoup is a very good choice.
The use of tagsoup means that you have a real HTML 5 compliant parser
underneath, and then you can use whatever technique you wish to split
up the page text - and regular expressions/parsec might be a
reasonable choice. I've written lots of screen scraping stuff with
tagsoup, and it's usually very easy - the manual even walks you
through a couple of examples:
http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm
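
A minimal sketch of that workflow, assuming the tagsoup package is
installed: parseTags turns the page into a flat list of tags, and
ordinary list code picks out the pieces of interest. The helper name
`links` and the idea of collecting hrefs are illustrative assumptions,
not something taken from the manual.

```haskell
import Text.HTML.TagSoup (parseTags, isTagOpenName, fromAttrib)

-- Hypothetical helper: collect the href of every <a> tag in a page.
-- fromAttrib returns "" when the attribute is missing, so those
-- results are dropped.
links :: String -> [String]
links html =
  [ href
  | tag <- parseTags html
  , isTagOpenName "a" tag
  , let href = fromAttrib "href" tag
  , not (null href)
  ]
```

Each string that comes back is plain text, so it can be handed to a
regex or a parsec parser for the fine-grained splitting.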

> He's very experienced, and comes from
> a Perl perspective. I let him in on what I was doing, and he opined I
> should be using pcre.

When all you have is a hammer, everything looks like a thumb.
Structured manipulation of algebraic data types is trivial in Haskell,
and much less natural in Perl, so they use different techniques in
different places.

> So now I'm second-guessing my choices. Why do
> people choose not to use regex for uri parsing?

If you mean HTML parsing, then it's because it's a nightmare to get
right, and people on the web do all kinds of crazy stuff. A correct
regular expression to match an HTML tag is lots of work. Given that
it's a solved problem, why go to all that effort? It is possible to do
with regular expressions, but not pleasant.

Thanks, Neil


Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

2010-11-15 Thread Christopher Done
On 13 November 2010 16:46, Neil Mitchell ndmitch...@gmail.com wrote:
>> I've been working on a project that requires me to do screen scraping.
>
> If you are screen scraping HTML I think tagsoup is a very good choice.
> The use of tagsoup means that you have a real HTML 5 compliant parser
> underneath, and then you can use whatever technique you wish to split
> up the page text - and regular expressions/parsec might be a
> reasonable choice. I've written lots of screen scraping stuff with
> tagsoup, and it's usually very easy - the manual even walks you
> through a couple of examples:
> http://community.haskell.org/~ndm/darcs/tagsoup/tagsoup.htm

Agreed, the tagsoup library just works. I've used it plenty of times
for my scraping needs. E.g. scraping from paste sites:

https://github.com/chrisdone/amelie/blob/master/src/Amelie/Import.hs#L84

https://github.com/chrisdone/hpaste-feed/blob/master/main.hs#L65

You can always regex match on what tagsoup gives you, too.


[Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

2010-11-12 Thread Michael Litchard
I've been working on a project that requires me to do screen scraping.
When I first started this, I worked off of other people's examples.
Not one used regex. By luck I found someone at work to help me along
this project. His clues and hints don't use regex either. I was at a
point where I had to make a decision concerning design, so I asked the
guy sitting next to me at work. He's very experienced, and comes from
a Perl perspective. I let him in on what I was doing, and he opined I
should be using pcre. So now I'm second-guessing my choices. Why do
people choose not to use regex for uri parsing?


Re: [Haskell-cafe] RegEx versus (Parsec, TagSoup, others...)

2010-11-12 Thread wren ng thornton

On 11/12/10 6:56 PM, Michael Litchard wrote:

> I've been working on a project that requires me to do screen scraping.
> When I first started this, I worked off of other people's examples.
> Not one used regex. By luck I found someone at work to help me along
> this project. His clues and hints don't use regex either. I was at a
> point where I had to make a decision concerning design, so I asked the
> guy sitting next to me at work. He's very experienced, and comes from
> a Perl perspective. I let him in on what I was doing, and he opined I
> should be using pcre. So now I'm second-guessing my choices. Why do
> people choose not to use regex for uri parsing?


As the grammar becomes more complex (i.e., as your patterns become more 
nuanced), using a real parser framework helps to improve code legibility 
since you can factor parts of the grammar out, give them names, etc. In 
addition to the documentation effects, this refactoring also allows you 
to make your grammars modular by using the same subgrammar in multiple 
places. While technically you can do the same factoring for constructing 
the regex that gets handed off to pcre, almost no one does that in practice.
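
A minimal sketch of that kind of factoring, using ReadP from GHC's base
library so it runs without extra packages (Parsec code would look much
the same). The Url type, the drastically simplified scheme://host/path
grammar, and every parser name here are assumptions made up for the
illustration.

```haskell
import Data.Char (isAlpha, isAlphaNum)
import Data.List (intercalate)
import Text.ParserCombinators.ReadP

-- Hypothetical, much-simplified URL shape: scheme://host/path
data Url = Url { scheme :: String, host :: String, path :: [String] }
  deriving (Eq, Show)

-- A named subgrammar: one or more alphanumeric or '-' characters.
label :: ReadP String
label = munch1 (\c -> isAlphaNum c || c == '-')

-- Reuses `label` for the dot-separated host name.
hostP :: ReadP String
hostP = intercalate "." <$> sepBy1 label (char '.')

-- Reuses `label` again for the slash-separated path segments.
pathP :: ReadP [String]
pathP = many (char '/' *> label)

-- The whole grammar, assembled from the named pieces.
url :: ReadP Url
url = Url <$> munch1 isAlpha <* string "://" <*> hostP <*> pathP <* eof

-- Accept only a single parse that consumes the entire input.
parseUrl :: String -> Maybe Url
parseUrl s = case readP_to_S url s of
  [(u, "")] -> Just u
  _         -> Nothing
```

Here `label` is written once and used in two places, which is the
modularity point: changing what counts as a label changes both the host
and the path rules together.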


Also, using a real parsing framework allows you to construct more 
powerful grammars than regular grammars, so if you need the power of 
unbounded recursion or of context sensitivity, then regular expressions 
are out. Technically Perl's regexen are Turing complete and aren't 
regular expressions at all; pcre has inherited some of that extra 
power, but the point still holds at large.
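
As a small illustration of unbounded recursion, here is a sketch of a
parser for balanced parentheses that also reports the nesting depth,
something a strictly regular expression cannot do. It uses ReadP from
base; the names `group` and `depth` are made up for this example.

```haskell
import Text.ParserCombinators.ReadP

-- Grammar: group ::= '(' group* ')'
-- The parser calls itself, so nesting may be arbitrarily deep.
-- The result is the nesting depth of the group.
group :: ReadP Int
group = do
  _     <- char '('
  inner <- many group
  _     <- char ')'
  return (1 + maximum (0 : inner))

-- Accept only a single parse that consumes the entire input.
depth :: String -> Maybe Int
depth s = case readP_to_S (group <* eof) s of
  [(d, "")] -> Just d
  _         -> Nothing
```

For instance, `depth "(()(()))"` gives `Just 3`, while an unbalanced
input such as `"(()"` gives `Nothing`.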


Even with more restricted regexen than Perl has, the modern idea of a 
regex isn't regular at all. Beginning of sentence and end of sentence 
anchors are not regular properties, which allows you to have the worst 
kind of fun :)


http://zmievski.org/2010/08/the-prime-that-wasnt

Even if you did decide to go for regular expressions, pcre chooses a 
specific implementation for handling choice (namely backtracking 
search). Depending on your grammars and the text they'll be applied to, 
this may not be the most efficient implementation since backtracking can 
lead to exponential behaviors that other regex implementations don't have.
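
To make the exponential case concrete, here is a toy backtracking
matcher. It is only a sketch of the search strategy, written from
scratch for this illustration; it is not how pcre is implemented, and
pcre has optimizations that may well dispatch this particular pattern
quickly.

```haskell
-- A deliberately tiny regex AST (an assumption for this sketch).
data Re
  = Chr Char    -- a literal character
  | Seq Re Re   -- r1 followed by r2
  | Alt Re Re   -- r1 | r2
  | Star Re     -- r*

-- Backtracking search in continuation-passing style: match r against
-- a prefix of the input, then hand the rest to k. When k fails, (||)
-- falls through to the next alternative.
match :: Re -> String -> (String -> Bool) -> Bool
match (Chr c) (x:xs) k = c == x && k xs
match (Chr _) []     _ = False
match (Seq a b) s k    = match a s (\rest -> match b rest k)
match (Alt a b) s k    = match a s k || match b s k
match (Star a) s k     =
  -- Greedy: try one more iteration first (it must consume input,
  -- which rules out looping forever), then try zero iterations.
  match a s (\rest -> length rest < length s && match (Star a) rest k)
    || k s

-- Succeed only if the whole input is consumed.
fullMatch :: Re -> String -> Bool
fullMatch r s = match r s null

-- (a|aa)*b : a run of n 'a's can be split into 1s and 2s in
-- Fibonacci-many ways, and every split is tried before giving up.
slow :: Re
slow = Seq (Star (Alt (Chr 'a') (Seq (Chr 'a') (Chr 'a')))) (Chr 'b')
```

`fullMatch slow (replicate n 'a')` is False for every n, but the work
needed to discover that grows roughly like the Fibonacci numbers in n.
An automaton-based implementation would reject the same input in time
linear in its length.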


Also, regexes are apparently very difficult to implement *correctly*:

http://www.haskell.org/haskellwiki/Regex_Posix

--
Live well,
~wren