Re: [Haskell-cafe] attoparsec and backtracking

2013-03-19 Thread oleg

Wren Thornton wrote:
 I had some similar issues recently. The trick is figuring out how to
 convince attoparsec to commit to a particular alternative. For example,
 consider the grammar: A (B A)* C; where if the B succeeds then we want to
 commit to parsing an A (and if it fails then return A's error, not C's).

Indeed. Consider the following (greatly simplified) fragment from the
OCaml grammar

| let; r = opt_rec; bi = binding; in;
   x = expr LEVEL ; -
| function; a = match_case -
| if; e1 = SELF; then; e2 = expr LEVEL top;
  else; e3 = expr LEVEL top -
...
| false - 
| true  - 

It would be bizarre if the parser -- upon seeing if but not finding
then -- would've reported the error that `found if when true was
expected'. Many people would think that when the parser comes across
if, it should commit to parsing the conditional. And if it fails later, it
should report the error with the conditional, rather than trying to
test how else the conditional cannot be parsed. This is exactly the
intuition of pattern matching. For example, given

 foo (if:t) = case t of
  (e:then:_) - e
 foo _ = 

we expect that 
foo [if,false,false]
will throw an exception rather than return the empty string. If the
pattern has matched, we are committed to the corresponding
branch. Such an intuition ought to apply to parsing -- and indeed it
does. The OCaml grammar above was taken from the camlp4 code. Camlp4
parsers

http://caml.inria.fr/pub/docs/tutorial-camlp4/tutorial002.html#toc6

do pattern-matching on a stream, for example
 # let rec expr =
 parser
   [ 'If; x = expr; 'Then; y = expr; 'Else; z = expr ] - if
 | [ 'Let; 'Ident x; 'Equal; x = expr; 'In; y = expr ] - let

and raise two different sort of exceptions. A parser raises
Stream.Failure if it failed on the first element of the stream (in the
above case, if the stream contains neither If nor Let). If the parser
successfully consumed the first element but failed later, a different
Stream.Error is thrown. Although Camlp4 has many detractors, even they
admit that the parsing technology by itself is surprisingly powerful,
and produces error messages that are oftentimes better than those by
the yacc-like, native OCaml parser. Camlp4 parsers are used
extensively in Coq.

The idea of two different failures may help in the case of attoparsec
or parsec. Regular parser failure initiates backtracking. If we wish
to terminate the parser, we should raise the exception (and cut the
rest of the choice points). Perhaps the could be a combinator `commit'
that converts a failure to the exception. In the original example
A (B A)* C we would use it as A (B (commit A))* C.



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] attoparsec and backtracking

2013-03-16 Thread Roman Cheplyaka
* Niklas Hambüchen m...@nh2.me [2013-03-16 03:49:29+]
 I would agree that what attoparsec does for | of Alternative and mplus
 for MonadPlus is correct since e.g. the mplus laws say that a failure
 must be identity and therefore the following alternatives must be
 considered. I also find it very convenient that attoparsec works this
 way, and prefer it to what parsec does by default.

empty/mzero are indeed identities in Parsec.

What doesn't hold is the law

   v  mzero = mzero

But this one is often violated:

   flip runState 0 $ runMaybeT mzero
  (Nothing,0)

   flip runState 0 $ runMaybeT $ lift (modify (+1))  mzero
  (Nothing,1)

Roman

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] attoparsec and backtracking

2013-03-16 Thread Evan Laforge
 I think the mistake here is to parse something and then decide if
 its it valid. It should be the parser which decides whether its
 valid. So rather than:

  suffix - A.option  ((:) $ A.letter_ascii)

 try:

  typ - A.choice [ {- list or valid suffix parsers -} ]
  return $ Score.Typed typ num

I actually had that originally, but but switched to fail-after for the
better error msg.  It worked with parsec, but then I lost it again
when I switched to attoparsec.  I think Wren is right, I really would
need to refactor the parser to put the decisions in the right spot.

 We you using Parsec as a token parser or as a Char parser. Obviously
 the second is going to be slow in comparison to the first.

It was actually the Text version of parsec, back when that was new.  I
should go do a profile again someday, but since attoparsec and parsec
APIs are almost but not quite the same, it's kind of a pain.  I
actually tried the Text version of attoparsec, back when that was not
yet integrated into attoparsec itself, and bytestring was still
significantly faster.  So I don't know how much was Text vs.
ByteString and how much was parsec vs. attoparsec.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] attoparsec and backtracking

2013-03-16 Thread Evan Laforge
On Fri, Mar 15, 2013 at 8:49 PM, Niklas Hambüchen m...@nh2.me wrote:
 Is it not possible to add an alternative (no pun intended) to | that
 supports the semantics Evan wants?

I assume it's the performance thing.  Presumably it would need to pass
an extra flag with to the failure continuation to tell it to not
retry, though that doesn't sound so bad.  Actually, Bryan's response
here:

https://github.com/bos/attoparsec/issues/42

makes it sound like he's not opposed to || on performance grounds,
just that | is more intuitive.  I agree in general, but not in the
case of error msgs!  So maybe I just need to see if I can make a patch
to add ||, sounds like Johan at least would be into that.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] attoparsec and backtracking

2013-03-16 Thread Niklas Hambüchen
@Evan Thanks for that link, I posted a somewhat longer argument in 
there.

Personally, I'd love ||.

On Sun 17 Mar 2013 02:02:44 GMT, Evan Laforge wrote:
 On Fri, Mar 15, 2013 at 8:49 PM, Niklas Hambüchen m...@nh2.me wrote:
 Is it not possible to add an alternative (no pun intended) to | that
 supports the semantics Evan wants?

 I assume it's the performance thing.  Presumably it would need to pass
 an extra flag with to the failure continuation to tell it to not
 retry, though that doesn't sound so bad.  Actually, Bryan's response
 here:

 https://github.com/bos/attoparsec/issues/42

 makes it sound like he's not opposed to || on performance grounds,
 just that | is more intuitive.  I agree in general, but not in the
 case of error msgs!  So maybe I just need to see if I can make a patch
 to add ||, sounds like Johan at least would be into that.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] attoparsec and backtracking

2013-03-15 Thread Evan Laforge
I have a couple of problems with attoparsec which I think are related
to its always backtrack nature, but maybe there's some other way to
solve the same problems.

The first is that it's hard to get the right error msg out.  For
instance, I have a parser that tries to parse a number with an
optional type suffix.  It's an error if the suffix is unrecognized:

p_num :: A.Parser Score.TypedVal
p_num = do
num - p_untyped_num
suffix - A.option  ((:) $ A.letter_ascii)
case Score.code_to_type suffix of
Nothing - fail $ p_num expected suffix in [cdsr]:  ++ show suffix
Just typ - return $ Score.Typed typ num

However, which error msg shows up depends on the order of the (|)
alternatives, and in general the global structure of the entire
parser, because I think it just backtracks and then picks the last
failing backtrack.  Even after carefully rearranging all the parsers
it seems impossible to get this particular error to bubble up to the
top.  The thing is, as soon as I see an unexpected suffix I know I can
fail entirely right there, with exactly that error msg, but since
there's no way to turn off backtracking I think there's no way to do
that.


The second thing is that I want to lex a single token.  I thought I
could just parse a term and then see how far I got and then use that
index to splitAt the input.  But attoparsec doesn't keep track of the
current index, so I wrote ((,) $ p_term * A.takeByteString), and
then I can use the length of the left over bytestring.  But due to
backtracking, that changes what p_term will parse.  Since
takeByteString always succeeds, it will cause p_term to backtrack
until it finds some prefix that will match.  The result is that
instead of failing to parse, 1. will lex as (1, .).  Since I
integrate lexing and parsing (as is common with combinator parsers),
and since it seems I can't keep track of byte position with
attoparsec, I think I'm out of luck trying to do this the easy way.  I
think I have to write a separate lexer that tries to have the same
behaviour as the parser, but is all separate code.


I know attoparsec was designed for speed, and error reporting and
source positions are secondary.  Am I simply asking too much of it?  I
originally used parsec, but parsing speed is my main bottleneck, so I
don't want to give ground there.  Is there a clever way to get
attoparsec to do what I want?  Or a ByteString or Text parser out
there which is fast, but can not backtrack, or keep track of input
position?  I've heard some good things about traditional alex+happy...
of course it would mean a complete rewrite but might be interesting.
Has anyone compared the performance of attoparsec vs. alex+happy?

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] attoparsec and backtracking

2013-03-15 Thread wren ng thornton
On 3/15/13 3:29 PM, Evan Laforge wrote:
 However, which error msg shows up depends on the order of the (|)
 alternatives, and in general the global structure of the entire
 parser, because I think it just backtracks and then picks the last
 failing backtrack.  Even after carefully rearranging all the parsers
 it seems impossible to get this particular error to bubble up to the
 top.  The thing is, as soon as I see an unexpected suffix I know I can
 fail entirely right there, with exactly that error msg, but since
 there's no way to turn off backtracking I think there's no way to do
 that.

I had some similar issues recently. The trick is figuring out how to
convince attoparsec to commit to a particular alternative. For example,
consider the grammar: A (B A)* C; where if the B succeeds then we want to
commit to parsing an A (and if it fails then return A's error, not C's).

To simplify things, let's drop the leading A since it's not part of the
problem. And let's try to parse an invalid string like BX (or BABX).
The key point is that,

bad = (pB * pure (:) * pA * bad) | (pC * pure [])

is different than,

good = do
e - eitherP pB pC -- (Left $ pB) | (Right $ pC)
case e of
Left  _ - (:) $ pA * good
Right _ - pure []

In particular, the first one is bad (for our purposes) because due to
hoisting the choice up high, after parsing the B we fail to commit, so
when parsing A fails we'll backtrack over the B and try C instead.
Assuming C doesn't overlap with B, we'll then report C's error. Whereas
the latter is good because due to pushing the choice down, once we've
parsed B (or C) we're committed to that choice; so when A fails, we'll
report A's error (or backtrack to the lowest choice that dominates the
call to good).

Attoparsec does indeed just report the failure generated by the final
parse, so you'll have to refactor things to recognize which sort of token
you're looking for (e.g., p_num vs p_identifier or whatever), and then
commit to that choice before actually parsing the token. It's not very
modular that way, but I think that's the only option right now. It
shouldn't be too hard to design a combinator for doing a hard commit (by
discarding the backtrack continuation); but that has modularity issues of
its own...


Another option, of course, is to drop down to performing lexing on the
ByteString itself (e.g., [1]) and then wrap those individual lexers to
work as attoparsec Parsers. Even if using attoparsec for the heavy
lifting, this is a good approach for maximizing performance of the lexing
step.


[1] http://hackage.haskell.org/package/bytestring-lexing

-- 
Live well,
~wren


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] attoparsec and backtracking

2013-03-15 Thread Erik de Castro Lopo
Evan Laforge wrote:

 The first is that it's hard to get the right error msg out.  For
 instance, I have a parser that tries to parse a number with an
 optional type suffix.  It's an error if the suffix is unrecognized:
 
 p_num :: A.Parser Score.TypedVal
 p_num = do
 num - p_untyped_num
 suffix - A.option  ((:) $ A.letter_ascii)
 case Score.code_to_type suffix of
 Nothing - fail $ p_num expected suffix in [cdsr]:  ++ show suffix
 Just typ - return $ Score.Typed typ num

I think the mistake here is to parse something and then decide if
its it valid. It should be the parser which decides whether its
valid. So rather than:

 suffix - A.option  ((:) $ A.letter_ascii)

try:

 typ - A.choice [ {- list or valid suffix parsers -} ]
 return $ Score.Typed typ num

 However, which error msg shows up depends on the order of the (|)
 alternatives, and in general the global structure of the entire
 parser, because I think it just backtracks and then picks the last
 failing backtrack.

I'm not sure if what I've offered will help, but its worth a try.

 Even after carefully rearranging all the parsers
 it seems impossible to get this particular error to bubble up to the
 top.

Yes, I've found it impossible to force attoparsec to fail a parse.
I think that is intended as a feature.

 The thing is, as soon as I see an unexpected suffix I know I can
 fail entirely right there, with exactly that error msg, but since
 there's no way to turn off backtracking I think there's no way to do
 that.

Yes, that's my impression.

snip

 I
 originally used parsec, but parsing speed is my main bottleneck, so I
 don't want to give ground there.

We you using Parsec as a token parser or as a Char parser. Obviously
the second is going to be slow in comparison to the first.

 I've heard some good things about traditional alex+happy...
 of course it would mean a complete rewrite but might be interesting.

I've user Alex with both Parsec and Happy, where speed was strong
secondary goal. Personally I much prefer Parsec; IMO its easier to debug
and extend and modify that Happy based parsers. I also know some people
prefer Happy.

 Has anyone compared the performance of attoparsec vs. alex+happy?

I haven't, nor have I compared those two with alex+parsec. It would
be an interesting experiment.

Erik
-- 
--
Erik de Castro Lopo
http://www.mega-nerd.com/

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] attoparsec and backtracking

2013-03-15 Thread Ivan Lazar Miljenovic
On 16 March 2013 12:54, Erik de Castro Lopo mle...@mega-nerd.com wrote:
 Evan Laforge wrote:

 The first is that it's hard to get the right error msg out.  For
 instance, I have a parser that tries to parse a number with an
 optional type suffix.  It's an error if the suffix is unrecognized:

 p_num :: A.Parser Score.TypedVal
 p_num = do
 num - p_untyped_num
 suffix - A.option  ((:) $ A.letter_ascii)
 case Score.code_to_type suffix of
 Nothing - fail $ p_num expected suffix in [cdsr]:  ++ show suffix
 Just typ - return $ Score.Typed typ num

 I think the mistake here is to parse something and then decide if
 its it valid. It should be the parser which decides whether its
 valid. So rather than:

  suffix - A.option  ((:) $ A.letter_ascii)

 try:

  typ - A.choice [ {- list or valid suffix parsers -} ]
  return $ Score.Typed typ num

 However, which error msg shows up depends on the order of the (|)
 alternatives, and in general the global structure of the entire
 parser, because I think it just backtracks and then picks the last
 failing backtrack.

 I'm not sure if what I've offered will help, but its worth a try.

 Even after carefully rearranging all the parsers
 it seems impossible to get this particular error to bubble up to the
 top.

 Yes, I've found it impossible to force attoparsec to fail a parse.
 I think that is intended as a feature.

I don't know about a feature, but I tried adding polyparse-style
commit semantics to attoparsec and couldn't do so without making it
rather noticeably slower.


 The thing is, as soon as I see an unexpected suffix I know I can
 fail entirely right there, with exactly that error msg, but since
 there's no way to turn off backtracking I think there's no way to do
 that.

 Yes, that's my impression.

 snip

 I
 originally used parsec, but parsing speed is my main bottleneck, so I
 don't want to give ground there.

 We you using Parsec as a token parser or as a Char parser. Obviously
 the second is going to be slow in comparison to the first.

 I've heard some good things about traditional alex+happy...
 of course it would mean a complete rewrite but might be interesting.

 I've user Alex with both Parsec and Happy, where speed was strong
 secondary goal. Personally I much prefer Parsec; IMO its easier to debug
 and extend and modify that Happy based parsers. I also know some people
 prefer Happy.

 Has anyone compared the performance of attoparsec vs. alex+happy?

 I haven't, nor have I compared those two with alex+parsec. It would
 be an interesting experiment.

 Erik
 --
 --
 Erik de Castro Lopo
 http://www.mega-nerd.com/

 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe



-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
http://IvanMiljenovic.wordpress.com

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] attoparsec and backtracking

2013-03-15 Thread Niklas Hambüchen
Is it not possible to add an alternative (no pun intended) to | that
supports the semantics Evan wants?

I would agree that what attoparsec does for | of Alternative and mplus
for MonadPlus is correct since e.g. the mplus laws say that a failure
must be identity and therefore the following alternatives must be
considered. I also find it very convenient that attoparsec works this
way, and prefer it to what parsec does by default.

However, I do not see why attoparsec cannot have a function || that on
failure with consumed input does not evaluate the remaining alternatives.

On 16/03/13 01:54, Erik de Castro Lopo wrote:
 Evan Laforge wrote:
 However, which error msg shows up depends on the order of the (|)
 alternatives, and in general the global structure of the entire
 parser, because I think it just backtracks and then picks the last
 failing backtrack.
 
 I'm not sure if what I've offered will help, but its worth a try.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe