Re: [Pharo-dev] PetitParser speed and cost of PPFailure

Jan Vrany Sun, 16 Aug 2015 23:31:51 -0700

Hi Holger, 

Jan and me are working on a PetitCompiler [1,2], a tool whose aim
is to compile PetitParsers to Smalltalk to gain performance and 
allow for hand-tuning. In short, we try to generate a tokenizer
and LL(1) parser on top of it. In general case, given the flexibility
of PetitParser, this is rather tricky :-)


The results are - performance wise - not bad, IMO [3].
As long as your grammar is sort of LL(1) it should not backtrack
(but Jan would give you more details, this is his domain)

Now, the whole thing is work-in-progress and by no means
is - as of today - production-ready tool. I'd like to give it
a try and benchmark it as it would make nice case-study for us.
But - no promises time-wise. I'm bloody busy now and this is 
a sort of spare time-project for me.

Best, Jan 

[1] https://bitbucket.org/janvrany/stx-goodies-petitparser
[2] http://smalltalkhub.com/#!/~JanKurs/PetitParser/
[3] http://bit.ly/1IXnz0c 


On Sun, 2015-08-16 at 17:58 +0200, Holger Freyther wrote:
> Hi,
> 
> once again I am not sure if this is the right list. The first parser 
> I wrote using
> PetitParser was a SIP (and then MGCP) parser. I have recently 
> ported[1] the
> code to Pharo and with Pharo it is very tempting to Use 
> BlockClosure>>#bench
> to get an idea of the speed.
> 
> 
> I have two performance “issues” and wonder if others hand similar 
> issues with
> PetitParser and if there is a general approach to this.
> 
> 
> 
> 1.) Combining two PPCharsetPredicates does not combine the 
> “classification”
> table it had. One could create a PPPredicateObjectParser subclass 
> that is
> special casing >>#/ to build a combined classification table.
> 
> 
> 2.) When blindly following a BNF enumeration of "A or B or C or D or 
> E
> or CatchAll” and each “A, B” follow common pattern (e.g. token COLON 
> value)
> one pays a high cost in the backtracking and constructing the 
> PPFailure for
> each failed case.
> 
> In my SIPGrammar I have action parsers for To ==>.. From ==>  and 
> would
> like to keep that. At the same time I would be happy if the token in 
> front of the
> colon is only consumed once and then delegated to the right parser 
> and if that
> one failed use the ‘catch all’ one.
> 
> I don’t know which abstraction would be needed to allow creating 
> optimized
> PetitParsers for such grammars.
> 
> sorry for the long mail, long details and context is below.
> 
> 
> kind regards
>       holger
> 
> 
> 
> 
> 
> 
> Full details:
> 
> 
> 1.) CharSetPredicate
> 
> > aParser bParser combinedParser aTime bTime cTime |
> 
> aParser := #digit asParser.
> bParser := #letter asParser.
> combinedParser := aParser / bParser.
> 
> aTime := [  aParser parse: 'b'] bench.
> bTime := [ bParser parse: 'b'] bench.
> cTime := [ combinedParser parse: 'b'] bench.
> { aTime. bTime. cTime }
> 
> cTime is bounded by the time execution time of of the slowest
> of these parsers + overhead (e.g. PPFailure creation).
> 
> e.g.
> 
> #('559,000 per second.' '1,010,000 per second.' '429,000 per 
> second.')
> 
> With a proof of concept PPPredicateCharSetParser
> 
> #('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per 
> second.’)
> 
> The noise is pretty string here but what is important is that bParser 
> and the
> combinedParser are in the same ballpark.
> 
> 2.) Choice Parser
> 
> 
> 
> The BNF grammar of the parser is roughly:
> 
> Request        =  Request-Line
>                   *( message-header )
>                   CRLF
>                   [ message-body ]
> 
> message-header  =  (Accept
>               …
>               / To 
>               / From
>               / Via
>               /  extension-header) CRLF
> 
> Alert-Info   =  "Alert-Info" HCOLON alert-param *(COMMA alert-param)
> Accept         =  "Accept" HCOLON
>                    [ accept-range *(COMMA accept-range) ]
> 
> 
> So there can be several lines of “message-header”. And each method 
> header
> starts with a token/word, a colon and then the parameter. “extension
> -header”
> is kind of a catch all if no other rule matched. E.g. if a client 
> sends a To which is
> wrongly encoded it would end up with the extension-header.
> 
> I transferred the above naively to PetitParser and end up with 
> something like
> parsing ~500 messages a second. The main cost appears to come from 
> the
> choice parser that needs to create a PetitFailure all the time. E.g. 
> if you have a
> line like this:
> 
>       ‘From: “Holger Freyther” <sip:[email protected]>’
> 
> The choice parser will start with the “Accept” rule, parse the token 
> (“From” and
> then create a PPFailure, then … rules, then “To”, parse the token.. 
> So we have
> parsing the same token more than once and creating PPFailures all the 
> time. I
> ended up creating a custom parser that will peek the token, have a 
> horrible chain
> of token = ‘XYZ’ ifTrue and then dispatch to the other rule.
> 
> It would be nice if PetitParser could be taught to only parse the 
> token once and
> then delegate to the param rule. E.g. a PPAnyOfParser that allows to 
> specify the
> token to match, the parser to continue with and a fallback parser?
> 
> 
> 
> [1] 
> http://smalltalkhub.com/#!/~osmocom/SIP
> http://smalltalkhub.com/#!/~osmocom/MGCP
>

Re: [Pharo-dev] PetitParser speed and cost of PPFailure

Reply via email to