Hi Holger, Jan and me are working on a PetitCompiler [1,2], a tool whose aim is to compile PetitParsers to Smalltalk to gain performance and allow for hand-tuning. In short, we try to generate a tokenizer and LL(1) parser on top of it. In general case, given the flexibility of PetitParser, this is rather tricky :-)
The results are - performance wise - not bad, IMO [3]. As long as your grammar is sort of LL(1) it should not backtrack (but Jan would give you more details, this is his domain) Now, the whole thing is work-in-progress and by no means is - as of today - production-ready tool. I'd like to give it a try and benchmark it as it would make nice case-study for us. But - no promises time-wise. I'm bloody busy now and this is a sort of spare time-project for me. Best, Jan [1] https://bitbucket.org/janvrany/stx-goodies-petitparser [2] http://smalltalkhub.com/#!/~JanKurs/PetitParser/ [3] http://bit.ly/1IXnz0c On Sun, 2015-08-16 at 17:58 +0200, Holger Freyther wrote: > Hi, > > once again I am not sure if this is the right list. The first parser > I wrote using > PetitParser was a SIP (and then MGCP) parser. I have recently > ported[1] the > code to Pharo and with Pharo it is very tempting to Use > BlockClosure>>#bench > to get an idea of the speed. > > > I have two performance “issues” and wonder if others hand similar > issues with > PetitParser and if there is a general approach to this. > > > > 1.) Combining two PPCharsetPredicates does not combine the > “classification” > table it had. One could create a PPPredicateObjectParser subclass > that is > special casing >>#/ to build a combined classification table. > > > 2.) When blindly following a BNF enumeration of "A or B or C or D or > E > or CatchAll” and each “A, B” follow common pattern (e.g. token COLON > value) > one pays a high cost in the backtracking and constructing the > PPFailure for > each failed case. > > In my SIPGrammar I have action parsers for To ==>.. From ==> and > would > like to keep that. At the same time I would be happy if the token in > front of the > colon is only consumed once and then delegated to the right parser > and if that > one failed use the ‘catch all’ one. > > I don’t know which abstraction would be needed to allow creating > optimized > PetitParsers for such grammars. > > sorry for the long mail, long details and context is below. > > > kind regards > holger > > > > > > > Full details: > > > 1.) CharSetPredicate > > > aParser bParser combinedParser aTime bTime cTime | > > aParser := #digit asParser. > bParser := #letter asParser. > combinedParser := aParser / bParser. > > aTime := [ aParser parse: 'b'] bench. > bTime := [ bParser parse: 'b'] bench. > cTime := [ combinedParser parse: 'b'] bench. > { aTime. bTime. cTime } > > cTime is bounded by the time execution time of of the slowest > of these parsers + overhead (e.g. PPFailure creation). > > e.g. > > #('559,000 per second.' '1,010,000 per second.' '429,000 per > second.') > > With a proof of concept PPPredicateCharSetParser > > #('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per > second.’) > > The noise is pretty string here but what is important is that bParser > and the > combinedParser are in the same ballpark. > > 2.) Choice Parser > > > > The BNF grammar of the parser is roughly: > > Request = Request-Line > *( message-header ) > CRLF > [ message-body ] > > message-header = (Accept > … > / To > / From > / Via > / extension-header) CRLF > > Alert-Info = "Alert-Info" HCOLON alert-param *(COMMA alert-param) > Accept = "Accept" HCOLON > [ accept-range *(COMMA accept-range) ] > > > So there can be several lines of “message-header”. And each method > header > starts with a token/word, a colon and then the parameter. “extension > -header” > is kind of a catch all if no other rule matched. E.g. if a client > sends a To which is > wrongly encoded it would end up with the extension-header. > > I transferred the above naively to PetitParser and end up with > something like > parsing ~500 messages a second. The main cost appears to come from > the > choice parser that needs to create a PetitFailure all the time. E.g. > if you have a > line like this: > > ‘From: “Holger Freyther” <sip:[email protected]>’ > > The choice parser will start with the “Accept” rule, parse the token > (“From” and > then create a PPFailure, then … rules, then “To”, parse the token.. > So we have > parsing the same token more than once and creating PPFailures all the > time. I > ended up creating a custom parser that will peek the token, have a > horrible chain > of token = ‘XYZ’ ifTrue and then dispatch to the other rule. > > It would be nice if PetitParser could be taught to only parse the > token once and > then delegate to the param rule. E.g. a PPAnyOfParser that allows to > specify the > token to match, the parser to continue with and a fallback parser? > > > > [1] > http://smalltalkhub.com/#!/~osmocom/SIP > http://smalltalkhub.com/#!/~osmocom/MGCP >
