Hi Holger, Thank you for bringing this.
In the first case, I suggest to create a specialised parser, that fills in the classification as you desire. In the second case, the left refactoring of the common prefix is the way to go. As you are speaking about general solution, there is a PetitParser rewriter, that can do what you mentioned. Unfortunately I cannot provide more details as I never really got into this. The other approach, as the first Jan already mentioned, is a compiler that takes PetitParser as an input and produces a class representing a parser that recognizes equivalent language and that should resemble hand-written top-down parser. During the compilation, we do many optimizations and I believe we can get decent performance. As the compiler is not production ready and as we are looking for a use cases, I can import your grammar and let you know how to make it working with our tool and what would be the performance gain. In that case, I would need some test cases and benchmark inputs from you. Cheers Jan On Mon, 17 Aug 2015 08:30 Jan Vrany <[email protected]> wrote: > Hi Holger, > > Jan and me are working on a PetitCompiler [1,2], a tool whose aim > is to compile PetitParsers to Smalltalk to gain performance and > allow for hand-tuning. In short, we try to generate a tokenizer > and LL(1) parser on top of it. In general case, given the flexibility > of PetitParser, this is rather tricky :-) > > The results are - performance wise - not bad, IMO [3]. > As long as your grammar is sort of LL(1) it should not backtrack > (but Jan would give you more details, this is his domain) > > Now, the whole thing is work-in-progress and by no means > is - as of today - production-ready tool. I'd like to give it > a try and benchmark it as it would make nice case-study for us. > But - no promises time-wise. I'm bloody busy now and this is > a sort of spare time-project for me. > > Best, Jan > > [1] https://bitbucket.org/janvrany/stx-goodies-petitparser > [2] http://smalltalkhub.com/#!/~JanKurs/PetitParser/ > [3] http://bit.ly/1IXnz0c > > > On Sun, 2015-08-16 at 17:58 +0200, Holger Freyther wrote: > > Hi, > > > > once again I am not sure if this is the right list. The first parser > > I wrote using > > PetitParser was a SIP (and then MGCP) parser. I have recently > > ported[1] the > > code to Pharo and with Pharo it is very tempting to Use > > BlockClosure>>#bench > > to get an idea of the speed. > > > > > > I have two performance “issues” and wonder if others hand similar > > issues with > > PetitParser and if there is a general approach to this. > > > > > > > > 1.) Combining two PPCharsetPredicates does not combine the > > “classification” > > table it had. One could create a PPPredicateObjectParser subclass > > that is > > special casing >>#/ to build a combined classification table. > > > > > > 2.) When blindly following a BNF enumeration of "A or B or C or D or > > E > > or CatchAll” and each “A, B” follow common pattern (e.g. token COLON > > value) > > one pays a high cost in the backtracking and constructing the > > PPFailure for > > each failed case. > > > > In my SIPGrammar I have action parsers for To ==>.. From ==> and > > would > > like to keep that. At the same time I would be happy if the token in > > front of the > > colon is only consumed once and then delegated to the right parser > > and if that > > one failed use the ‘catch all’ one. > > > > I don’t know which abstraction would be needed to allow creating > > optimized > > PetitParsers for such grammars. > > > > sorry for the long mail, long details and context is below. > > > > > > kind regards > > holger > > > > > > > > > > > > > > Full details: > > > > > > 1.) CharSetPredicate > > > > > aParser bParser combinedParser aTime bTime cTime | > > > > aParser := #digit asParser. > > bParser := #letter asParser. > > combinedParser := aParser / bParser. > > > > aTime := [ aParser parse: 'b'] bench. > > bTime := [ bParser parse: 'b'] bench. > > cTime := [ combinedParser parse: 'b'] bench. > > { aTime. bTime. cTime } > > > > cTime is bounded by the time execution time of of the slowest > > of these parsers + overhead (e.g. PPFailure creation). > > > > e.g. > > > > #('559,000 per second.' '1,010,000 per second.' '429,000 per > > second.') > > > > With a proof of concept PPPredicateCharSetParser > > > > #('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per > > second.’) > > > > The noise is pretty string here but what is important is that bParser > > and the > > combinedParser are in the same ballpark. > > > > 2.) Choice Parser > > > > > > > > The BNF grammar of the parser is roughly: > > > > Request = Request-Line > > *( message-header ) > > CRLF > > [ message-body ] > > > > message-header = (Accept > > … > > / To > > / From > > / Via > > / extension-header) CRLF > > > > Alert-Info = "Alert-Info" HCOLON alert-param *(COMMA alert-param) > > Accept = "Accept" HCOLON > > [ accept-range *(COMMA accept-range) ] > > > > > > So there can be several lines of “message-header”. And each method > > header > > starts with a token/word, a colon and then the parameter. “extension > > -header” > > is kind of a catch all if no other rule matched. E.g. if a client > > sends a To which is > > wrongly encoded it would end up with the extension-header. > > > > I transferred the above naively to PetitParser and end up with > > something like > > parsing ~500 messages a second. The main cost appears to come from > > the > > choice parser that needs to create a PetitFailure all the time. E.g. > > if you have a > > line like this: > > > > ‘From: “Holger Freyther” <sip:[email protected]>’ > > > > The choice parser will start with the “Accept” rule, parse the token > > (“From” and > > then create a PPFailure, then … rules, then “To”, parse the token.. > > So we have > > parsing the same token more than once and creating PPFailures all the > > time. I > > ended up creating a custom parser that will peek the token, have a > > horrible chain > > of token = ‘XYZ’ ifTrue and then dispatch to the other rule. > > > > It would be nice if PetitParser could be taught to only parse the > > token once and > > then delegate to the param rule. E.g. a PPAnyOfParser that allows to > > specify the > > token to match, the parser to continue with and a fallback parser? > > > > > > > > [1] > > http://smalltalkhub.com/#!/~osmocom/SIP > > http://smalltalkhub.com/#!/~osmocom/MGCP > > >
