Re: [Pharo-project] PetitSmalltalk: now with a self-contained grammar for number literals

Frank Shearar Thu, 26 Apr 2012 08:10:57 -0700

On 26 April 2012 16:01, Levente Uzonyi <[email protected]> wrote:
> On Thu, 26 Apr 2012, Frank Shearar wrote:
>
>> On 26 April 2012 03:35, Levente Uzonyi <[email protected]> wrote:
>>>
>>> On Wed, 25 Apr 2012, Levente Uzonyi wrote:
>>>
>>>> On Wed, 25 Apr 2012, Frank Shearar wrote:
>>>>
>>>>> On 20 April 2012 15:15, Frank Shearar <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> On 20 April 2012 03:51, Levente Uzonyi <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 19 Apr 2012, Frank Shearar wrote:
>>>>>>>
>>>>>>>> I found a serious bug in parsing numbers with negative exponents. It
>>>>>>>> was completey broken, in fact, parsing 1e-1 as 10, not 1 / 10.
>>>>>>>> Anyway.
>>>>>>>> This version fixes that, and adds a bunch of tests demonstrating
>>>>>>>> that
>>>>>>>> number parsing will return rationals if it can.
>>>>>>>>
>>>>>>>> It's significantly slower than Squeak's SqNumberParser:
>>>>>>>>
>>>>>>>> Time millisecondsToRun: [100000 timesRepeat: [SqNumberParser parse:
>>>>>>>> '1234567890']] => 466
>>>>>>>>
>>>>>>>> Time millisecondsToRun: [100000 timesRepeat:
>>>>>>>> [PPSmalltalkNumberParser
>>>>>>>> parse: '1234567890']] => 32082
>>>>>>>>
>>>>>>>> I've attached a MessageTally spying on the latter: I've not much
>>>>>>>> skill
>>>>>>>> in reading these, but nothing leaps out at me as being obviously
>>>>>>>> awful.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Didn't check the code, just the tally, and I think that
>>>>>>> PPSmalltalkNumberParser(PPSmalltalkNumberGrammar)>>digitsBase: is
>>>>>>> begging
>>>>>>> for optimization. It's probably also the cause of the high amount of
>>>>>>> garbage
>>>>>>> which causes significant amount of time spent with garbage
>>>>>>> collection.
>>>>>>> It's also interesting is that the finalization process does so much
>>>>>>> work,
>>>>>>> there may be something wrong with your image.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks for taking a look, Levente.
>>>>>>
>>>>>> I'd expect digitsBase: to dominate the running costs, given that we're
>>>>>> parsing numbers.
>>>>>>
>>>>>> I do make a large number of throwaway "immutable" values with a
>>>>>> Builder-like pattern... in PPSmalltalkNumberParser >>
>>>>>> #makeNumberFrom:base:. That, I would imagine, could explain the
>>>>>> garbage?
>>>>>>
>>>>>> If I may, what do you look for when reading the MessageTally? How do
>>>>>> you tell, for instance, that there's excessive garbage production?
>>>>>> That the incremental GCs take 7ms? (I'm reading Andreas' comments on
>>>>>> http://wiki.squeak.org/squeak/4210 again.)
>>>>>
>>>>>
>>>>>
>>>>> Levente, you're quite right: #digitsBase: has now been optimised even
>>>>> more, reducing the time taken to run my benchmark
>>>>>
>>>>> MessageTally spyOn: [Time millisecondsToRun: [100000 timesRepeat:
>>>>> [PPSmalltalkNumberParser parse: '1234567890']]]
>>>>>
>>>>> from ~32 seconds to ~16 seconds. (Memoising was the answer:
>>>>> #digitsBase: is effectively a higher-order production and, like OMeta,
>>>>> PPCompositeParser doesn't memoise those. A simple class var dictionary
>>>>> solves that problem.
>>>>
>>>>
>>>>
>>>> Creating a new parser (PPSmalltalkNumberParser) takes quite a while, if
>>>> you extract the parser creation, then you'll get closer to the
>>>> performance
>>>> limit of PetitParser:
>>>>
>>>> | p |
>>>> p := PPSmalltalkNumberParser new.
>>>> [ Time millisecondsToRun: [100000 timesRepeat: [p parse: '1234567890']]
>>>> ]
>>>> timeProfile. "==> 2594"
>>>>
>>>> There are still places where you create parsers on the fly, like
>>>> PPSmalltalkNumberGrammar #>> number. These should be avoided.
>>>
>>>
>>>
>>> I hacked the code a bit and got the following results:
>>>
>>> (1 to: 5) collect: [ :run |
>>>
>>>        | p |
>>>        p := PPSmalltalkNumberParser new.
>>>        Smalltalk garbageCollect.
>>>        [ 1 to: 100000 do: [ :i | p parse: '1234567890' ] ] timeToRun ].
>>>
>>> #(1198 1209 1225 1221 1204)
>>>
>>> It's ~6 times slower than SqNumberParser and I don't really think it can
>>> get
>>> much closer to it. The mcz is available here:
>>> http://leves.web.elte.hu/squeak/PetitSmalltalk-ul.62.mcz
>>
>>
>> Ah excellent! Thanks! I see you eagerly initialise the number parsers
>> where I'd gone the lazy/memoise route. Any performance implications I
>> should be aware of, other than of course that eager initialisation
>
>
> There's only one perdicate parser (instead of two + a choice parser) and a
> String is passed as a collection to the PPPredicateParser, so #includes: can
> use a fast primitive.
>
>
>> means keeping 36 parsers around when maybe three or four of those will
>> see regular use? (decimal, hex, binary, octal) For instance, I guess
>> the running cost of Dictionary >> #at:ifAbsentPut: is greater than
>> Array >> #at: because the latter just uses a primitive.
>
>
> Lazy initialization of the parsers is also possible, but running the tests
> will initialize 35 of them. I wonder if holding the parsers in a class
> variable has any side effects. Anyway, I uploaded a new version which has
> lazy initialization:
> http://leves.web.elte.hu/squeak/PetitSmalltalk-ul.65.mcz


Er, yes. That should have been obvious. I feel a bit silly now! In
which case there's nothing to decide either way.

I'm unhappy using a class-side cache: it means consuming memory
forever when you only need (some of) the 35 parsers during parsing.
But PPCompositeParser works by initialising ALL instvars to
PPDelegateParsers, so it's a bit complicated putting non-parser things
in the instvars. It worked when I memoised the original digit parsers
(so when I moved DigitParsers -> digitParsers) but broke when I
applied the same technique elsewhere. Presumably I was lucky the first
time round, and it only worked because of where in the initialisation
I initialised the dictionary.

If I could browbeat the grammar to use an instvar to hold the digit
parsers it would make a lot of sense to switch to lazily instantiating
the number parsers: you're likely only to instantiate at most four of
them at a time, unless you're being very tricky with your arithmetic.

frank

> Levente
>
>
>>
>> Lastly, I added PetitSmalltalk-fbs.64 to
>> http://ss3.gemstone.com/ss/Scratchpad-fbs/, based on your .62. It just
>> normalises how signs work. Why I didn't write the exponent sign as a
>> string from the beginning (like I do the number's sign), I don't know.
>> At any rate, that normalises NumberMaker's API (it always takes
>> Strings) and keeps the cheaper x = '-' rather than #($- '-') includes:
>> x, and makes the tests all pass :)
>>
>> And now I'm going to add a feature request for SS3, asking for a
>> RESTful interface to the repositories!
>>
>> frank
>>
>>> Levente
>>>
>>>
>>>>
>>>>
>>>> Levente
>>>>
>>>>
>>>>>
>>>>> frank
>>>>>
>>>>>> frank
>>>>>>
>>>>>>> Levente
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> frank
>>>>>>>>
>>>>>>>> On 14 September 2011 20:26, Frank Shearar <[email protected]>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3 September 2011 19:35, Nicolas Cellier
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2011/9/3 Frank Shearar <[email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 3 September 2011 18:50, Lukas Renggli <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I think it is a good idea to have the number parser separate,
>>>>>>>>>>>> after
>>>>>>>>>>>> all it might also make sense to use it separately.
>>>>>>>>>>>>
>>>>>>>>>>>> It seems that the new Smalltalk grammar is significantly slower.
>>>>>>>>>>>> The
>>>>>>>>>>>> benchmark PPSmalltalkClassesTests class>>#benchmark: that uses
>>>>>>>>>>>> the
>>>>>>>>>>>> source code of the collection hierarchy and does not especially
>>>>>>>>>>>> target
>>>>>>>>>>>> number literals runs 30% slower.
>>>>>>>>>>>>
>>>>>>>>>>>> Also I see that "Number readFrom: ..." is still used within the
>>>>>>>>>>>> grammar. This seems to be a bit strange, no?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Yes: it's a double-parse, which is a bit lame. First, we parse
>>>>>>>>>>> the
>>>>>>>>>>> literal with PPSmalltalkNumberParser, which ensures that the
>>>>>>>>>>> thing
>>>>>>>>>>> given to Number class >> #readFrom: is a well-formed token (so,
>>>>>>>>>>> in
>>>>>>>>>>> particular, Squeak's Number doesn't get to see anything other
>>>>>>>>>>> than
>>>>>>>>>>> a
>>>>>>>>>>> well-formed token).
>>>>>>>>>>>
>>>>>>>>>>> It sounds like you're happy with the basic concept, so maybe I
>>>>>>>>>>> should
>>>>>>>>>>> remove the Number class >> #readFrom: stuff, see if I can't
>>>>>>>>>>> remove
>>>>>>>>>>> the
>>>>>>>>>>> performance issues, and resubmit the patch.
>>>>>>>>>>>
>>>>>>>>>>> frank
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, a NumberParser is essentially parsing, and this duplication
>>>>>>>>>> sounds
>>>>>>>>>> useless.
>>>>>>>>>> The main feature of interest in NumberParser that I consider a
>>>>>>>>>> requirement and should find its equivalence in a PetitNumberParser
>>>>>>>>>> is:
>>>>>>>>>> - round a decimal representation to nearest Float
>>>>>>>>>> It's simple, just convert a Fraction asFloat in a single final
>>>>>>>>>> step
>>>>>>>>>> to
>>>>>>>>>> avoid cumulating round off errors - see
>>>>>>>>>> #makeFloatFromMantissa:exponent:base:
>>>>>>>>>>
>>>>>>>>>> The second feature of interest in NumberParser is the ability to
>>>>>>>>>> parser LargeInteger efficiently by avoiding (10 * largeValue +
>>>>>>>>>> digitValue) loops, and replacing them with a log(n) cost.
>>>>>>>>>> This would be a simple thing to implement in a functional
>>>>>>>>>> language.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hopefully this won't offend your sensibilities too much :). It
>>>>>>>>> does,
>>>>>>>>> in fact, use 10* loops - I wrote an experimental "front half * rear
>>>>>>>>> half" recursion, which was slower in my benchmarks.
>>>>>>>>>
>>>>>>>>> This version has the grammar and parser doing no string->number
>>>>>>>>> conversion at all. PPSmalltalkNumberMaker supplies a number of
>>>>>>>>> utility
>>>>>>>>> methods designed to stop one from making malformed numbers. It also
>>>>>>>>> supplies a builder interface that the parser uses to construct
>>>>>>>>> numbers.
>>>>>>>>>
>>>>>>>>> frank
>>>>>>>>>
>>>>>>>>>> Nicolas
>>>>>>>>>>
>>>>>>>>>>>> Lukas
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 3 September 2011 17:18, Frank Shearar
>>>>>>>>>>>> <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 3 September 2011 15:56, Lukas Renggli <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 3 September 2011 16:51, Frank Shearar
>>>>>>>>>>>>>> <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Lukas,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I haven't :) mainly because I'm unsure where to put it - is
>>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>> perhaps a PP Inbox, or shall I just post the merged version,
>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>> what's
>>>>>>>>>>>>>>> your preference? (How about an mcd between my merge and PP's
>>>>>>>>>>>>>>> head?)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just put the .mcz at some public URL (dropbox, squeak source,
>>>>>>>>>>>>>> ...)
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> attach it to a mail.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ah, great - here it is. You'll see I've written the grammar as
>>>>>>>>>>>>> a
>>>>>>>>>>>>> separate class. That was really more to make what I'd done more
>>>>>>>>>>>>> obvious and to minimise the change to PPSmalltalkGrammar, but
>>>>>>>>>>>>> perhaps
>>>>>>>>>>>>> it's not a bad idea anyway: it's easy to see the number literal
>>>>>>>>>>>>> subgrammar.
>>>>>>>>>>>>>
>>>>>>>>>>>>> frank
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lukas
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Lukas Renggli
>>>>>>>>>>>>>> www.lukas-renggli.ch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Lukas Renggli
>>>>>>>>>>>> www.lukas-renggli.ch
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: [Pharo-project] PetitSmalltalk: now with a self-contained grammar for number literals

Reply via email to