communication around new object format

Stefan Marr Wed, 30 May 2012 13:58:16 -0700

Hi Eliot:

From my experience with the RoarVM, it seems to be a rather simple exercise to 
enable the VM to support a custom 'pre-header' for objects.
That is, a constant offset in the memory that comes from the allocator, and is 
normally ignored by the GC.


That allows me to do all kind of stuff. Of course at a cost of a word per 
object, and at the cost of recompiling the VM.
But, that should be a reasonable price to pay for someone doing research on 
these kind of things.

Sometimes a few bits are just not enough, and such a pre-header gives much much 
more flexibility.
For the people interested in that, I could dig out the details (I think, I did 
that already once on this list).

Best regards
Stefan


On 30 May 2012, at 22:22, Eliot Miranda wrote:

> 
> 
> On Wed, May 30, 2012 at 12:59 PM, Stéphane Ducasse 
> <[email protected]> wrote:
> I would like to be sure that we can have
>        - bit for immutable objects
>        - bits for experimenting.
> 
> There will be quite a few.  And one will be able to steal bits from the class 
> field if one needs fewer classes.  I'm not absolutely sure of the layout yet. 
>  But for example
> 
> 8: slot size (255 => extra header word with large size)
> 3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for 
> pointer, 7 => # fixed fields is in the class)
> 4 bits: format (pointers, indexable, bytes/shorts/longs/doubles indexable, 
> compiled method, ephemerons, weak, etc)
> 1: immutability
> 3: GC 2 mark bits. 1 forwarded bit
> 20: identity hash
> 20: class index
> 
> still leaves 5 bits unused.  And stealing 4 bits each from class index still 
> leaves 64k classes.  So this format is simple and provides lots of unused 
> bits.  The format field is a great idea as it combines a number of orthogonal 
> properties in very few bits.  I don't want to include odd bytes in format 
> because I think a separate field that holds odd bytes and fixed fields is 
> better use of space.  But we can gather statistics before deciding.
> 
> 
> Stef
> 
> On May 30, 2012, at 8:48 AM, Stéphane Ducasse wrote:
> 
> > Hi guys
> >
> > Here is an important topic I would like to see discussed so that we see how 
> > we can improve and join forces.
> > May a mail discussion then a wiki for the summary would be good.
> >
> >
> > stef
> >
> >
> >
> > Begin forwarded message:
> >
> >> From: Eliot Miranda <[email protected]>
> >> Subject: Re: Plan/discussion/communication around new object format
> >> Date: May 27, 2012 10:49:54 PM GMT+02:00
> >> To: Stéphane Ducasse <[email protected]>
> >>
> >>
> >>
> >> On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse 
> >> <[email protected]> wrote:
> >> Hi eliot
> >>
> >> do you have a description of the new object format you want to introduce?
> >>
> >> The design is in the class comment of CogMemoryManager in the Cog VMMaker 
> >> packages.
> >>
> >> Then what is your schedule?
> >>
> >> This is difficult. I have made a small start and should be able to spend 
> >> time on it starting soon.  I want to have it finished by early next year.  
> >> But it depends on work schedules etc.
> >>
> >>
> >> I would like to see if we can allocate igor/esteban time before we run out 
> >> of money
> >> to help on that important topic.
> >> Now the solution is unclear and I did not see any document where we can 
> >> evaluate
> >> and plan how we can help. So do you want help on that topic? Then how can 
> >> people
> >> contribute if any?
> >>
> >> The first thing to do is to read the design document, to see if the Pharo 
> >> community thinks it is the right direction, and to review it, spot 
> >> deficiencies etc.  So please get those interested to read the class 
> >> comment of CogMemoryManager in the latest VMMaker.oscog.
> >>
> >> Here's the current version of it:
> >>
> >> CogMemoryManager is currently a place-holder for the design of the new Cog 
> >> VM's object representation and garbage collector.  The goals for the GC are
> >>
> >> - efficient object representation a la Eliot Miranda's VisualWorks 64-bit 
> >> object representation that uses a 64-bit header, eliminating direct class 
> >> references so that all objects refer to their classes indirectly.  Instead 
> >> the header contains a constant class index, in a field smaller than a full 
> >> pointer, These class indices are used in inline and first-level method 
> >> caches, hence they do not have to be updated on GC (although they do have 
> >> to be traced to be able to GC classes).  Classes are held in a sparse weak 
> >> table.  The class table needs only to be indexed by an instance's class 
> >> index in class hierarchy search, in the class primitive, and in tracing 
> >> live objects in the heap.  The additional header space is allocated to a 
> >> much expanded identity hash field, reducing hash efficiency problems in 
> >> identity collections due to the extremely small (11 bit) hash field in the 
> >> old Squeak GC.  The identity hash field is also a key element of the class 
> >> index scheme.  A class's identity hash is its index into the class table, 
> >> so to create an instance of a class one merely copies its identity hash 
> >> into the class index field of the new instance.  This implies that when 
> >> classes gain their identity hash they are entered into the class table and 
> >> their identity hash is that of a previously unused index in the table.  It 
> >> also implies that there is a maximum number of classes in the table.  At 
> >> least for a few years 64k classes should be enough.  A class is entered 
> >> into the class table in the following operations:
> >>      behaviorHash
> >>      adoptInstance
> >>      instantiate
> >>      become  (i.e. if an old class becomes a new class)
> >>              if target class field's = to original's id hash
> >>                 and replacement's id hash is zero
> >>                      enter replacement in class table
> >> behaviorHash is a special version of identityHash that must be implemented 
> >> in the image by any object that can function as a class (i.e. Behavior).
> >>
> >> - more immediate classes.  An immediate Character class would speed up 
> >> String accessing, especially for WideString, since no instatiation needs 
> >> to be done on at:put: and no dereference need be done on at:.  In a 32-bit 
> >> system tag checking is complex since it is thought important to retain 
> >> 31-bit SmallIntegers.  Hence, as in current Squeak, the least significant 
> >> bit set implies a SmallInteger, but Characters would likely have a tag 
> >> pattern of xxx10.  Hence masking with 11 results in two values for 
> >> SmallInteger, xxx01 and xxx11.  30-bit characters are more than adequate 
> >> for Unicode.  In a 64-bit system we can use the full three bits and 
> >> usefully implement an immediate Float.  As in VisualWorks a functional 
> >> representation takes three bits away from the exponent.  Rotating to put 
> >> the sign bit in the least significant non-tag bit makes expanding and 
> >> contracting the 8-bit exponent to the 11-bit IEEE double exponent easy ad 
> >> makes comparing negative and positive zero easier (an immediate Float is 
> >> zero if its unsigned 64-bits are < 16).  So the representation looks like
> >>      | 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
> >> For details see "60-bit immediate Floats" below.
> >>
> >>
> >> - efficient scavenging.  The current Squeak GC uses a slow 
> >> pointer-reversal collector that writes every field in live objects three 
> >> times in each collection, twice in the pointer-reversing heap traversal to 
> >> mark live objects and once to update the pointer to its new location.  A 
> >> scavenger writes every field of live data twice in each collection, once 
> >> as it does a block copy of the object when copying to to space, once as it 
> >> traverses the live pointers in the to space objects.  Of course the block 
> >> copy is a relatively cheap write.
> >>
> >> - lazy become.  The JIT's use of inline cacheing provides a cheap way of 
> >> avoiding scanning the heap as part of a become (which is the simple 
> >> approach to implementing become in a system with direct pointers).  A 
> >> becomeForward: on a (set of) non-zero-sized object(s) turns the object 
> >> into a "corpse" or "forwarding object" whose first (non-header) word/slot 
> >> is replaced by a pointer to the target of the becomeForward:.  The 
> >> corpse's class index is set to one that identifies corpses and, because it 
> >> is a hidden class index, will always fail an inline cache test.  The 
> >> inline cache failure code is then responsible for following the forwarding 
> >> pointer chain (these are Iliffe vectors :) ) and resolving to the actual 
> >> target.  We have yet to determine exactly how this is done (e.g. change 
> >> the receiver register and/or stack contents and retry the send, perhaps 
> >> scanning the current activation).  See below on how we deal with becomes 
> >> on objects with named inst vars.  Note that we probably don't have to 
> >> worry about zero-sized objects.  These are unlikely to be passed through 
> >> the FFI (there is nothing to pass :) ) and so will rarely be becommed.  If 
> >> they do, they can become slowly.  Alternatively we can insist that objects 
> >> are at least 16 bytes in size (see a8-byte alignment below) so that there 
> >> will always be space for a forwarding pointer.  Since none of the 
> >> immediate classes can have non-immediate instances and since we allocate 
> >> the immediate classes indices corresponding to their tag pattern 
> >> (SmallInteger = 1, Character = 3, SmallFloat = 4?) we can use all the 
> >> class indices from 0 to 7 for special uses, 0 = forward, and e.g. 1 = 
> >> header-sized filler.
> >>
> >> - pinning.  To support a robust and easy-to-use FFI the memory manager 
> >> must support temporary pinning where individual objects can be prevented 
> >> from being moved by the GC for as long as required, either by being one of 
> >> an in-progress FFI call's arguments, or by having pinning asserted by a 
> >> primitive (allowing objects to be passed to external code that retains a 
> >> reference to the object after returning).  Pinning probably implies a 
> >> per-object "is-pinned" bit in the object header.  Pinning will be done via 
> >> lazy become; i..e an object in new space will be becommed into a pinned 
> >> object in old space.  We will only support pinning in old space.
> >>
> >> - efficient old space collection.  An incremental collector (a la 
> >> Dijkstra's three colour algorithm) collects old space, e.g. via an amount 
> >> of tracing being hung off scavenges and/or old space allocations at an 
> >> adaptive rate that keeps full garbage collections to a minimum.  (see free 
> >> space/free list below)
> >>
> >> - 8-byte alignment.  It is advantageous for the FFI, for floating-point 
> >> access, for object movement and for 32/64-bit compatibility to keep object 
> >> sizes in units of 8 bytes.  For the FFI, 8-byte alignment means passing 
> >> objects to code that expects that requirement (such as modern x86 numeric 
> >> processing instructions).  This implies that
> >>      - the starts of all spaces are aligned on 8-byte boundaries
> >>      - object allocation rounds up the requested size to a multiple of 8 
> >> bytes
> >>      - the overflow size field is also 8 bytes
> >> We shall probably keep the minimum object size at 16 bytes so that there 
> >> is always room for a forwarding pointer.  But this implies that we will 
> >> need to implement an 8-byte filler to fill holes between objects > 16 
> >> bytes whose length mod 16 bytes is 8 bytes and following pinned objects.  
> >> We can do this using a special class index, e.g. 1, so that the method 
> >> that answers the size of an object looks like, e.g.
> >>      chunkSizeOf: oop
> >>              <var: #oop type: #'object *'>
> >>              ^object classIndex = 1
> >>                      ifTrue: [BaseHeaderSize]
> >>                      ifFalse: [BaseHeaderSize
> >>                                + (object slotSize = OverflowSlotSize
> >>                                              ifTrue: [OverflowSizeBytes]
> >>                                              ifFalse: [0])
> >>                                + (object slotSize * BytesPerSlot)]
> >>
> >>      chunkStartOf: oop
> >>              <var: #oop type: #'object *'>
> >>              ^(self cCoerceSimple: oop to: #'char *')
> >>                      - ((object classIndex = 1
> >>                          or: [object slotSize ~= OverflowSlotSize])
> >>                                      ifTrue: [0]
> >>                                      ifFalse: [OverflowSizeBytes])
> >>
> >> For the moment we do not tackle the issue of heap growth and shrinkage 
> >> with the ability to allocate and deallocate heap segments via 
> >> memory-mapping.  This technique allows space to be released back to the OS 
> >> by unmapping empty segments.  We may revisit this but it is not a key 
> >> requirement for the first implementation.
> >>
> >> The basic approach is to use a fixed size new space and a growable old 
> >> space.  The new space is a classic three-space nursery a la Ungar's 
> >> Generation Scavenging, a large eden for new objects and two smaller 
> >> survivor spaces that exchange roles on each collection, one being the to 
> >> space to which surviving objects are copied, the other being the from 
> >> space of the survivors of the previous collection, i.e. the previous to 
> >> space.
> >>
> >> To provide apparent pinning in new space we rely on lazy become.  Since 
> >> most pinned objects will be byte data and these do not require stack zone 
> >> activation scanning, the overhead is simply an old space allocation and 
> >> corpsing.
> >>
> >> To provide pinning in old space, large objects are implicitly pinned 
> >> (because it is expensive to move large objects and, because they are both 
> >> large and relatively rare, they contribute little to overall fragmentation 
> >> - as in aggregates, small objects can be used to fill-in the spaces 
> >> between karge objects).  Hence, objects above a particular size are 
> >> automatically allocated in old space, rather than new space.  Small 
> >> objects are pinned as per objects in new space, by asserting the pin bit, 
> >> which will be set automaticaly when allocating a large object.  As a last 
> >> resort, or by programmer control (the fullGC primitive) old space is 
> >> collected via mark-sweep (mark-compact) and so the mark phase must build 
> >> the list of pinned objects around which the sweep/compact phase must 
> >> carefully step.
> >>
> >> Free space in old space is organized by a free list/free tree as in 
> >> Eliot's VisualWorks 5i old space allocator.  There are 64 free lists, 
> >> indices 1 through 63 holding blocks of space of that size, index 0 holding 
> >> a semi-balanced ordered tree of free blocks, each node being the head of 
> >> the list of free blocks of that size.  At the start of the mark phase the 
> >> free list is thrown away and the sweep phase coallesces free space and 
> >> steps over pinned objects as it proceeds.  We can reuse the forwarding 
> >> pointer compaction scheme used in the old collector.  Incremental 
> >> collections merely move unmarked objects to the free lists (as well as 
> >> nilling weak pointers in weak arrays and scheduling them for 
> >> finalization).  The occupancy of the free lists is represented by a bitmap 
> >> in a 64-bit integer so that an allocation of size 63 or less can know 
> >> whether there exists a free chunk of that size, but more importantly can 
> >> know whether a free chunk larger than it exists in the fixed size free 
> >> lists without having to search all larger free list heads.
> >>
> >> The incremental collector (a la Dijkstra's three colour algorithm) 
> >> collects old space via an amount of tracing being hung off scavenges 
> >> and/or old space allocations at an adaptive rate that keeps full garbage 
> >> collections to a minimum.  [N.B. Not sure how to do this yet.  The 
> >> incremental collector needs to complete a pass often enough to reclaim 
> >> objects, but infrequent enough not to waste time.  So some form of 
> >> feedback should work.  In VisualWorks tracing is broken into quanta or 
> >> work where image-level code determines the size of a quantum based on how 
> >> fast the machine is, and how big the heap is.  This code could easily live 
> >> in the VM, controllable through vmParameterAt:put:.  An alternative would 
> >> be to use the heartbeat to bound quanta by time.  But in any case some 
> >> amount of incremental collection would be done on old space allocation and 
> >> scavenging, the ammount being chosen to keep pause times acceptably short, 
> >> and at a rate to reclaim old space before a full GC is required, i.e. at a 
> >> rate proportional to the growth in old space]. The incemental collector is 
> >> a state machine, being either marking, nilling weak pointers, or freeing.  
> >> If nilling weak pointers is not done atomically then there must be a read 
> >> barrier in weak array at: so that reading from an old space weak array 
> >> that is holding stale un-nilled references to unmarked objects.  Tricks 
> >> such as including the weak bit in bounds calculations can make this cheap 
> >> for non-weak arrays.  Alternatively nilling weak pointers can be made an 
> >> atomic part of incremental collection, which can be made cheaper by 
> >> maintaining the set of weak arrays (e.g. on a list).
> >>
> >> The incremental collector implies a more complex write barrier.  Objects 
> >> are of three colours, black, having been scanned, grey, being scanned, and 
> >> white, unreached.  A mark stack holds the grey objects.   If the 
> >> incremental collector is marking and an unmarked white object is stored 
> >> into a black object then the stored object must become grey, being added 
> >> to the mark stack.  So the wrte barrier is essentially
> >>      target isYoung ifFalse:
> >>              [newValue isYoung
> >>                      ifTrue: [target isInRememberedSet ifFalse:
> >>                                      [target addToRememberedSet]] "target 
> >> now refers to a young object; it is a root for scavenges"
> >>                      ifFalse:
> >>                              [(target isBlack
> >>                                and: [igc marking
> >>                                and: [newValue isWhite]]) ifTrue:
> >>                                      [newValue beGrey]]] "add newValue to 
> >> IGC's markStack for subsequent scanning"
> >>
> >> The incremental collector does not detect already marked objects all of 
> >> whose references have been overwritten by other stores (e.g. in the above 
> >> if newValue overwrites the sole remaining reference to a marked object).  
> >> So the incremental collector only guarantees to collect all garbage 
> >> created in cycle N at the end of cycle N + 1.  The cost is hence slightly 
> >> worse memory density but the benefit, provided the IGC works hard enough, 
> >> is the elimination of long pauses due to full garbage collections, which 
> >> become actions of last resort or programmer desire.
> >>
> >> Lazy become.
> >>
> >> As described earlier the basic idea behind lazy become is to use corpses 
> >> (forwarding objects) that are followed lazily during GC and inline cache 
> >> miss.  However, a lazy scheme cannot be used on objects with named inst 
> >> vars without adding checking to all inst var accesses, which we judge too 
> >> expensive.  Instead, when becomming objects with named inst vars, we scan 
> >> all activations in the stack zone, eagerly becomming these references, and 
> >> we check for corpses when faulting in a context into the stack zone.  
> >> Essentially, the invariant is that there are no references to corpses from 
> >> the receiver slots of stack activations.  A detail is whether we allow or 
> >> forbid pinning of closure indirection vectors, or scan the entire stack of 
> >> each activation.  Using a special class index pun for indirection vectors 
> >> is a cheap way of preventing their becomming/pinning etc.  Although "don't 
> >> do that" (don't attempt to pin/become indirection vectors) is also an 
> >> acceptable response.
> >>
> >> 60-bit immediate Floats
> >> Representation for immediate doubles, only used in the 64-bit 
> >> implementation. Immediate doubles have the same 52 bit mantissa as IEEE 
> >> double-precision  floating-point, but only have 8 bits of exponent.  So 
> >> they occupy just less than the middle 1/8th of the double range.  They 
> >> overlap the normal single-precision floats which also have 8 bit 
> >> exponents, but exclude the single-precision denormals (exponent-127) and 
> >> the single-precsion NaNs (exponent +127).  +/- zero is just a pair of 
> >> values with both exponent and mantissa 0.
> >> So the non-zero immediate doubles range from
> >>         +/- 0x3800,0000,0000,0001 / 5.8774717541114d-39
> >> to      +/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38
> >> The encoded tagged form has the sign bit moved to the least significant 
> >> bit, which allows for faster encode/decode because offsetting the exponent 
> >> can't overflow into the sign bit and because testing for +/- 0 is an 
> >> unsigned compare for <= 0xf:
> >>     msb                                                                    
> >>                     lsb
> >>     [8 exponent subset bits][52 mantissa bits ][1 sign bit][3 tag bits]
> >> So assuming the tag is 5, the tagged non-zero bit patterns are
> >>              0x0000,0000,0000,001[d/5]
> >> to           0xffff,ffff,ffff,fff[d/5]
> >> and +/- 0d is 0x0000,0000,0000,000[d/5]
> >> Encode/decode of non-zero values in machine code looks like:
> >>                                              msb                           
> >>                    lsb
> >> Decode:                              [8expsubset][52mantissa][1s][3tags]
> >> shift away tags:                     [ 000 ][8expsubset][52mantissa][1s]
> >> add exponent offset: [     11 exponent     ][52mantissa][1s]
> >> rot sign:                            [1s][     11 exponent     
> >> ][52mantissa]
> >>
> >> Encode:                                      [1s][     11 exponent     
> >> ][52mantissa]
> >> rot sign:                            [     11 exponent     
> >> ][52mantissa][1s]
> >> sub exponent offset: [ 000 ][8expsubset][52 mantissa][1s]
> >> shift:                                       [8expsubset][52 
> >> mantissa][1s][ 000 ]
> >> or/add tags:                 [8expsubset][52mantissa][1s][3tags]
> >> but is slower in C because
> >> a) there is no rotate, and
> >> b) raw conversion between double and quadword must (at least in the 
> >> source) move bits through memory ( quadword = *(q64 *)&doubleVariable).
> >>
> >> Issues:
> >> How do we avoid the Size4Bit for 64-bits?  The format word encodes the 
> >> number of odd bytes, but currently has only 4 bits and hence only supports 
> >> odd bytes of 0 - 3.  We need odd bytes of 0 - 7.  But I don't like the 
> >> separate Size4Bit.  Best to change the VI code and have a 5 bit format?  
> >> We loose one bit but save two bits (isEphemeron and isWeak (or three, if 
> >> isPointers)) for a net gain of one (or two)
> >>
> >> Further, keep Squeak's format idea or go for separate bits?  For 64-bits 
> >> we need a 5 bit format field.  This contrasts with isPointers, isWeak, 
> >> isEphemeron, 3 odd size bits (or byte size)..  format field is quite 
> >> economical.
> >>
> >> Are class indices in inline caches strong references to classes or weak 
> >> references?
> >> If strong then they must be scanned during GC and the methodZone must be 
> >> flushed on fullGC to reclaim all classes (this looks to be a bug in the V3 
> >> Cogit).
> >> If weak then when the class table loses references, PICs containing freed 
> >> classes must be freed and then sends to freed PICs or containing freed 
> >> clases must be unlinked.
> >> The second approach is faster; the common case is scanning the class 
> >> table, the uncommon case is freeing classes.  The second approach is 
> >> better; in-line caches do not prevent reclamation of classes.
> >>
> >>
> >> Stef
> >>
> >>
> >>
> >> --
> >> best,
> >> Eliot
> >>
> >
> 
> 
> 
> 
> 
> -- 
> best,
> Eliot
> 

-- 
Stefan Marr
Software Languages Lab
Vrije Universiteit Brussel
Pleinlaan 2 / B-1050 Brussels / Belgium
http://soft.vub.ac.be/~smarr
Phone: +32 2 629 2974
Fax:   +32 2 629 3525

Re: [Pharo-project] [Vm-dev] Re: Plan/discussion/communication around new object format

Reply via email to