Hi guys

Here is an important topic I would like to see discussed so that we can see 
how to improve and join forces.
Maybe a mail discussion and then a wiki for the summary would be good.


stef



Begin forwarded message:

> From: Eliot Miranda <[email protected]>
> Subject: Re: Plan/discussion/communication around new object format
> Date: May 27, 2012 10:49:54 PM GMT+02:00
> To: Stéphane Ducasse <[email protected]>
> 
> 
> 
> On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse <[email protected]> 
> wrote:
> Hi Eliot
> 
> do you have a description of the new object format you want to introduce?
> 
> The design is in the class comment of CogMemoryManager in the Cog VMMaker 
> packages.
>  
> Then what is your schedule?
> 
> This is difficult. I have made a small start and should be able to spend time 
> on it starting soon.  I want to have it finished by early next year.  But it 
> depends on work schedules etc.
> 
>  
> I would like to see if we can allocate Igor's/Esteban's time to help on that 
> important topic before we run out of money.
> Right now the solution is unclear and I did not see any document where we can 
> evaluate and plan how we can help. So do you want help on that topic? If so, 
> how can people contribute?
> 
> The first thing to do is to read the design document, to see if the Pharo 
> community thinks it is the right direction, and to review it, spot 
> deficiencies etc.  So please get those interested to read the class comment 
> of CogMemoryManager in the latest VMMaker.oscog.
> 
> Here's the current version of it:
> 
> CogMemoryManager is currently a place-holder for the design of the new Cog 
> VM's object representation and garbage collector.  The goals for the GC are
> 
> - efficient object representation a la Eliot Miranda's VisualWorks 64-bit 
> object representation that uses a 64-bit header, eliminating direct class 
> references so that all objects refer to their classes indirectly.  Instead 
> the header contains a constant class index, in a field smaller than a full 
> pointer, These class indices are used in inline and first-level method 
> caches, hence they do not have to be updated on GC (although they do have to 
> be traced to be able to GC classes).  Classes are held in a sparse weak 
> table.  The class table needs only to be indexed by an instance's class index 
> in class hierarchy search, in the class primitive, and in tracing live 
> objects in the heap.  The additional header space is allocated to a much 
> expanded identity hash field, reducing hash efficiency problems in identity 
> collections due to the extremely small (11 bit) hash field in the old Squeak 
> GC.  The identity hash field is also a key element of the class index scheme. 
>  A class's identity hash is its index into the class table, so to create an 
> instance of a class one merely copies its identity hash into the class index 
> field of the new instance.  This implies that when classes gain their 
> identity hash they are entered into the class table and their identity hash 
> is that of a previously unused index in the table.  It also implies that 
> there is a maximum number of classes in the table.  At least for a few years 
> 64k classes should be enough.  A class is entered into the class table in the 
> following operations:
>       behaviorHash
>       adoptInstance
>       instantiate
>       become  (i.e. if an old class becomes a new class)
>               if the target's class field = the original's id hash
>                  and replacement's id hash is zero
>                       enter replacement in class table
> behaviorHash is a special version of identityHash that must be implemented in 
> the image by any object that can function as a class (i.e. Behavior).
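> The instance-creation side of this scheme can be sketched in C; this is an 
> illustrative model, not Cog's actual layout (the field widths, the names and 
> the strong class table here are assumptions):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 64-bit header: a class index plus a much wider identity
   hash than old Squeak's 11 bits.  Field widths are assumptions. */
typedef struct {
    uint32_t classIndex;    /* constant index into the class table */
    uint32_t identityHash;  /* for a class, its index in the class table */
} ObjHeader;

#define CLASS_TABLE_SIZE 65536             /* "64k classes should be enough" */
static void *classTable[CLASS_TABLE_SIZE]; /* weak in the VM; strong here */

/* A class's identity hash is its class-table index, so instantiation
   merely copies that hash into the new instance's class-index field. */
static ObjHeader instantiate(const ObjHeader *aClass, uint32_t instHash) {
    ObjHeader inst;
    inst.classIndex = aClass->identityHash;
    inst.identityHash = instHash;
    return inst;
}
```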
> 
> - more immediate classes.  An immediate Character class would speed up String 
> accessing, especially for WideString, since no instantiation needs to be done 
> on at:put: and no dereference need be done on at:.  In a 32-bit system tag 
> checking is complex since it is thought important to retain 31-bit 
> SmallIntegers.  Hence, as in current Squeak, the least significant bit set 
> implies a SmallInteger, but Characters would likely have a tag pattern of 
> xxx10.  Hence masking with 11 results in two values for SmallInteger, xxx01 
> and xxx11.  30-bit characters are more than adequate for Unicode.  In a 
> 64-bit system we can use the full three bits and usefully implement an 
> immediate Float.  As in VisualWorks a functional representation takes three 
> bits away from the exponent.  Rotating to put the sign bit in the least 
> significant non-tag bit makes expanding and contracting the 8-bit exponent to 
> the 11-bit IEEE double exponent easy and makes comparing negative and positive 
> zero easier (an immediate Float is zero if its unsigned 64-bits are < 16).  
> So the representation looks like
>       | 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
> For details see "60-bit immediate Floats" below.
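> The 32-bit tag checks described above can be sketched in C (a hedged 
> illustration: the tag patterns follow the text, but the helper names are 
> mine):

```c
#include <assert.h>
#include <stdint.h>

/* lsb set => SmallInteger (so masking with binary 11 gives xxx01 or xxx11);
   tag pattern xx10 => immediate Character. */
static int isSmallInteger(uint32_t oop) { return (oop & 1) != 0; }
static int isCharacter(uint32_t oop)    { return (oop & 3) == 2; }

/* 31-bit SmallIntegers are retained: value above the single tag bit. */
static uint32_t tagSmallInteger(int32_t v)    { return ((uint32_t)v << 1) | 1; }
static int32_t  untagSmallInteger(uint32_t o) { return (int32_t)o >> 1; }

/* 30-bit Characters (more than adequate for Unicode): code point above
   the two tag bits. */
static uint32_t tagCharacter(uint32_t code)   { return (code << 2) | 2; }
static uint32_t untagCharacter(uint32_t oop)  { return oop >> 2; }
```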
> 
> 
> - efficient scavenging.  The current Squeak GC uses a slow pointer-reversal 
> collector that writes every field in live objects three times in each 
> collection, twice in the pointer-reversing heap traversal to mark live 
> objects and once to update the pointer to its new location.  A scavenger 
> writes every field of live data twice in each collection, once as it does a 
> block copy of the object when copying to to space, once as it traverses the 
> live pointers in the to space objects.  Of course the block copy is a 
> relatively cheap write.
> 
> - lazy become.  The JIT's use of inline caching provides a cheap way of 
> avoiding scanning the heap as part of a become (which is the simple approach 
> to implementing become in a system with direct pointers).  A becomeForward: 
> on a (set of) non-zero-sized object(s) turns the object into a "corpse" or 
> "forwarding object" whose first (non-header) word/slot is replaced by a 
> pointer to the target of the becomeForward:.  The corpse's class index is set 
> to one that identifies corpses and, because it is a hidden class index, will 
> always fail an inline cache test.  The inline cache failure code is then 
> responsible for following the forwarding pointer chain (these are Iliffe 
> vectors :) ) and resolving to the actual target.  We have yet to determine 
> exactly how this is done (e.g. change the receiver register and/or stack 
> contents and retry the send, perhaps scanning the current activation).  See 
> below on how we deal with becomes on objects with named inst vars.  Note that 
> we probably don't have to worry about zero-sized objects.  These are unlikely 
> to be passed through the FFI (there is nothing to pass :) ) and so will 
> rarely be becommed.  If they do, they can become slowly.  Alternatively we 
> can insist that objects are at least 16 bytes in size (see 8-byte alignment 
> below) so that there will always be space for a forwarding pointer.  Since 
> none of the immediate classes can have non-immediate instances and since we 
> allocate the immediate classes indices corresponding to their tag pattern 
> (SmallInteger = 1, Character = 3, SmallFloat = 4?) we can use all the class 
> indices from 0 to 7 for special uses, 0 = forward, and e.g. 1 = header-sized 
> filler.
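> A corpse-following sketch in C (illustrative only: the struct layout and 
> names are assumptions; in the real VM this runs from the inline cache miss 
> code):

```c
#include <assert.h>
#include <stdint.h>

/* Class index 0 identifies a corpse whose first (non-header) slot has been
   replaced by a forwarding pointer, as described above. */
enum { ForwardedClassIndex = 0 };

typedef struct Obj {
    uint32_t classIndex;    /* 0 => this object is a corpse */
    struct Obj *firstSlot;  /* forwarding pointer when forwarded */
} Obj;

/* A corpse's hidden class index always fails the inline cache test; the
   miss code then follows the forwarding chain to the actual target. */
static Obj *followForwarding(Obj *obj) {
    while (obj->classIndex == ForwardedClassIndex)
        obj = obj->firstSlot;
    return obj;
}
```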
> 
> - pinning.  To support a robust and easy-to-use FFI the memory manager must 
> support temporary pinning where individual objects can be prevented from 
> being moved by the GC for as long as required, either by being one of an 
> in-progress FFI call's arguments, or by having pinning asserted by a 
> primitive (allowing objects to be passed to external code that retains a 
> reference to the object after returning).  Pinning probably implies a 
> per-object "is-pinned" bit in the object header.  Pinning will be done via 
> lazy become; i.e. an object in new space will be becommed into a pinned 
> object in old space.  We will only support pinning in old space.
> 
> - efficient old space collection.  An incremental collector (a la Dijkstra's 
> three colour algorithm) collects old space, e.g. via an amount of tracing 
> being hung off scavenges and/or old space allocations at an adaptive rate 
> that keeps full garbage collections to a minimum.  (see free space/free list 
> below)
> 
> - 8-byte alignment.  It is advantageous for the FFI, for floating-point 
> access, for object movement and for 32/64-bit compatibility to keep object 
> sizes in units of 8 bytes.  For the FFI, 8-byte alignment means passing 
> objects to code that expects that requirement (such as modern x86 numeric 
> processing instructions).  This implies that
>       - the starts of all spaces are aligned on 8-byte boundaries
>       - object allocation rounds up the requested size to a multiple of 8 
> bytes
>       - the overflow size field is also 8 bytes
> We shall probably keep the minimum object size at 16 bytes so that there is 
> always room for a forwarding pointer.  But this implies that we will need to 
> implement an 8-byte filler to fill holes between objects > 16 bytes whose 
> length mod 16 bytes is 8 bytes and following pinned objects.  We can do this 
> using a special class index, e.g. 1, so that the method that answers the size 
> of an object looks like, e.g.
>       chunkSizeOf: oop
>               <var: #oop type: #'object *'>
>               ^oop classIndex = 1
>                       ifTrue: [BaseHeaderSize]
>                       ifFalse: [BaseHeaderSize
>                                 + (oop slotSize = OverflowSlotSize
>                                       ifTrue: [OverflowSizeBytes]
>                                       ifFalse: [0])
>                                 + (oop slotSize * BytesPerSlot)]
> 
>       chunkStartOf: oop
>               <var: #oop type: #'object *'>
>               ^(self cCoerceSimple: oop to: #'char *')
>                       - ((oop classIndex = 1
>                           or: [oop slotSize ~= OverflowSlotSize])
>                               ifTrue: [0]
>                               ifFalse: [OverflowSizeBytes])
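> The rounding and minimum-size rules can be sketched in C (the constants are 
> illustrative stand-ins, not the VM's actual values):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants: an 8-byte base header and a 16-byte minimum
   object so a forwarding pointer always fits. */
enum { BaseHeaderSize = 8, MinObjectBytes = 16 };

/* Object allocation rounds the requested size up to a multiple of 8. */
static uint64_t roundUpTo8(uint64_t nBytes) {
    return (nBytes + 7) & ~(uint64_t)7;
}

static uint64_t allocationBytes(uint64_t requestedBodyBytes) {
    uint64_t n = BaseHeaderSize + roundUpTo8(requestedBodyBytes);
    return n < MinObjectBytes ? MinObjectBytes : n;
}
```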
> 
> For the moment we do not tackle the issue of heap growth and shrinkage with 
> the ability to allocate and deallocate heap segments via memory-mapping.  
> This technique allows space to be released back to the OS by unmapping empty 
> segments.  We may revisit this but it is not a key requirement for the first 
> implementation.
> 
> The basic approach is to use a fixed size new space and a growable old space. 
>  The new space is a classic three-space nursery a la Ungar's Generation 
> Scavenging, a large eden for new objects and two smaller survivor spaces that 
> exchange roles on each collection, one being the to space to which surviving 
> objects are copied, the other being the from space of the survivors of the 
> previous collection, i.e. the previous to space.
> 
> To provide apparent pinning in new space we rely on lazy become.  Since most 
> pinned objects will be byte data and these do not require stack zone 
> activation scanning, the overhead is simply an old space allocation and 
> corpsing.
> 
> To provide pinning in old space, large objects are implicitly pinned (because 
> it is expensive to move large objects and, because they are both large and 
> relatively rare, they contribute little to overall fragmentation - as in 
> aggregates, small objects can be used to fill in the spaces between large 
> objects).  Hence, objects above a particular size are automatically allocated 
> in old space, rather than new space.  Small objects are pinned as per objects 
> in new space, by asserting the pin bit, which will be set automatically when 
> allocating a large object.  As a last resort, or by programmer control (the 
> fullGC primitive) old space is collected via mark-sweep (mark-compact) and so 
> the mark phase must build the list of pinned objects around which the 
> sweep/compact phase must carefully step.
> 
> Free space in old space is organized by a free list/free tree as in Eliot's 
> VisualWorks 5i old space allocator.  There are 64 free lists, indices 1 
> through 63 holding blocks of space of that size, index 0 holding a 
> semi-balanced ordered tree of free blocks, each node being the head of the 
> list of free blocks of that size.  At the start of the mark phase the free 
> list is thrown away and the sweep phase coalesces free space and steps over 
> pinned objects as it proceeds.  We can reuse the forwarding pointer 
> compaction scheme used in the old collector.  Incremental collections merely 
> move unmarked objects to the free lists (as well as nilling weak pointers in 
> weak arrays and scheduling them for finalization).  The occupancy of the free 
> lists is represented by a bitmap in a 64-bit integer so that an allocation of 
> size 63 or less can know whether there exists a free chunk of that size, but 
> more importantly can know whether a free chunk larger than it exists in the 
> fixed size free lists without having to search all larger free list heads.
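> The occupancy-bitmap trick can be sketched in C; note that 
> `__builtin_ctzll` is a GCC/Clang intrinsic and the helper names are mine:

```c
#include <assert.h>
#include <stdint.h>

/* Bit i set => free list i is non-empty.  Index 0 holds the tree of
   larger blocks, so only bits 1..63 are used here. */
static uint64_t occupancy;

static void markNonEmpty(unsigned i) { occupancy |= (uint64_t)1 << i; }
static void markEmpty(unsigned i)    { occupancy &= ~((uint64_t)1 << i); }

/* For an allocation of n slots (1..63), answer the smallest non-empty
   list holding chunks of at least n slots, without scanning every list
   head; 0 means no fixed-size list fits and the tree must be searched. */
static unsigned smallestListAtLeast(unsigned n) {
    uint64_t candidates = occupancy & ~(((uint64_t)1 << n) - 1);
    if (candidates == 0)
        return 0;
    return (unsigned)__builtin_ctzll(candidates); /* lowest set bit */
}
```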
> 
> The incremental collector (a la Dijkstra's three colour algorithm) collects 
> old space via an amount of tracing being hung off scavenges and/or old space 
> allocations at an adaptive rate that keeps full garbage collections to a 
> minimum.  [N.B. Not sure how to do this yet.  The incremental collector needs 
> to complete a pass often enough to reclaim objects, but infrequent enough not 
> to waste time.  So some form of feedback should work.  In VisualWorks tracing 
> is broken into quanta of work, where image-level code determines the size of a 
> quantum based on how fast the machine is, and how big the heap is.  This code 
> could easily live in the VM, controllable through vmParameterAt:put:.  An 
> alternative would be to use the heartbeat to bound quanta by time.  But in 
> any case some amount of incremental collection would be done on old space 
> allocation and scavenging, the amount being chosen to keep pause times 
> acceptably short, and at a rate to reclaim old space before a full GC is 
> required, i.e. at a rate proportional to the growth in old space]. The 
> incremental collector is a state machine, being either marking, nilling weak 
> pointers, or freeing.  If nilling weak pointers is not done atomically then 
> there must be a read barrier in weak array at: so that reading from an old 
> space weak array that is still holding stale un-nilled references to unmarked 
> objects answers nil.  Tricks such as including the weak bit in bounds calculations can 
> make this cheap for non-weak arrays.  Alternatively nilling weak pointers can 
> be made an atomic part of incremental collection, which can be made cheaper 
> by maintaining the set of weak arrays (e.g. on a list).
> 
> The incremental collector implies a more complex write barrier.  Objects are 
> of three colours, black, having been scanned, grey, being scanned, and white, 
> unreached.  A mark stack holds the grey objects.   If the incremental 
> collector is marking and an unmarked white object is stored into a black 
> object then the stored object must become grey, being added to the mark 
> stack.  So the write barrier is essentially
>       target isYoung ifFalse:
>               [newValue isYoung
>                       ifTrue: [target isInRememberedSet ifFalse:
>                                       [target addToRememberedSet]]
>                               "target now refers to a young object;
>                                it is a root for scavenges"
>                       ifFalse:
>                               [(target isBlack
>                                 and: [igc marking
>                                 and: [newValue isWhite]]) ifTrue:
>                                       [newValue beGrey]]]
>                               "add newValue to the IGC's markStack
>                                for subsequent scanning"
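> The same barrier in C, for comparison (an illustrative model: the flag 
> fields and the remembered-set/mark-stack operations here are stand-ins):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative object carrying just the flags the barrier consults. */
typedef struct {
    bool young, inRememberedSet, black, white, grey;
} Obj;

static bool igcMarking;  /* is the incremental collector marking? */

/* On "target->slot := newValue": remember old->young stores for the
   scavenger, and grey a white object stored into a black one while the
   incremental collector is marking (Dijkstra-style barrier). */
static void writeBarrier(Obj *target, Obj *newValue) {
    if (target->young)
        return;
    if (newValue->young) {
        if (!target->inRememberedSet)
            target->inRememberedSet = true;  /* now a root for scavenges */
    } else if (target->black && igcMarking && newValue->white) {
        newValue->white = false;             /* beGrey: add to mark stack */
        newValue->grey = true;
    }
}
```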
> 
> The incremental collector does not detect already marked objects all of whose 
> references have been overwritten by other stores (e.g. in the above if 
> newValue overwrites the sole remaining reference to a marked object).  So the 
> incremental collector only guarantees to collect all garbage created in cycle 
> N at the end of cycle N + 1.  The cost is hence slightly worse memory density 
> but the benefit, provided the IGC works hard enough, is the elimination of 
> long pauses due to full garbage collections, which become actions of last 
> resort or programmer desire.
> 
> Lazy become.
> 
> As described earlier the basic idea behind lazy become is to use corpses 
> (forwarding objects) that are followed lazily during GC and inline cache 
> miss.  However, a lazy scheme cannot be used on objects with named inst vars 
> without adding checking to all inst var accesses, which we judge too 
> expensive.  Instead, when becomming objects with named inst vars, we scan all 
> activations in the stack zone, eagerly becomming these references, and we 
> check for corpses when faulting in a context into the stack zone.  
> Essentially, the invariant is that there are no references to corpses from 
> the receiver slots of stack activations.  A detail is whether we allow or 
> forbid pinning of closure indirection vectors, or scan the entire stack of 
> each activation.  Using a special class index pun for indirection vectors is 
> a cheap way of preventing their becomming/pinning etc.  Although "don't do 
> that" (don't attempt to pin/become indirection vectors) is also an acceptable 
> response.
> 
> 60-bit immediate Floats
> Representation for immediate doubles, only used in the 64-bit implementation. 
> Immediate doubles have the same 52 bit mantissa as IEEE double-precision  
> floating-point, but only have 8 bits of exponent.  So they occupy just less 
> than the middle 1/8th of the double range.  They overlap the normal 
> single-precision floats which also have 8 bit exponents, but exclude the 
> single-precision denormals (exponent -127) and the single-precision NaNs 
> (exponent +127).  +/- zero is just a pair of values with both exponent and 
> mantissa 0. 
> So the non-zero immediate doubles range from 
>         +/- 0x3800,0000,0000,0001 / 5.8774717541114d-39 
> to      +/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38 
> The encoded tagged form has the sign bit moved to the least significant bit, 
> which allows for faster encode/decode because offsetting the exponent can't 
> overflow into the sign bit and because testing for +/- 0 is an unsigned 
> compare for <= 0xf: 
>     msb                                                              lsb
>     [8 exponent subset bits][52 mantissa bits][1 sign bit][3 tag bits]
> So assuming the tag is 5, the tagged non-zero bit patterns are 
>              0x0000,0000,0000,001[d/5] 
> to           0xffff,ffff,ffff,fff[d/5] 
> and +/- 0d is 0x0000,0000,0000,000[d/5] 
> Encode/decode of non-zero values in machine code looks like: 
>                               msb                                  lsb
> Decode:                       [8expsubset][52mantissa][1s][3tags]
> shift away tags:              [ 000 ][8expsubset][52mantissa][1s]
> add exponent offset:          [     11 exponent     ][52mantissa][1s]
> rot sign:                     [1s][     11 exponent     ][52mantissa]
> 
> Encode:                       [1s][     11 exponent     ][52mantissa]
> rot sign:                     [     11 exponent     ][52mantissa][1s]
> sub exponent offset:          [ 000 ][8expsubset][52mantissa][1s]
> shift:                        [8expsubset][52mantissa][1s][ 000 ]
> or/add tags:                  [8expsubset][52mantissa][1s][3tags]
> but is slower in C because 
> a) there is no rotate, and 
> b) raw conversion between double and quadword must (at least in the source) 
> move bits through memory ( quadword = *(q64 *)&doubleVariable). 
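> A C sketch of the encode/decode above, assuming tag 5 and an exponent 
> offset of 0x380 (= 1023 - 127, the 11-bit bias minus the 8-bit bias); 
> range checking before encoding is omitted, and the names are mine:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FloatTag = 5, TagBits = 3 };
/* 11-bit bias minus 8-bit bias, positioned above the 52+1 low bits. */
#define EXP_OFFSET ((uint64_t)0x380 << 53)

/* Tagged form: [8 exponent subset][52 mantissa][1 sign][3 tags]. */
static uint64_t decodeFloatBits(uint64_t oop) {
    uint64_t bits = oop >> TagBits;        /* shift away tags */
    if (bits > 1)                          /* +/-0 need no widening */
        bits += EXP_OFFSET;                /* add exponent offset */
    return (bits >> 1) | (bits << 63);     /* rotate sign from lsb to msb */
}

static uint64_t encodeFloatBits(uint64_t raw) {
    uint64_t rot = (raw << 1) | (raw >> 63); /* rotate sign to lsb */
    if (rot > 1)
        rot -= EXP_OFFSET;                 /* subtract exponent offset */
    return (rot << TagBits) | FloatTag;    /* shift and add tags */
}

static double decodeFloat(uint64_t oop) {
    uint64_t raw = decodeFloatBits(oop);
    double d;
    memcpy(&d, &raw, sizeof d);            /* bits through memory, as noted */
    return d;
}

static uint64_t encodeFloat(double d) {
    uint64_t raw;
    memcpy(&raw, &d, sizeof raw);
    return encodeFloatBits(raw);
}
```

> In C the sign rotation compiles to a shift-or pair since there is no 
> rotate operator, matching the "slower in C" observation above.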
> 
> Issues:
> How do we avoid the Size4Bit for 64-bits?  The format word encodes the number 
> of odd bytes, but currently has only 4 bits and hence only supports odd bytes 
> of 0 - 3.  We need odd bytes of 0 - 7.  But I don't like the separate 
> Size4Bit.  Best to change the VI code and have a 5 bit format?  We lose one 
> bit but save two bits (isEphemeron and isWeak (or three, if isPointers)) for 
> a net gain of one (or two)
> 
> Further, keep Squeak's format idea or go for separate bits?  For 64-bits we 
> need a 5 bit format field.  This contrasts with isPointers, isWeak, 
> isEphemeron, 3 odd size bits (or byte size).  The format field is quite 
> economical.
> 
> Are class indices in inline caches strong references to classes or weak 
> references?
> If strong then they must be scanned during GC and the methodZone must be 
> flushed on fullGC to reclaim all classes (this looks to be a bug in the V3 
> Cogit).
> If weak then when the class table loses references, PICs containing freed 
> classes must be freed, and then sends to freed PICs or sends whose caches 
> contain freed classes must be unlinked.
> The second approach is faster: the common case is scanning the class table, 
> the uncommon case is freeing classes.  It is also better: in-line caches do 
> not prevent reclamation of classes.
> 
> 
> Stef
> 
> 
> 
> -- 
> best,
> Eliot
> 
