communication around new object format

Michael Roberts Wed, 30 May 2012 02:10:53 -0700

Hi Stef, I have skimmed the proposal in the included class comment. It is 
beyond my knowledge really to comment specifically. Looks great! :-)


I picked up on the reference to 64k classes limit. Whilst at the moment I can't 
imagine a Pharo system needing that many classes, the VW system I am working on 
already has ~64k classes. I just thought it an interesting data point.

I don't know what happens in the scheme below if the limit is a hard one or 
not. 

For my stats I did "roots of the system, gather all classes, go" style loop and 
skimmed the result for sanity. 

This includes metaclasses.  So I don't know if the number is really half or 
not. Do metaclasses count towards this total in this scheme? 

Also our number is quite high because for tool/browser support we also have 
proxies for all GemStone classes.  Anyway I don't mention this to adjust any 
technical decisions but it was more just a real (if 50%) data point. 

Cheers,
Mike 

On 30 May 2012, at 07:48, Stéphane Ducasse <[email protected]> wrote:

> Hi guys
> 
> Here is an important topic I would like to see discussed so that we see how 
> we can improve and join forces.
> May a mail discussion then a wiki for the summary would be good.
> 
> 
> stef
> 
> 
> 
> Begin forwarded message:
> 
>> From: Eliot Miranda <[email protected]>
>> Subject: Re: Plan/discussion/communication around new object format
>> Date: May 27, 2012 10:49:54 PM GMT+02:00
>> To: Stéphane Ducasse <[email protected]>
>> 
>> 
>> 
>> On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse 
>> <[email protected]> wrote:
>> Hi eliot
>> 
>> do you have a description of the new object format you want to introduce?
>> 
>> The design is in the class comment of CogMemoryManager in the Cog VMMaker 
>> packages.
>>  
>> Then what is your schedule?
>> 
>> This is difficult. I have made a small start and should be able to spend 
>> time on it starting soon.  I want to have it finished by early next year.  
>> But it depends on work schedules etc.
>> 
>>  
>> I would like to see if we can allocate igor/esteban time before we run out 
>> of money
>> to help on that important topic.
>> Now the solution is unclear and I did not see any document where we can 
>> evaluate
>> and plan how we can help. So do you want help on that topic? Then how can 
>> people
>> contribute if any?
>> 
>> The first thing to do is to read the design document, to see if the Pharo 
>> community thinks it is the right direction, and to review it, spot 
>> deficiencies etc.  So please get those interested to read the class comment 
>> of CogMemoryManager in the latest VMMaker.oscog.
>> 
>> Here's the current version of it:
>> 
>> CogMemoryManager is currently a place-holder for the design of the new Cog 
>> VM's object representation and garbage collector.  The goals for the GC are
>> 
>> - efficient object representation a la Eliot Miranda's VisualWorks 64-bit 
>> object representation that uses a 64-bit header, eliminating direct class 
>> references so that all objects refer to their classes indirectly.  Instead 
>> the header contains a constant class index, in a field smaller than a full 
>> pointer, These class indices are used in inline and first-level method 
>> caches, hence they do not have to be updated on GC (although they do have to 
>> be traced to be able to GC classes).  Classes are held in a sparse weak 
>> table.  The class table needs only to be indexed by an instance's class 
>> index in class hierarchy search, in the class primitive, and in tracing live 
>> objects in the heap.  The additional header space is allocated to a much 
>> expanded identity hash field, reducing hash efficiency problems in identity 
>> collections due to the extremely small (11 bit) hash field in the old Squeak 
>> GC.  The identity hash field is also a key element of the class index 
>> scheme.  A class's identity hash is its index into the class table, so to 
>> create an instance of a class one merely copies its identity hash into the 
>> class index field of the new instance.  This implies that when classes gain 
>> their identity hash they are entered into the class table and their identity 
>> hash is that of a previously unused index in the table.  It also implies 
>> that there is a maximum number of classes in the table.  At least for a few 
>> years 64k classes should be enough.  A class is entered into the class table 
>> in the following operations:
>>      behaviorHash
>>      adoptInstance
>>      instantiate
>>      become  (i.e. if an old class becomes a new class)
>>              if target class field's = to original's id hash
>>                 and replacement's id hash is zero
>>                      enter replacement in class table
>> behaviorHash is a special version of identityHash that must be implemented 
>> in the image by any object that can function as a class (i.e. Behavior).
>> 
>> - more immediate classes.  An immediate Character class would speed up 
>> String accessing, especially for WideString, since no instatiation needs to 
>> be done on at:put: and no dereference need be done on at:.  In a 32-bit 
>> system tag checking is complex since it is thought important to retain 
>> 31-bit SmallIntegers.  Hence, as in current Squeak, the least significant 
>> bit set implies a SmallInteger, but Characters would likely have a tag 
>> pattern of xxx10.  Hence masking with 11 results in two values for 
>> SmallInteger, xxx01 and xxx11.  30-bit characters are more than adequate for 
>> Unicode.  In a 64-bit system we can use the full three bits and usefully 
>> implement an immediate Float.  As in VisualWorks a functional representation 
>> takes three bits away from the exponent.  Rotating to put the sign bit in 
>> the least significant non-tag bit makes expanding and contracting the 8-bit 
>> exponent to the 11-bit IEEE double exponent easy ad makes comparing negative 
>> and positive zero easier (an immediate Float is zero if its unsigned 64-bits 
>> are < 16).  So the representation looks like
>>      | 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
>> For details see "60-bit immediate Floats" below.
>> 
>> 
>> - efficient scavenging.  The current Squeak GC uses a slow pointer-reversal 
>> collector that writes every field in live objects three times in each 
>> collection, twice in the pointer-reversing heap traversal to mark live 
>> objects and once to update the pointer to its new location.  A scavenger 
>> writes every field of live data twice in each collection, once as it does a 
>> block copy of the object when copying to to space, once as it traverses the 
>> live pointers in the to space objects.  Of course the block copy is a 
>> relatively cheap write.
>> 
>> - lazy become.  The JIT's use of inline cacheing provides a cheap way of 
>> avoiding scanning the heap as part of a become (which is the simple approach 
>> to implementing become in a system with direct pointers).  A becomeForward: 
>> on a (set of) non-zero-sized object(s) turns the object into a "corpse" or 
>> "forwarding object" whose first (non-header) word/slot is replaced by a 
>> pointer to the target of the becomeForward:.  The corpse's class index is 
>> set to one that identifies corpses and, because it is a hidden class index, 
>> will always fail an inline cache test.  The inline cache failure code is 
>> then responsible for following the forwarding pointer chain (these are 
>> Iliffe vectors :) ) and resolving to the actual target.  We have yet to 
>> determine exactly how this is done (e.g. change the receiver register and/or 
>> stack contents and retry the send, perhaps scanning the current activation). 
>>  See below on how we deal with becomes on objects with named inst vars.  
>> Note that we probably don't have to worry about zero-sized objects.  These 
>> are unlikely to be passed through the FFI (there is nothing to pass :) ) and 
>> so will rarely be becommed.  If they do, they can become slowly.  
>> Alternatively we can insist that objects are at least 16 bytes in size (see 
>> a8-byte alignment below) so that there will always be space for a forwarding 
>> pointer.  Since none of the immediate classes can have non-immediate 
>> instances and since we allocate the immediate classes indices corresponding 
>> to their tag pattern (SmallInteger = 1, Character = 3, SmallFloat = 4?) we 
>> can use all the class indices from 0 to 7 for special uses, 0 = forward, and 
>> e.g. 1 = header-sized filler.
>> 
>> - pinning.  To support a robust and easy-to-use FFI the memory manager must 
>> support temporary pinning where individual objects can be prevented from 
>> being moved by the GC for as long as required, either by being one of an 
>> in-progress FFI call's arguments, or by having pinning asserted by a 
>> primitive (allowing objects to be passed to external code that retains a 
>> reference to the object after returning).  Pinning probably implies a 
>> per-object "is-pinned" bit in the object header.  Pinning will be done via 
>> lazy become; i..e an object in new space will be becommed into a pinned 
>> object in old space.  We will only support pinning in old space.
>> 
>> - efficient old space collection.  An incremental collector (a la Dijkstra's 
>> three colour algorithm) collects old space, e.g. via an amount of tracing 
>> being hung off scavenges and/or old space allocations at an adaptive rate 
>> that keeps full garbage collections to a minimum.  (see free space/free list 
>> below)
>> 
>> - 8-byte alignment.  It is advantageous for the FFI, for floating-point 
>> access, for object movement and for 32/64-bit compatibility to keep object 
>> sizes in units of 8 bytes.  For the FFI, 8-byte alignment means passing 
>> objects to code that expects that requirement (such as modern x86 numeric 
>> processing instructions).  This implies that
>>      - the starts of all spaces are aligned on 8-byte boundaries
>>      - object allocation rounds up the requested size to a multiple of 8 
>> bytes
>>      - the overflow size field is also 8 bytes
>> We shall probably keep the minimum object size at 16 bytes so that there is 
>> always room for a forwarding pointer.  But this implies that we will need to 
>> implement an 8-byte filler to fill holes between objects > 16 bytes whose 
>> length mod 16 bytes is 8 bytes and following pinned objects.  We can do this 
>> using a special class index, e.g. 1, so that the method that answers the 
>> size of an object looks like, e.g.
>>      chunkSizeOf: oop
>>              <var: #oop type: #'object *'>
>>              ^object classIndex = 1
>>                      ifTrue: [BaseHeaderSize]
>>                      ifFalse: [BaseHeaderSize
>>                                + (object slotSize = OverflowSlotSize
>>                                              ifTrue: [OverflowSizeBytes]
>>                                              ifFalse: [0])
>>                                + (object slotSize * BytesPerSlot)]
>> 
>>      chunkStartOf: oop
>>              <var: #oop type: #'object *'>
>>              ^(self cCoerceSimple: oop to: #'char *')
>>                      - ((object classIndex = 1
>>                          or: [object slotSize ~= OverflowSlotSize])
>>                                      ifTrue: [0]
>>                                      ifFalse: [OverflowSizeBytes])
>> 
>> For the moment we do not tackle the issue of heap growth and shrinkage with 
>> the ability to allocate and deallocate heap segments via memory-mapping.  
>> This technique allows space to be released back to the OS by unmapping empty 
>> segments.  We may revisit this but it is not a key requirement for the first 
>> implementation.
>> 
>> The basic approach is to use a fixed size new space and a growable old 
>> space.  The new space is a classic three-space nursery a la Ungar's 
>> Generation Scavenging, a large eden for new objects and two smaller survivor 
>> spaces that exchange roles on each collection, one being the to space to 
>> which surviving objects are copied, the other being the from space of the 
>> survivors of the previous collection, i.e. the previous to space.
>> 
>> To provide apparent pinning in new space we rely on lazy become.  Since most 
>> pinned objects will be byte data and these do not require stack zone 
>> activation scanning, the overhead is simply an old space allocation and 
>> corpsing.
>> 
>> To provide pinning in old space, large objects are implicitly pinned 
>> (because it is expensive to move large objects and, because they are both 
>> large and relatively rare, they contribute little to overall fragmentation - 
>> as in aggregates, small objects can be used to fill-in the spaces between 
>> karge objects).  Hence, objects above a particular size are automatically 
>> allocated in old space, rather than new space.  Small objects are pinned as 
>> per objects in new space, by asserting the pin bit, which will be set 
>> automaticaly when allocating a large object.  As a last resort, or by 
>> programmer control (the fullGC primitive) old space is collected via 
>> mark-sweep (mark-compact) and so the mark phase must build the list of 
>> pinned objects around which the sweep/compact phase must carefully step.
>> 
>> Free space in old space is organized by a free list/free tree as in Eliot's 
>> VisualWorks 5i old space allocator.  There are 64 free lists, indices 1 
>> through 63 holding blocks of space of that size, index 0 holding a 
>> semi-balanced ordered tree of free blocks, each node being the head of the 
>> list of free blocks of that size.  At the start of the mark phase the free 
>> list is thrown away and the sweep phase coallesces free space and steps over 
>> pinned objects as it proceeds.  We can reuse the forwarding pointer 
>> compaction scheme used in the old collector.  Incremental collections merely 
>> move unmarked objects to the free lists (as well as nilling weak pointers in 
>> weak arrays and scheduling them for finalization).  The occupancy of the 
>> free lists is represented by a bitmap in a 64-bit integer so that an 
>> allocation of size 63 or less can know whether there exists a free chunk of 
>> that size, but more importantly can know whether a free chunk larger than it 
>> exists in the fixed size free lists without having to search all larger free 
>> list heads.
>> 
>> The incremental collector (a la Dijkstra's three colour algorithm) collects 
>> old space via an amount of tracing being hung off scavenges and/or old space 
>> allocations at an adaptive rate that keeps full garbage collections to a 
>> minimum.  [N.B. Not sure how to do this yet.  The incremental collector 
>> needs to complete a pass often enough to reclaim objects, but infrequent 
>> enough not to waste time.  So some form of feedback should work.  In 
>> VisualWorks tracing is broken into quanta or work where image-level code 
>> determines the size of a quantum based on how fast the machine is, and how 
>> big the heap is.  This code could easily live in the VM, controllable 
>> through vmParameterAt:put:.  An alternative would be to use the heartbeat to 
>> bound quanta by time.  But in any case some amount of incremental collection 
>> would be done on old space allocation and scavenging, the ammount being 
>> chosen to keep pause times acceptably short, and at a rate to reclaim old 
>> space before a full GC is required, i.e. at a rate proportional to the 
>> growth in old space]. The incemental collector is a state machine, being 
>> either marking, nilling weak pointers, or freeing.  If nilling weak pointers 
>> is not done atomically then there must be a read barrier in weak array at: 
>> so that reading from an old space weak array that is holding stale un-nilled 
>> references to unmarked objects.  Tricks such as including the weak bit in 
>> bounds calculations can make this cheap for non-weak arrays.  Alternatively 
>> nilling weak pointers can be made an atomic part of incremental collection, 
>> which can be made cheaper by maintaining the set of weak arrays (e.g. on a 
>> list).
>> 
>> The incremental collector implies a more complex write barrier.  Objects are 
>> of three colours, black, having been scanned, grey, being scanned, and 
>> white, unreached.  A mark stack holds the grey objects.   If the incremental 
>> collector is marking and an unmarked white object is stored into a black 
>> object then the stored object must become grey, being added to the mark 
>> stack.  So the wrte barrier is essentially
>>      target isYoung ifFalse:
>>              [newValue isYoung
>>                      ifTrue: [target isInRememberedSet ifFalse:
>>                                      [target addToRememberedSet]] "target 
>> now refers to a young object; it is a root for scavenges"
>>                      ifFalse:
>>                              [(target isBlack
>>                                and: [igc marking
>>                                and: [newValue isWhite]]) ifTrue:
>>                                      [newValue beGrey]]] "add newValue to 
>> IGC's markStack for subsequent scanning"
>> 
>> The incremental collector does not detect already marked objects all of 
>> whose references have been overwritten by other stores (e.g. in the above if 
>> newValue overwrites the sole remaining reference to a marked object).  So 
>> the incremental collector only guarantees to collect all garbage created in 
>> cycle N at the end of cycle N + 1.  The cost is hence slightly worse memory 
>> density but the benefit, provided the IGC works hard enough, is the 
>> elimination of long pauses due to full garbage collections, which become 
>> actions of last resort or programmer desire.
>> 
>> Lazy become.
>> 
>> As described earlier the basic idea behind lazy become is to use corpses 
>> (forwarding objects) that are followed lazily during GC and inline cache 
>> miss.  However, a lazy scheme cannot be used on objects with named inst vars 
>> without adding checking to all inst var accesses, which we judge too 
>> expensive.  Instead, when becomming objects with named inst vars, we scan 
>> all activations in the stack zone, eagerly becomming these references, and 
>> we check for corpses when faulting in a context into the stack zone.  
>> Essentially, the invariant is that there are no references to corpses from 
>> the receiver slots of stack activations.  A detail is whether we allow or 
>> forbid pinning of closure indirection vectors, or scan the entire stack of 
>> each activation.  Using a special class index pun for indirection vectors is 
>> a cheap way of preventing their becomming/pinning etc.  Although "don't do 
>> that" (don't attempt to pin/become indirection vectors) is also an 
>> acceptable response.
>> 
>> 60-bit immediate Floats
>> Representation for immediate doubles, only used in the 64-bit 
>> implementation. Immediate doubles have the same 52 bit mantissa as IEEE 
>> double-precision  floating-point, but only have 8 bits of exponent.  So they 
>> occupy just less than the middle 1/8th of the double range.  They overlap 
>> the normal single-precision floats which also have 8 bit exponents, but 
>> exclude the single-precision denormals (exponent-127) and the 
>> single-precsion NaNs (exponent +127).  +/- zero is just a pair of values 
>> with both exponent and mantissa 0. 
>> So the non-zero immediate doubles range from 
>>         +/- 0x3800,0000,0000,0001 / 5.8774717541114d-39 
>> to      +/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38 
>> The encoded tagged form has the sign bit moved to the least significant bit, 
>> which allows for faster encode/decode because offsetting the exponent can't 
>> overflow into the sign bit and because testing for +/- 0 is an unsigned 
>> compare for <= 0xf: 
>>     msb                                                                      
>>                   lsb 
>>     [8 exponent subset bits][52 mantissa bits ][1 sign bit][3 tag bits] 
>> So assuming the tag is 5, the tagged non-zero bit patterns are 
>>              0x0000,0000,0000,001[d/5] 
>> to           0xffff,ffff,ffff,fff[d/5] 
>> and +/- 0d is 0x0000,0000,0000,000[d/5] 
>> Encode/decode of non-zero values in machine code looks like: 
>>                                              msb                             
>>                  lsb 
>> Decode:                              [8expsubset][52mantissa][1s][3tags] 
>> shift away tags:                     [ 000 ][8expsubset][52mantissa][1s] 
>> add exponent offset: [     11 exponent     ][52mantissa][1s] 
>> rot sign:                            [1s][     11 exponent     ][52mantissa]
>> 
>> Encode:                                      [1s][     11 exponent     
>> ][52mantissa] 
>> rot sign:                            [     11 exponent     ][52mantissa][1s] 
>> sub exponent offset: [ 000 ][8expsubset][52 mantissa][1s] 
>> shift:                                       [8expsubset][52 mantissa][1s][ 
>> 000 ] 
>> or/add tags:                 [8expsubset][52mantissa][1s][3tags] 
>> but is slower in C because 
>> a) there is no rotate, and 
>> b) raw conversion between double and quadword must (at least in the source) 
>> move bits through memory ( quadword = *(q64 *)&doubleVariable). 
>> 
>> Issues:
>> How do we avoid the Size4Bit for 64-bits?  The format word encodes the 
>> number of odd bytes, but currently has only 4 bits and hence only supports 
>> odd bytes of 0 - 3.  We need odd bytes of 0 - 7.  But I don't like the 
>> separate Size4Bit.  Best to change the VI code and have a 5 bit format?  We 
>> loose one bit but save two bits (isEphemeron and isWeak (or three, if 
>> isPointers)) for a net gain of one (or two)
>> 
>> Further, keep Squeak's format idea or go for separate bits?  For 64-bits we 
>> need a 5 bit format field.  This contrasts with isPointers, isWeak, 
>> isEphemeron, 3 odd size bits (or byte size)..  format field is quite 
>> economical.
>> 
>> Are class indices in inline caches strong references to classes or weak 
>> references?
>> If strong then they must be scanned during GC and the methodZone must be 
>> flushed on fullGC to reclaim all classes (this looks to be a bug in the V3 
>> Cogit).
>> If weak then when the class table loses references, PICs containing freed 
>> classes must be freed and then sends to freed PICs or containing freed 
>> clases must be unlinked.
>> The second approach is faster; the common case is scanning the class table, 
>> the uncommon case is freeing classes.  The second approach is better; 
>> in-line caches do not prevent reclamation of classes.
>> 
>> 
>> Stef
>> 
>> 
>> 
>> -- 
>> best,
>> Eliot
>> 
>

Re: [Pharo-project] Fwd: Plan/discussion/communication around new object format

Reply via email to