I would like to be sure that we can have
- bit for immutable objects
- bits for experimenting.
Stef
On May 30, 2012, at 8:48 AM, Stéphane Ducasse wrote:
> Hi guys
>
> Here is an important topic I would like to see discussed so that we see how
> we can improve and join forces.
> May a mail discussion then a wiki for the summary would be good.
>
>
> stef
>
>
>
> Begin forwarded message:
>
>> From: Eliot Miranda <[email protected]>
>> Subject: Re: Plan/discussion/communication around new object format
>> Date: May 27, 2012 10:49:54 PM GMT+02:00
>> To: Stéphane Ducasse <[email protected]>
>>
>>
>>
>> On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse
>> <[email protected]> wrote:
>> Hi eliot
>>
>> do you have a description of the new object format you want to introduce?
>>
>> The design is in the class comment of CogMemoryManager in the Cog VMMaker
>> packages.
>>
>> Then what is your schedule?
>>
>> This is difficult. I have made a small start and should be able to spend
>> time on it starting soon. I want to have it finished by early next year.
>> But it depends on work schedules etc.
>>
>>
>> I would like to see if we can allocate igor/esteban time before we run out
>> of money
>> to help on that important topic.
>> Now the solution is unclear and I did not see any document where we can
>> evaluate
>> and plan how we can help. So do you want help on that topic? Then how can
>> people
>> contribute if any?
>>
>> The first thing to do is to read the design document, to see if the Pharo
>> community thinks it is the right direction, and to review it, spot
>> deficiencies etc. So please get those interested to read the class comment
>> of CogMemoryManager in the latest VMMaker.oscog.
>>
>> Here's the current version of it:
>>
>> CogMemoryManager is currently a place-holder for the design of the new Cog
>> VM's object representation and garbage collector. The goals for the GC are
>>
>> - efficient object representation a la Eliot Miranda's VisualWorks 64-bit
>> object representation that uses a 64-bit header, eliminating direct class
>> references so that all objects refer to their classes indirectly. Instead
>> the header contains a constant class index, in a field smaller than a full
>> pointer, These class indices are used in inline and first-level method
>> caches, hence they do not have to be updated on GC (although they do have to
>> be traced to be able to GC classes). Classes are held in a sparse weak
>> table. The class table needs only to be indexed by an instance's class
>> index in class hierarchy search, in the class primitive, and in tracing live
>> objects in the heap. The additional header space is allocated to a much
>> expanded identity hash field, reducing hash efficiency problems in identity
>> collections due to the extremely small (11 bit) hash field in the old Squeak
>> GC. The identity hash field is also a key element of the class index
>> scheme. A class's identity hash is its index into the class table, so to
>> create an instance of a class one merely copies its identity hash into the
>> class index field of the new instance. This implies that when classes gain
>> their identity hash they are entered into the class table and their identity
>> hash is that of a previously unused index in the table. It also implies
>> that there is a maximum number of classes in the table. At least for a few
>> years 64k classes should be enough. A class is entered into the class table
>> in the following operations:
>> behaviorHash
>> adoptInstance
>> instantiate
>> become (i.e. if an old class becomes a new class)
>> if target class field's = to original's id hash
>> and replacement's id hash is zero
>> enter replacement in class table
>> behaviorHash is a special version of identityHash that must be implemented
>> in the image by any object that can function as a class (i.e. Behavior).
>>
>> - more immediate classes. An immediate Character class would speed up
>> String accessing, especially for WideString, since no instatiation needs to
>> be done on at:put: and no dereference need be done on at:. In a 32-bit
>> system tag checking is complex since it is thought important to retain
>> 31-bit SmallIntegers. Hence, as in current Squeak, the least significant
>> bit set implies a SmallInteger, but Characters would likely have a tag
>> pattern of xxx10. Hence masking with 11 results in two values for
>> SmallInteger, xxx01 and xxx11. 30-bit characters are more than adequate for
>> Unicode. In a 64-bit system we can use the full three bits and usefully
>> implement an immediate Float. As in VisualWorks a functional representation
>> takes three bits away from the exponent. Rotating to put the sign bit in
>> the least significant non-tag bit makes expanding and contracting the 8-bit
>> exponent to the 11-bit IEEE double exponent easy ad makes comparing negative
>> and positive zero easier (an immediate Float is zero if its unsigned 64-bits
>> are < 16). So the representation looks like
>> | 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |
>> For details see "60-bit immediate Floats" below.
>>
>>
>> - efficient scavenging. The current Squeak GC uses a slow pointer-reversal
>> collector that writes every field in live objects three times in each
>> collection, twice in the pointer-reversing heap traversal to mark live
>> objects and once to update the pointer to its new location. A scavenger
>> writes every field of live data twice in each collection, once as it does a
>> block copy of the object when copying to to space, once as it traverses the
>> live pointers in the to space objects. Of course the block copy is a
>> relatively cheap write.
>>
>> - lazy become. The JIT's use of inline cacheing provides a cheap way of
>> avoiding scanning the heap as part of a become (which is the simple approach
>> to implementing become in a system with direct pointers). A becomeForward:
>> on a (set of) non-zero-sized object(s) turns the object into a "corpse" or
>> "forwarding object" whose first (non-header) word/slot is replaced by a
>> pointer to the target of the becomeForward:. The corpse's class index is
>> set to one that identifies corpses and, because it is a hidden class index,
>> will always fail an inline cache test. The inline cache failure code is
>> then responsible for following the forwarding pointer chain (these are
>> Iliffe vectors :) ) and resolving to the actual target. We have yet to
>> determine exactly how this is done (e.g. change the receiver register and/or
>> stack contents and retry the send, perhaps scanning the current activation).
>> See below on how we deal with becomes on objects with named inst vars.
>> Note that we probably don't have to worry about zero-sized objects. These
>> are unlikely to be passed through the FFI (there is nothing to pass :) ) and
>> so will rarely be becommed. If they do, they can become slowly.
>> Alternatively we can insist that objects are at least 16 bytes in size (see
>> a8-byte alignment below) so that there will always be space for a forwarding
>> pointer. Since none of the immediate classes can have non-immediate
>> instances and since we allocate the immediate classes indices corresponding
>> to their tag pattern (SmallInteger = 1, Character = 3, SmallFloat = 4?) we
>> can use all the class indices from 0 to 7 for special uses, 0 = forward, and
>> e.g. 1 = header-sized filler.
>>
>> - pinning. To support a robust and easy-to-use FFI the memory manager must
>> support temporary pinning where individual objects can be prevented from
>> being moved by the GC for as long as required, either by being one of an
>> in-progress FFI call's arguments, or by having pinning asserted by a
>> primitive (allowing objects to be passed to external code that retains a
>> reference to the object after returning). Pinning probably implies a
>> per-object "is-pinned" bit in the object header. Pinning will be done via
>> lazy become; i..e an object in new space will be becommed into a pinned
>> object in old space. We will only support pinning in old space.
>>
>> - efficient old space collection. An incremental collector (a la Dijkstra's
>> three colour algorithm) collects old space, e.g. via an amount of tracing
>> being hung off scavenges and/or old space allocations at an adaptive rate
>> that keeps full garbage collections to a minimum. (see free space/free list
>> below)
>>
>> - 8-byte alignment. It is advantageous for the FFI, for floating-point
>> access, for object movement and for 32/64-bit compatibility to keep object
>> sizes in units of 8 bytes. For the FFI, 8-byte alignment means passing
>> objects to code that expects that requirement (such as modern x86 numeric
>> processing instructions). This implies that
>> - the starts of all spaces are aligned on 8-byte boundaries
>> - object allocation rounds up the requested size to a multiple of 8
>> bytes
>> - the overflow size field is also 8 bytes
>> We shall probably keep the minimum object size at 16 bytes so that there is
>> always room for a forwarding pointer. But this implies that we will need to
>> implement an 8-byte filler to fill holes between objects > 16 bytes whose
>> length mod 16 bytes is 8 bytes and following pinned objects. We can do this
>> using a special class index, e.g. 1, so that the method that answers the
>> size of an object looks like, e.g.
>> chunkSizeOf: oop
>> <var: #oop type: #'object *'>
>> ^object classIndex = 1
>> ifTrue: [BaseHeaderSize]
>> ifFalse: [BaseHeaderSize
>> + (object slotSize = OverflowSlotSize
>> ifTrue: [OverflowSizeBytes]
>> ifFalse: [0])
>> + (object slotSize * BytesPerSlot)]
>>
>> chunkStartOf: oop
>> <var: #oop type: #'object *'>
>> ^(self cCoerceSimple: oop to: #'char *')
>> - ((object classIndex = 1
>> or: [object slotSize ~= OverflowSlotSize])
>> ifTrue: [0]
>> ifFalse: [OverflowSizeBytes])
>>
>> For the moment we do not tackle the issue of heap growth and shrinkage with
>> the ability to allocate and deallocate heap segments via memory-mapping.
>> This technique allows space to be released back to the OS by unmapping empty
>> segments. We may revisit this but it is not a key requirement for the first
>> implementation.
>>
>> The basic approach is to use a fixed size new space and a growable old
>> space. The new space is a classic three-space nursery a la Ungar's
>> Generation Scavenging, a large eden for new objects and two smaller survivor
>> spaces that exchange roles on each collection, one being the to space to
>> which surviving objects are copied, the other being the from space of the
>> survivors of the previous collection, i.e. the previous to space.
>>
>> To provide apparent pinning in new space we rely on lazy become. Since most
>> pinned objects will be byte data and these do not require stack zone
>> activation scanning, the overhead is simply an old space allocation and
>> corpsing.
>>
>> To provide pinning in old space, large objects are implicitly pinned
>> (because it is expensive to move large objects and, because they are both
>> large and relatively rare, they contribute little to overall fragmentation -
>> as in aggregates, small objects can be used to fill-in the spaces between
>> karge objects). Hence, objects above a particular size are automatically
>> allocated in old space, rather than new space. Small objects are pinned as
>> per objects in new space, by asserting the pin bit, which will be set
>> automaticaly when allocating a large object. As a last resort, or by
>> programmer control (the fullGC primitive) old space is collected via
>> mark-sweep (mark-compact) and so the mark phase must build the list of
>> pinned objects around which the sweep/compact phase must carefully step.
>>
>> Free space in old space is organized by a free list/free tree as in Eliot's
>> VisualWorks 5i old space allocator. There are 64 free lists, indices 1
>> through 63 holding blocks of space of that size, index 0 holding a
>> semi-balanced ordered tree of free blocks, each node being the head of the
>> list of free blocks of that size. At the start of the mark phase the free
>> list is thrown away and the sweep phase coallesces free space and steps over
>> pinned objects as it proceeds. We can reuse the forwarding pointer
>> compaction scheme used in the old collector. Incremental collections merely
>> move unmarked objects to the free lists (as well as nilling weak pointers in
>> weak arrays and scheduling them for finalization). The occupancy of the
>> free lists is represented by a bitmap in a 64-bit integer so that an
>> allocation of size 63 or less can know whether there exists a free chunk of
>> that size, but more importantly can know whether a free chunk larger than it
>> exists in the fixed size free lists without having to search all larger free
>> list heads.
>>
>> The incremental collector (a la Dijkstra's three colour algorithm) collects
>> old space via an amount of tracing being hung off scavenges and/or old space
>> allocations at an adaptive rate that keeps full garbage collections to a
>> minimum. [N.B. Not sure how to do this yet. The incremental collector
>> needs to complete a pass often enough to reclaim objects, but infrequent
>> enough not to waste time. So some form of feedback should work. In
>> VisualWorks tracing is broken into quanta or work where image-level code
>> determines the size of a quantum based on how fast the machine is, and how
>> big the heap is. This code could easily live in the VM, controllable
>> through vmParameterAt:put:. An alternative would be to use the heartbeat to
>> bound quanta by time. But in any case some amount of incremental collection
>> would be done on old space allocation and scavenging, the ammount being
>> chosen to keep pause times acceptably short, and at a rate to reclaim old
>> space before a full GC is required, i.e. at a rate proportional to the
>> growth in old space]. The incemental collector is a state machine, being
>> either marking, nilling weak pointers, or freeing. If nilling weak pointers
>> is not done atomically then there must be a read barrier in weak array at:
>> so that reading from an old space weak array that is holding stale un-nilled
>> references to unmarked objects. Tricks such as including the weak bit in
>> bounds calculations can make this cheap for non-weak arrays. Alternatively
>> nilling weak pointers can be made an atomic part of incremental collection,
>> which can be made cheaper by maintaining the set of weak arrays (e.g. on a
>> list).
>>
>> The incremental collector implies a more complex write barrier. Objects are
>> of three colours, black, having been scanned, grey, being scanned, and
>> white, unreached. A mark stack holds the grey objects. If the incremental
>> collector is marking and an unmarked white object is stored into a black
>> object then the stored object must become grey, being added to the mark
>> stack. So the wrte barrier is essentially
>> target isYoung ifFalse:
>> [newValue isYoung
>> ifTrue: [target isInRememberedSet ifFalse:
>> [target addToRememberedSet]] "target
>> now refers to a young object; it is a root for scavenges"
>> ifFalse:
>> [(target isBlack
>> and: [igc marking
>> and: [newValue isWhite]]) ifTrue:
>> [newValue beGrey]]] "add newValue to
>> IGC's markStack for subsequent scanning"
>>
>> The incremental collector does not detect already marked objects all of
>> whose references have been overwritten by other stores (e.g. in the above if
>> newValue overwrites the sole remaining reference to a marked object). So
>> the incremental collector only guarantees to collect all garbage created in
>> cycle N at the end of cycle N + 1. The cost is hence slightly worse memory
>> density but the benefit, provided the IGC works hard enough, is the
>> elimination of long pauses due to full garbage collections, which become
>> actions of last resort or programmer desire.
>>
>> Lazy become.
>>
>> As described earlier the basic idea behind lazy become is to use corpses
>> (forwarding objects) that are followed lazily during GC and inline cache
>> miss. However, a lazy scheme cannot be used on objects with named inst vars
>> without adding checking to all inst var accesses, which we judge too
>> expensive. Instead, when becomming objects with named inst vars, we scan
>> all activations in the stack zone, eagerly becomming these references, and
>> we check for corpses when faulting in a context into the stack zone.
>> Essentially, the invariant is that there are no references to corpses from
>> the receiver slots of stack activations. A detail is whether we allow or
>> forbid pinning of closure indirection vectors, or scan the entire stack of
>> each activation. Using a special class index pun for indirection vectors is
>> a cheap way of preventing their becomming/pinning etc. Although "don't do
>> that" (don't attempt to pin/become indirection vectors) is also an
>> acceptable response.
>>
>> 60-bit immediate Floats
>> Representation for immediate doubles, only used in the 64-bit
>> implementation. Immediate doubles have the same 52 bit mantissa as IEEE
>> double-precision floating-point, but only have 8 bits of exponent. So they
>> occupy just less than the middle 1/8th of the double range. They overlap
>> the normal single-precision floats which also have 8 bit exponents, but
>> exclude the single-precision denormals (exponent-127) and the
>> single-precsion NaNs (exponent +127). +/- zero is just a pair of values
>> with both exponent and mantissa 0.
>> So the non-zero immediate doubles range from
>> +/- 0x3800,0000,0000,0001 / 5.8774717541114d-39
>> to +/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38
>> The encoded tagged form has the sign bit moved to the least significant bit,
>> which allows for faster encode/decode because offsetting the exponent can't
>> overflow into the sign bit and because testing for +/- 0 is an unsigned
>> compare for <= 0xf:
>> msb
>> lsb
>> [8 exponent subset bits][52 mantissa bits ][1 sign bit][3 tag bits]
>> So assuming the tag is 5, the tagged non-zero bit patterns are
>> 0x0000,0000,0000,001[d/5]
>> to 0xffff,ffff,ffff,fff[d/5]
>> and +/- 0d is 0x0000,0000,0000,000[d/5]
>> Encode/decode of non-zero values in machine code looks like:
>> msb
>> lsb
>> Decode: [8expsubset][52mantissa][1s][3tags]
>> shift away tags: [ 000 ][8expsubset][52mantissa][1s]
>> add exponent offset: [ 11 exponent ][52mantissa][1s]
>> rot sign: [1s][ 11 exponent ][52mantissa]
>>
>> Encode: [1s][ 11 exponent
>> ][52mantissa]
>> rot sign: [ 11 exponent ][52mantissa][1s]
>> sub exponent offset: [ 000 ][8expsubset][52 mantissa][1s]
>> shift: [8expsubset][52 mantissa][1s][
>> 000 ]
>> or/add tags: [8expsubset][52mantissa][1s][3tags]
>> but is slower in C because
>> a) there is no rotate, and
>> b) raw conversion between double and quadword must (at least in the source)
>> move bits through memory ( quadword = *(q64 *)&doubleVariable).
>>
>> Issues:
>> How do we avoid the Size4Bit for 64-bits? The format word encodes the
>> number of odd bytes, but currently has only 4 bits and hence only supports
>> odd bytes of 0 - 3. We need odd bytes of 0 - 7. But I don't like the
>> separate Size4Bit. Best to change the VI code and have a 5 bit format? We
>> loose one bit but save two bits (isEphemeron and isWeak (or three, if
>> isPointers)) for a net gain of one (or two)
>>
>> Further, keep Squeak's format idea or go for separate bits? For 64-bits we
>> need a 5 bit format field. This contrasts with isPointers, isWeak,
>> isEphemeron, 3 odd size bits (or byte size).. format field is quite
>> economical.
>>
>> Are class indices in inline caches strong references to classes or weak
>> references?
>> If strong then they must be scanned during GC and the methodZone must be
>> flushed on fullGC to reclaim all classes (this looks to be a bug in the V3
>> Cogit).
>> If weak then when the class table loses references, PICs containing freed
>> classes must be freed and then sends to freed PICs or containing freed
>> clases must be unlinked.
>> The second approach is faster; the common case is scanning the class table,
>> the uncommon case is freeing classes. The second approach is better;
>> in-line caches do not prevent reclamation of classes.
>>
>>
>> Stef
>>
>>
>>
>> --
>> best,
>> Eliot
>>
>