Hi guys,

Here is an important topic I would like to see discussed so that we can see how to improve and join forces. Maybe a mail discussion, then a wiki page for the summary, would be good.
Stef

Begin forwarded message:

From: Eliot Miranda <[email protected]>
Subject: Re: Plan/discussion/communication around new object format
Date: May 27, 2012 10:49:54 PM GMT+02:00
To: Stéphane Ducasse <[email protected]>

On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse <[email protected]> wrote:

> Hi Eliot,
>
> do you have a description of the new object format you want to introduce?

The design is in the class comment of CogMemoryManager in the Cog VMMaker packages.

> Then what is your schedule?

This is difficult. I have made a small start and should be able to spend time on it starting soon. I want to have it finished by early next year, but it depends on work schedules etc.

> I would like to see if we can allocate Igor/Esteban time before we run out of money to help on that important topic. Right now the solution is unclear and I did not see any document where we can evaluate and plan how we can help. So do you want help on that topic? And how can people contribute, if at all?

The first thing to do is to read the design document, to see if the Pharo community thinks it is the right direction, and to review it, spot deficiencies etc. So please get those interested to read the class comment of CogMemoryManager in the latest VMMaker.oscog.

Here's the current version of it:

CogMemoryManager is currently a place-holder for the design of the new Cog VM's object representation and garbage collector. The goals for the GC are:

- efficient object representation, a la Eliot Miranda's VisualWorks 64-bit object representation, which uses a 64-bit header, eliminating direct class references so that all objects refer to their classes indirectly.
  Instead the header contains a constant class index, in a field smaller than a full pointer. These class indices are used in inline and first-level method caches, hence they do not have to be updated on GC (although they do have to be traced to be able to GC classes). Classes are held in a sparse weak table. The class table needs only to be indexed by an instance's class index in class hierarchy search, in the class primitive, and in tracing live objects in the heap. The additional header space is allocated to a much expanded identity hash field, reducing hash efficiency problems in identity collections due to the extremely small (11-bit) hash field in the old Squeak GC. The identity hash field is also a key element of the class index scheme. A class's identity hash is its index into the class table, so to create an instance of a class one merely copies its identity hash into the class index field of the new instance. This implies that when classes gain their identity hash they are entered into the class table, and their identity hash is that of a previously unused index in the table. It also implies that there is a maximum number of classes in the table. At least for a few years 64k classes should be enough. A class is entered into the class table in the following operations:
	behaviorHash
	adoptInstance
	instantiate
	become (i.e. if an old class becomes a new class):
		if the target's class field equals the original's identity hash
		and the replacement's identity hash is zero,
			enter the replacement in the class table
  behaviorHash is a special version of identityHash that must be implemented in the image by any object that can function as a class (i.e. Behavior).

- more immediate classes. An immediate Character class would speed up String accessing, especially for WideString, since no instantiation needs to be done on at:put: and no dereference need be done on at:.
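The class-index/identity-hash scheme above can be sketched in C. Everything here is an illustrative assumption, not the VM's actual code: the header layout, the names (behaviorHash, instantiate, classOf), and a strong class table standing in for the real sparse weak one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical header fields: a class index smaller than a full pointer,
 * and a much wider identity hash than Squeak's old 11 bits. */
typedef struct {
    uint32_t classIndex;   /* index into the class table */
    uint32_t identityHash; /* for a class, this IS its table index */
} ObjHeader;

#define MAX_CLASSES (64 * 1024)          /* "64k classes should be enough" */
static ObjHeader *classTable[MAX_CLASSES]; /* weak in the real VM; strong here */
static uint32_t nextClassIndex = 8;        /* 0-7 reserved for special uses */

/* behaviorHash enters a class into the table on first request, so its
 * identity hash doubles as its class index. */
static uint32_t behaviorHash(ObjHeader *aClass) {
    if (aClass->identityHash == 0) {       /* not yet in the table */
        aClass->identityHash = nextClassIndex;
        classTable[nextClassIndex++] = aClass;
    }
    return aClass->identityHash;
}

/* Instantiation merely copies the class's hash into the new instance. */
static ObjHeader instantiate(ObjHeader *aClass) {
    ObjHeader inst = { behaviorHash(aClass), (uint32_t)rand() };
    return inst;
}

/* The class primitive indexes the table with the instance's class index. */
static ObjHeader *classOf(ObjHeader inst) {
    return classTable[inst.classIndex];
}
```

Note how creating an instance involves no pointer to the class at all, which is what lets inline caches compare small constant indices instead of GC-movable pointers.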
  In a 32-bit system tag checking is complex since it is thought important to retain 31-bit SmallIntegers. Hence, as in current Squeak, the least significant bit set implies a SmallInteger, but Characters would likely have a tag pattern of xxx10. Hence masking with 2r11 results in two values for SmallInteger, xxx01 and xxx11. 30-bit characters are more than adequate for Unicode. In a 64-bit system we can use the full three bits and usefully implement an immediate Float. As in VisualWorks, a functional representation takes three bits away from the exponent. Rotating to put the sign bit in the least significant non-tag bit makes expanding and contracting the 8-bit exponent to the 11-bit IEEE double exponent easy and makes comparing negative and positive zero easier (an immediate Float is zero if its unsigned 64 bits are < 16). So the representation looks like

	| 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits |

  For details see "60-bit immediate Floats" below.

- efficient scavenging. The current Squeak GC uses a slow pointer-reversal collector that writes every field in live objects three times in each collection, twice in the pointer-reversing heap traversal to mark live objects and once to update the pointer to its new location. A scavenger writes every field of live data twice in each collection, once as it does a block copy of the object when copying to to space, once as it traverses the live pointers in the to space objects. Of course the block copy is a relatively cheap write.

- lazy become. The JIT's use of inline caching provides a cheap way of avoiding scanning the heap as part of a become (which is the simple approach to implementing become in a system with direct pointers). A becomeForward: on a (set of) non-zero-sized object(s) turns the object into a "corpse" or "forwarding object" whose first (non-header) word/slot is replaced by a pointer to the target of the becomeForward:.
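The 32-bit tag tests described under "more immediate classes" above might look as follows in C. The helper names are ours, and the 2-bit shift for Characters is inferred from the xxx10 pattern; this is a sketch, not the VM's code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 32-bit tag scheme:
 *   lsb set        -> 31-bit SmallInteger (patterns xx01 and xx11 under mask 2r11)
 *   low bits == 10 -> 30-bit immediate Character
 *   low bits == 00 -> ordinary object pointer (aligned allocation)       */
typedef uintptr_t oop;

static int isSmallInteger(oop o) { return (o & 1) == 1; }
static int isCharacter(oop o)    { return (o & 3) == 2; }
static int isPointer(oop o)      { return (o & 3) == 0; }

static intptr_t smallIntegerValue(oop o) { return (intptr_t)o >> 1; }
static oop smallIntegerOop(intptr_t v)   { return ((oop)v << 1) | 1; }

static uint32_t characterValue(oop o)  { return (uint32_t)(o >> 2); }
static oop characterOop(uint32_t code) { return ((oop)code << 2) | 2; }
```

The point of the uneven split is that SmallInteger keeps 31 bits of value while the SmallInteger test stays a single-bit mask; only the rarer Character test needs the two-bit mask.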
  The corpse's class index is set to one that identifies corpses and, because it is a hidden class index, will always fail an inline cache test. The inline cache failure code is then responsible for following the forwarding pointer chain (these are Iliffe vectors :) ) and resolving to the actual target. We have yet to determine exactly how this is done (e.g. change the receiver register and/or stack contents and retry the send, perhaps scanning the current activation). See below on how we deal with becomes on objects with named inst vars. Note that we probably don't have to worry about zero-sized objects. These are unlikely to be passed through the FFI (there is nothing to pass :) ) and so will rarely be becommed. If they are, they can become slowly. Alternatively we can insist that objects are at least 16 bytes in size (see 8-byte alignment below) so that there will always be space for a forwarding pointer. Since none of the immediate classes can have non-immediate instances, and since we allocate the immediate classes the indices corresponding to their tag patterns (SmallInteger = 1, Character = 3, SmallFloat = 4?), we can use all the class indices from 0 to 7 for special uses, e.g. 0 = forward and 1 = header-sized filler.

- pinning. To support a robust and easy-to-use FFI the memory manager must support temporary pinning, where individual objects can be prevented from being moved by the GC for as long as required, either by being one of an in-progress FFI call's arguments, or by having pinning asserted by a primitive (allowing objects to be passed to external code that retains a reference to the object after returning). Pinning probably implies a per-object "is-pinned" bit in the object header. Pinning will be done via lazy become; i.e. an object in new space will be becommed into a pinned object in old space. We will only support pinning in old space.

- efficient old space collection.
  An incremental collector (a la Dijkstra's three colour algorithm) collects old space, e.g. via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum (see free space/free list below).

- 8-byte alignment. It is advantageous for the FFI, for floating-point access, for object movement and for 32/64-bit compatibility to keep object sizes in units of 8 bytes. For the FFI, 8-byte alignment means passing objects to code that expects that requirement (such as modern x86 numeric processing instructions). This implies that
	- the starts of all spaces are aligned on 8-byte boundaries
	- object allocation rounds up the requested size to a multiple of 8 bytes
	- the overflow size field is also 8 bytes
  We shall probably keep the minimum object size at 16 bytes so that there is always room for a forwarding pointer. But this implies that we will need to implement an 8-byte filler to fill holes between objects > 16 bytes whose length mod 16 bytes is 8 bytes, and following pinned objects. We can do this using a special class index, e.g. 1, so that the method that answers the size of an object looks like, e.g.

	chunkSizeOf: oop
		<var: #oop type: #'object *'>
		^object classIndex = 1
			ifTrue: [BaseHeaderSize]
			ifFalse: [BaseHeaderSize
					+ (object slotSize = OverflowSlotSize
						ifTrue: [OverflowSizeBytes]
						ifFalse: [0])
					+ (object slotSize * BytesPerSlot)]

	chunkStartOf: oop
		<var: #oop type: #'object *'>
		^(self cCoerceSimple: oop to: #'char *')
			- ((object classIndex = 1
					or: [object slotSize ~= OverflowSlotSize])
				ifTrue: [0]
				ifFalse: [OverflowSizeBytes])

  For the moment we do not tackle the issue of heap growth and shrinkage with the ability to allocate and deallocate heap segments via memory-mapping. This technique allows space to be released back to the OS by unmapping empty segments.
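The rounding rules above (multiples of 8 bytes, 16-byte minimum so a forwarding pointer always fits) reduce to a couple of lines of C; names and the exact constants as symbols are our assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of allocation-size rounding: round the request up to 8 bytes,
 * then enforce the 16-byte minimum object size so a corpse's forwarding
 * pointer always has room. */
#define ALIGNMENT    8u
#define MIN_OBJ_SIZE 16u

static uint32_t roundedAllocationSize(uint32_t requestedBytes) {
    uint32_t size = (requestedBytes + ALIGNMENT - 1) & ~(ALIGNMENT - 1);
    return size < MIN_OBJ_SIZE ? MIN_OBJ_SIZE : size;
}
```

The mask trick works because ALIGNMENT is a power of two; it is the same computation whether the VM is 32-bit or 64-bit, which is part of the point of fixing alignment at 8 bytes.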
We may revisit this but it is not a key requirement for the first implementation.

The basic approach is to use a fixed-size new space and a growable old space. The new space is a classic three-space nursery a la Ungar's Generation Scavenging: a large eden for new objects and two smaller survivor spaces that exchange roles on each collection, one being the to space to which surviving objects are copied, the other being the from space of the survivors of the previous collection, i.e. the previous to space.

To provide apparent pinning in new space we rely on lazy become. Since most pinned objects will be byte data, and these do not require stack zone activation scanning, the overhead is simply an old space allocation and corpsing.

To provide pinning in old space, large objects are implicitly pinned (because it is expensive to move large objects and, because they are both large and relatively rare, they contribute little to overall fragmentation - as in aggregates, small objects can be used to fill in the spaces between large objects). Hence, objects above a particular size are automatically allocated in old space, rather than new space. Small objects are pinned as per objects in new space, by asserting the pin bit, which will be set automatically when allocating a large object. As a last resort, or by programmer control (the fullGC primitive), old space is collected via mark-sweep (mark-compact) and so the mark phase must build the list of pinned objects around which the sweep/compact phase must carefully step.

Free space in old space is organized by a free list/free tree as in Eliot's VisualWorks 5i old space allocator. There are 64 free lists, indices 1 through 63 holding blocks of space of that size, index 0 holding a semi-balanced ordered tree of free blocks, each node being the head of the list of free blocks of that size.
At the start of the mark phase the free list is thrown away, and the sweep phase coalesces free space and steps over pinned objects as it proceeds. We can reuse the forwarding pointer compaction scheme used in the old collector. Incremental collections merely move unmarked objects to the free lists (as well as nilling weak pointers in weak arrays and scheduling them for finalization). The occupancy of the free lists is represented by a bitmap in a 64-bit integer so that an allocation of size 63 or less can know whether there exists a free chunk of that size, but more importantly can know whether a free chunk larger than it exists in the fixed-size free lists without having to search all larger free list heads.

The incremental collector (a la Dijkstra's three colour algorithm) collects old space via an amount of tracing being hung off scavenges and/or old space allocations at an adaptive rate that keeps full garbage collections to a minimum. [N.B. Not sure how to do this yet. The incremental collector needs to complete a pass often enough to reclaim objects, but infrequently enough not to waste time. So some form of feedback should work. In VisualWorks tracing is broken into quanta of work where image-level code determines the size of a quantum based on how fast the machine is and how big the heap is. This code could easily live in the VM, controllable through vmParameterAt:put:. An alternative would be to use the heartbeat to bound quanta by time. But in any case some amount of incremental collection would be done on old space allocation and scavenging, the amount being chosen to keep pause times acceptably short, and at a rate to reclaim old space before a full GC is required, i.e. at a rate proportional to the growth in old space.] The incremental collector is a state machine, being either marking, nilling weak pointers, or freeing.
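The free-list occupancy bitmap described above (one bit per fixed-size list, tested with a mask instead of a search over list heads) can be sketched as follows; the names are ours, and the real allocator would of course also pop a chunk off the chosen list.

```c
#include <assert.h>
#include <stdint.h>

/* Bit i of the mask is set when free list i (chunks of size i, 1 <= i <= 63)
 * is non-empty; list 0 is the tree of larger blocks. */
static uint64_t freeListMask;

static void noteNonEmpty(unsigned listIndex) { freeListMask |=  (1ull << listIndex); }
static void noteEmpty(unsigned listIndex)    { freeListMask &= ~(1ull << listIndex); }

/* Answer the smallest non-empty fixed-size list holding chunks of at least
 * n units, without walking any list heads; 0 means "fall back to the tree". */
static unsigned listForAllocation(unsigned n) {
    uint64_t candidates = freeListMask & ~((1ull << n) - 1); /* keep bits n..63 */
    if (candidates == 0)
        return 0;
    unsigned idx = 0;                  /* index of lowest set bit = best fit */
    while (!(candidates & 1)) { candidates >>= 1; idx++; }
    return idx;
}
```

A production version would use a count-trailing-zeros instruction (e.g. a compiler builtin) instead of the loop, but the masking idea is the same: a single AND answers "is there any chunk big enough?".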
If nilling weak pointers is not done atomically then there must be a read barrier in weak array at: so that reading from an old space weak array that is holding stale un-nilled references to unmarked objects answers nil rather than a stale reference. Tricks such as including the weak bit in bounds calculations can make this cheap for non-weak arrays. Alternatively, nilling weak pointers can be made an atomic part of incremental collection, which can be made cheaper by maintaining the set of weak arrays (e.g. on a list).

The incremental collector implies a more complex write barrier. Objects are of three colours: black, having been scanned; grey, being scanned; and white, unreached. A mark stack holds the grey objects. If the incremental collector is marking and an unmarked white object is stored into a black object then the stored object must become grey, being added to the mark stack. So the write barrier is essentially

	target isYoung ifFalse:
		[newValue isYoung
			ifTrue: [target isInRememberedSet ifFalse:
						[target addToRememberedSet]]	"target now refers to a young object; it is a root for scavenges"
			ifFalse:
				[(target isBlack
				  and: [igc marking
				  and: [newValue isWhite]]) ifTrue:
					[newValue beGrey]]]	"add newValue to the IGC's markStack for subsequent scanning"

The incremental collector does not detect already marked objects all of whose references have been overwritten by other stores (e.g. in the above, if newValue overwrites the sole remaining reference to a marked object). So the incremental collector only guarantees to collect all garbage created in cycle N at the end of cycle N + 1. The cost is hence slightly worse memory density, but the benefit, provided the IGC works hard enough, is the elimination of long pauses due to full garbage collections, which become actions of last resort or programmer desire.

Lazy become.
As described earlier, the basic idea behind lazy become is to use corpses (forwarding objects) that are followed lazily during GC and inline cache miss. However, a lazy scheme cannot be used on objects with named inst vars without adding checking to all inst var accesses, which we judge too expensive. Instead, when becomming objects with named inst vars, we scan all activations in the stack zone, eagerly becomming these references, and we check for corpses when faulting a context into the stack zone. Essentially, the invariant is that there are no references to corpses from the receiver slots of stack activations. A detail is whether we allow or forbid pinning of closure indirection vectors, or scan the entire stack of each activation. Using a special class index pun for indirection vectors is a cheap way of preventing their becomming/pinning etc., although "don't do that" (don't attempt to pin/become indirection vectors) is also an acceptable response.

60-bit immediate Floats

This is the representation for immediate doubles, used only in the 64-bit implementation. Immediate doubles have the same 52-bit mantissa as IEEE double-precision floating-point, but only 8 bits of exponent. So they occupy just less than the middle 1/8th of the double range. They overlap the normal single-precision floats, which also have 8-bit exponents, but exclude the single-precision denormals (exponent -127) and the single-precision NaNs (exponent +127). +/- zero is just a pair of values with both exponent and mantissa 0.
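Anticipating the encode/decode sequence spelled out below, here is a C sketch of the rotation scheme. The tag value 5 and the 0x380 exponent re-bias are taken from the bit patterns given below; the function names are ours, and the code assumes its argument is already known to be in the immediate range (no range or NaN checking).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAG_BITS   3
#define FLOAT_TAG  5ull
/* Re-bias between the 8-bit exponent subset and the 11-bit IEEE exponent,
 * with the sign bit parked in the lsb (hence the shift by 53, not 52). */
#define EXP_OFFSET (0x380ull << 53)

static uint64_t rotr1(uint64_t x) { return (x >> 1) | (x << 63); }
static uint64_t rotl1(uint64_t x) { return (x << 1) | (x >> 63); }

static double decodeSmallFloat(uint64_t taggedOop) {
    uint64_t bits = taggedOop >> TAG_BITS;   /* [8exp][52mantissa][1sign]  */
    if (bits > 1)                            /* +/-0.0 needs no re-bias    */
        bits += EXP_OFFSET;                  /* [11exp][52mantissa][1sign] */
    bits = rotr1(bits);                      /* [1sign][11exp][52mantissa] */
    double d;
    memcpy(&d, &bits, sizeof d);             /* "through memory", as noted */
    return d;
}

static uint64_t encodeSmallFloat(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    bits = rotl1(bits);                      /* sign to the lsb            */
    if (bits > 1)                            /* leave +/-0.0 alone         */
        bits -= EXP_OFFSET;
    return (bits << TAG_BITS) | FLOAT_TAG;
}
```

Note that after rotating the sign to the lsb, both +0.0 and -0.0 encode to values <= 0xf once tagged, which is the cheap unsigned zero test mentioned below.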
So the non-zero immediate doubles range from
	+/- 0x3800,0000,0000,0001 / 5.8774717541114d-39
to
	+/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38
The encoded tagged form has the sign bit moved to the least significant bit, which allows for faster encode/decode because offsetting the exponent can't overflow into the sign bit and because testing for +/- 0 is an unsigned compare for <= 0xf:

	msb                                                      lsb
	[8 exponent subset bits][52 mantissa bits][1 sign bit][3 tag bits]

So, assuming the tag is 5, the tagged non-zero bit patterns are
	0x0000,0000,0000,001[d/5]
to
	0xffff,ffff,ffff,fff[d/5]
and +/- 0.0d0 is 0x0000,0000,0000,000[d/5].

Encode/decode of non-zero values in machine code looks like:

	Decode:              [8expsubset][52mantissa][1s][3tags]
	shift away tags:     [   000    ][8expsubset][52mantissa][1s]
	add exponent offset: [      11 exponent     ][52mantissa][1s]
	rot sign:            [1s][      11 exponent     ][52mantissa]

	Encode:              [1s][      11 exponent     ][52mantissa]
	rot sign:            [      11 exponent     ][52mantissa][1s]
	sub exponent offset: [   000    ][8expsubset][52mantissa][1s]
	shift:               [8expsubset][52mantissa][1s][   000    ]
	or/add tags:         [8expsubset][52mantissa][1s][  3tags   ]

but this is slower in C because
a) there is no rotate operator, and
b) raw conversion between double and quadword must (at least in the source) move bits through memory (quadword = *(q64 *)&doubleVariable).

Issues:

How do we avoid the Size4Bit for 64 bits? The format word encodes the number of odd bytes, but currently has only 4 bits and hence only supports odd bytes of 0 - 3. We need odd bytes of 0 - 7. But I don't like the separate Size4Bit. Best to change the VI code and have a 5-bit format? We lose one bit but save two bits (isEphemeron and isWeak (or three, if isPointers)) for a net gain of one (or two).

Further, should we keep Squeak's format idea or go for separate bits? For 64 bits we need a 5-bit format field. This contrasts with isPointers, isWeak, isEphemeron, and 3 odd size bits (or a byte size).
The format field is quite economical.

Are class indices in inline caches strong references to classes or weak references?

If strong, then they must be scanned during GC and the methodZone must be flushed on fullGC to reclaim all classes (this looks to be a bug in the V3 Cogit).

If weak, then when the class table loses references, PICs containing freed classes must be freed, and then sends to freed PICs or to caches containing freed classes must be unlinked.

The second approach is faster; the common case is scanning the class table, the uncommon case is freeing classes. The second approach is also better; in-line caches do not prevent reclamation of classes.

Stef

-- 
best,
Eliot
