Hi Eliot: From my experience with the RoarVM, it seems to be a rather simple exercise to enable the VM to support a custom 'pre-header' for objects. That is, a constant offset in the memory that comes from the allocator, and is normally ignored by the GC.
That allows me to do all kind of stuff. Of course at a cost of a word per object, and at the cost of recompiling the VM. But, that should be a reasonable price to pay for someone doing research on these kind of things. Sometimes a few bits are just not enough, and such a pre-header gives much much more flexibility. For the people interested in that, I could dig out the details (I think, I did that already once on this list). Best regards Stefan On 30 May 2012, at 22:22, Eliot Miranda wrote: > > > On Wed, May 30, 2012 at 12:59 PM, Stéphane Ducasse > <[email protected]> wrote: > I would like to be sure that we can have > - bit for immutable objects > - bits for experimenting. > > There will be quite a few. And one will be able to steal bits from the class > field if one needs fewer classes. I'm not absolutely sure of the layout yet. > But for example > > 8: slot size (255 => extra header word with large size) > 3: odd bytes/fixed fields (odd bytes for non-pointer, fixed fields for > pointer, 7 => # fixed fields is in the class) > 4 bits: format (pointers, indexable, bytes/shorts/longs/doubles indexable, > compiled method, ephemerons, weak, etc) > 1: immutability > 3: GC 2 mark bits. 1 forwarded bit > 20: identity hash > 20: class index > > still leaves 5 bits unused. And stealing 4 bits each from class index still > leaves 64k classes. So this format is simple and provides lots of unused > bits. The format field is a great idea as it combines a number of orthogonal > properties in very few bits. I don't want to include odd bytes in format > because I think a separate field that holds odd bytes and fixed fields is > better use of space. But we can gather statistics before deciding. > > > Stef > > On May 30, 2012, at 8:48 AM, Stéphane Ducasse wrote: > > > Hi guys > > > > Here is an important topic I would like to see discussed so that we see how > > we can improve and join forces. > > May a mail discussion then a wiki for the summary would be good. > > > > > > stef > > > > > > > > Begin forwarded message: > > > >> From: Eliot Miranda <[email protected]> > >> Subject: Re: Plan/discussion/communication around new object format > >> Date: May 27, 2012 10:49:54 PM GMT+02:00 > >> To: Stéphane Ducasse <[email protected]> > >> > >> > >> > >> On Sat, May 26, 2012 at 1:46 AM, Stéphane Ducasse > >> <[email protected]> wrote: > >> Hi eliot > >> > >> do you have a description of the new object format you want to introduce? > >> > >> The design is in the class comment of CogMemoryManager in the Cog VMMaker > >> packages. > >> > >> Then what is your schedule? > >> > >> This is difficult. I have made a small start and should be able to spend > >> time on it starting soon. I want to have it finished by early next year. > >> But it depends on work schedules etc. > >> > >> > >> I would like to see if we can allocate igor/esteban time before we run out > >> of money > >> to help on that important topic. > >> Now the solution is unclear and I did not see any document where we can > >> evaluate > >> and plan how we can help. So do you want help on that topic? Then how can > >> people > >> contribute if any? > >> > >> The first thing to do is to read the design document, to see if the Pharo > >> community thinks it is the right direction, and to review it, spot > >> deficiencies etc. So please get those interested to read the class > >> comment of CogMemoryManager in the latest VMMaker.oscog. > >> > >> Here's the current version of it: > >> > >> CogMemoryManager is currently a place-holder for the design of the new Cog > >> VM's object representation and garbage collector. The goals for the GC are > >> > >> - efficient object representation a la Eliot Miranda's VisualWorks 64-bit > >> object representation that uses a 64-bit header, eliminating direct class > >> references so that all objects refer to their classes indirectly. Instead > >> the header contains a constant class index, in a field smaller than a full > >> pointer, These class indices are used in inline and first-level method > >> caches, hence they do not have to be updated on GC (although they do have > >> to be traced to be able to GC classes). Classes are held in a sparse weak > >> table. The class table needs only to be indexed by an instance's class > >> index in class hierarchy search, in the class primitive, and in tracing > >> live objects in the heap. The additional header space is allocated to a > >> much expanded identity hash field, reducing hash efficiency problems in > >> identity collections due to the extremely small (11 bit) hash field in the > >> old Squeak GC. The identity hash field is also a key element of the class > >> index scheme. A class's identity hash is its index into the class table, > >> so to create an instance of a class one merely copies its identity hash > >> into the class index field of the new instance. This implies that when > >> classes gain their identity hash they are entered into the class table and > >> their identity hash is that of a previously unused index in the table. It > >> also implies that there is a maximum number of classes in the table. At > >> least for a few years 64k classes should be enough. A class is entered > >> into the class table in the following operations: > >> behaviorHash > >> adoptInstance > >> instantiate > >> become (i.e. if an old class becomes a new class) > >> if target class field's = to original's id hash > >> and replacement's id hash is zero > >> enter replacement in class table > >> behaviorHash is a special version of identityHash that must be implemented > >> in the image by any object that can function as a class (i.e. Behavior). > >> > >> - more immediate classes. An immediate Character class would speed up > >> String accessing, especially for WideString, since no instatiation needs > >> to be done on at:put: and no dereference need be done on at:. In a 32-bit > >> system tag checking is complex since it is thought important to retain > >> 31-bit SmallIntegers. Hence, as in current Squeak, the least significant > >> bit set implies a SmallInteger, but Characters would likely have a tag > >> pattern of xxx10. Hence masking with 11 results in two values for > >> SmallInteger, xxx01 and xxx11. 30-bit characters are more than adequate > >> for Unicode. In a 64-bit system we can use the full three bits and > >> usefully implement an immediate Float. As in VisualWorks a functional > >> representation takes three bits away from the exponent. Rotating to put > >> the sign bit in the least significant non-tag bit makes expanding and > >> contracting the 8-bit exponent to the 11-bit IEEE double exponent easy ad > >> makes comparing negative and positive zero easier (an immediate Float is > >> zero if its unsigned 64-bits are < 16). So the representation looks like > >> | 8 bit exponent | 52 bit mantissa | sign bit | 3 tag bits | > >> For details see "60-bit immediate Floats" below. > >> > >> > >> - efficient scavenging. The current Squeak GC uses a slow > >> pointer-reversal collector that writes every field in live objects three > >> times in each collection, twice in the pointer-reversing heap traversal to > >> mark live objects and once to update the pointer to its new location. A > >> scavenger writes every field of live data twice in each collection, once > >> as it does a block copy of the object when copying to to space, once as it > >> traverses the live pointers in the to space objects. Of course the block > >> copy is a relatively cheap write. > >> > >> - lazy become. The JIT's use of inline cacheing provides a cheap way of > >> avoiding scanning the heap as part of a become (which is the simple > >> approach to implementing become in a system with direct pointers). A > >> becomeForward: on a (set of) non-zero-sized object(s) turns the object > >> into a "corpse" or "forwarding object" whose first (non-header) word/slot > >> is replaced by a pointer to the target of the becomeForward:. The > >> corpse's class index is set to one that identifies corpses and, because it > >> is a hidden class index, will always fail an inline cache test. The > >> inline cache failure code is then responsible for following the forwarding > >> pointer chain (these are Iliffe vectors :) ) and resolving to the actual > >> target. We have yet to determine exactly how this is done (e.g. change > >> the receiver register and/or stack contents and retry the send, perhaps > >> scanning the current activation). See below on how we deal with becomes > >> on objects with named inst vars. Note that we probably don't have to > >> worry about zero-sized objects. These are unlikely to be passed through > >> the FFI (there is nothing to pass :) ) and so will rarely be becommed. If > >> they do, they can become slowly. Alternatively we can insist that objects > >> are at least 16 bytes in size (see a8-byte alignment below) so that there > >> will always be space for a forwarding pointer. Since none of the > >> immediate classes can have non-immediate instances and since we allocate > >> the immediate classes indices corresponding to their tag pattern > >> (SmallInteger = 1, Character = 3, SmallFloat = 4?) we can use all the > >> class indices from 0 to 7 for special uses, 0 = forward, and e.g. 1 = > >> header-sized filler. > >> > >> - pinning. To support a robust and easy-to-use FFI the memory manager > >> must support temporary pinning where individual objects can be prevented > >> from being moved by the GC for as long as required, either by being one of > >> an in-progress FFI call's arguments, or by having pinning asserted by a > >> primitive (allowing objects to be passed to external code that retains a > >> reference to the object after returning). Pinning probably implies a > >> per-object "is-pinned" bit in the object header. Pinning will be done via > >> lazy become; i..e an object in new space will be becommed into a pinned > >> object in old space. We will only support pinning in old space. > >> > >> - efficient old space collection. An incremental collector (a la > >> Dijkstra's three colour algorithm) collects old space, e.g. via an amount > >> of tracing being hung off scavenges and/or old space allocations at an > >> adaptive rate that keeps full garbage collections to a minimum. (see free > >> space/free list below) > >> > >> - 8-byte alignment. It is advantageous for the FFI, for floating-point > >> access, for object movement and for 32/64-bit compatibility to keep object > >> sizes in units of 8 bytes. For the FFI, 8-byte alignment means passing > >> objects to code that expects that requirement (such as modern x86 numeric > >> processing instructions). This implies that > >> - the starts of all spaces are aligned on 8-byte boundaries > >> - object allocation rounds up the requested size to a multiple of 8 > >> bytes > >> - the overflow size field is also 8 bytes > >> We shall probably keep the minimum object size at 16 bytes so that there > >> is always room for a forwarding pointer. But this implies that we will > >> need to implement an 8-byte filler to fill holes between objects > 16 > >> bytes whose length mod 16 bytes is 8 bytes and following pinned objects. > >> We can do this using a special class index, e.g. 1, so that the method > >> that answers the size of an object looks like, e.g. > >> chunkSizeOf: oop > >> <var: #oop type: #'object *'> > >> ^object classIndex = 1 > >> ifTrue: [BaseHeaderSize] > >> ifFalse: [BaseHeaderSize > >> + (object slotSize = OverflowSlotSize > >> ifTrue: [OverflowSizeBytes] > >> ifFalse: [0]) > >> + (object slotSize * BytesPerSlot)] > >> > >> chunkStartOf: oop > >> <var: #oop type: #'object *'> > >> ^(self cCoerceSimple: oop to: #'char *') > >> - ((object classIndex = 1 > >> or: [object slotSize ~= OverflowSlotSize]) > >> ifTrue: [0] > >> ifFalse: [OverflowSizeBytes]) > >> > >> For the moment we do not tackle the issue of heap growth and shrinkage > >> with the ability to allocate and deallocate heap segments via > >> memory-mapping. This technique allows space to be released back to the OS > >> by unmapping empty segments. We may revisit this but it is not a key > >> requirement for the first implementation. > >> > >> The basic approach is to use a fixed size new space and a growable old > >> space. The new space is a classic three-space nursery a la Ungar's > >> Generation Scavenging, a large eden for new objects and two smaller > >> survivor spaces that exchange roles on each collection, one being the to > >> space to which surviving objects are copied, the other being the from > >> space of the survivors of the previous collection, i.e. the previous to > >> space. > >> > >> To provide apparent pinning in new space we rely on lazy become. Since > >> most pinned objects will be byte data and these do not require stack zone > >> activation scanning, the overhead is simply an old space allocation and > >> corpsing. > >> > >> To provide pinning in old space, large objects are implicitly pinned > >> (because it is expensive to move large objects and, because they are both > >> large and relatively rare, they contribute little to overall fragmentation > >> - as in aggregates, small objects can be used to fill-in the spaces > >> between karge objects). Hence, objects above a particular size are > >> automatically allocated in old space, rather than new space. Small > >> objects are pinned as per objects in new space, by asserting the pin bit, > >> which will be set automaticaly when allocating a large object. As a last > >> resort, or by programmer control (the fullGC primitive) old space is > >> collected via mark-sweep (mark-compact) and so the mark phase must build > >> the list of pinned objects around which the sweep/compact phase must > >> carefully step. > >> > >> Free space in old space is organized by a free list/free tree as in > >> Eliot's VisualWorks 5i old space allocator. There are 64 free lists, > >> indices 1 through 63 holding blocks of space of that size, index 0 holding > >> a semi-balanced ordered tree of free blocks, each node being the head of > >> the list of free blocks of that size. At the start of the mark phase the > >> free list is thrown away and the sweep phase coallesces free space and > >> steps over pinned objects as it proceeds. We can reuse the forwarding > >> pointer compaction scheme used in the old collector. Incremental > >> collections merely move unmarked objects to the free lists (as well as > >> nilling weak pointers in weak arrays and scheduling them for > >> finalization). The occupancy of the free lists is represented by a bitmap > >> in a 64-bit integer so that an allocation of size 63 or less can know > >> whether there exists a free chunk of that size, but more importantly can > >> know whether a free chunk larger than it exists in the fixed size free > >> lists without having to search all larger free list heads. > >> > >> The incremental collector (a la Dijkstra's three colour algorithm) > >> collects old space via an amount of tracing being hung off scavenges > >> and/or old space allocations at an adaptive rate that keeps full garbage > >> collections to a minimum. [N.B. Not sure how to do this yet. The > >> incremental collector needs to complete a pass often enough to reclaim > >> objects, but infrequent enough not to waste time. So some form of > >> feedback should work. In VisualWorks tracing is broken into quanta or > >> work where image-level code determines the size of a quantum based on how > >> fast the machine is, and how big the heap is. This code could easily live > >> in the VM, controllable through vmParameterAt:put:. An alternative would > >> be to use the heartbeat to bound quanta by time. But in any case some > >> amount of incremental collection would be done on old space allocation and > >> scavenging, the ammount being chosen to keep pause times acceptably short, > >> and at a rate to reclaim old space before a full GC is required, i.e. at a > >> rate proportional to the growth in old space]. The incemental collector is > >> a state machine, being either marking, nilling weak pointers, or freeing. > >> If nilling weak pointers is not done atomically then there must be a read > >> barrier in weak array at: so that reading from an old space weak array > >> that is holding stale un-nilled references to unmarked objects. Tricks > >> such as including the weak bit in bounds calculations can make this cheap > >> for non-weak arrays. Alternatively nilling weak pointers can be made an > >> atomic part of incremental collection, which can be made cheaper by > >> maintaining the set of weak arrays (e.g. on a list). > >> > >> The incremental collector implies a more complex write barrier. Objects > >> are of three colours, black, having been scanned, grey, being scanned, and > >> white, unreached. A mark stack holds the grey objects. If the > >> incremental collector is marking and an unmarked white object is stored > >> into a black object then the stored object must become grey, being added > >> to the mark stack. So the wrte barrier is essentially > >> target isYoung ifFalse: > >> [newValue isYoung > >> ifTrue: [target isInRememberedSet ifFalse: > >> [target addToRememberedSet]] "target > >> now refers to a young object; it is a root for scavenges" > >> ifFalse: > >> [(target isBlack > >> and: [igc marking > >> and: [newValue isWhite]]) ifTrue: > >> [newValue beGrey]]] "add newValue to > >> IGC's markStack for subsequent scanning" > >> > >> The incremental collector does not detect already marked objects all of > >> whose references have been overwritten by other stores (e.g. in the above > >> if newValue overwrites the sole remaining reference to a marked object). > >> So the incremental collector only guarantees to collect all garbage > >> created in cycle N at the end of cycle N + 1. The cost is hence slightly > >> worse memory density but the benefit, provided the IGC works hard enough, > >> is the elimination of long pauses due to full garbage collections, which > >> become actions of last resort or programmer desire. > >> > >> Lazy become. > >> > >> As described earlier the basic idea behind lazy become is to use corpses > >> (forwarding objects) that are followed lazily during GC and inline cache > >> miss. However, a lazy scheme cannot be used on objects with named inst > >> vars without adding checking to all inst var accesses, which we judge too > >> expensive. Instead, when becomming objects with named inst vars, we scan > >> all activations in the stack zone, eagerly becomming these references, and > >> we check for corpses when faulting in a context into the stack zone. > >> Essentially, the invariant is that there are no references to corpses from > >> the receiver slots of stack activations. A detail is whether we allow or > >> forbid pinning of closure indirection vectors, or scan the entire stack of > >> each activation. Using a special class index pun for indirection vectors > >> is a cheap way of preventing their becomming/pinning etc. Although "don't > >> do that" (don't attempt to pin/become indirection vectors) is also an > >> acceptable response. > >> > >> 60-bit immediate Floats > >> Representation for immediate doubles, only used in the 64-bit > >> implementation. Immediate doubles have the same 52 bit mantissa as IEEE > >> double-precision floating-point, but only have 8 bits of exponent. So > >> they occupy just less than the middle 1/8th of the double range. They > >> overlap the normal single-precision floats which also have 8 bit > >> exponents, but exclude the single-precision denormals (exponent-127) and > >> the single-precsion NaNs (exponent +127). +/- zero is just a pair of > >> values with both exponent and mantissa 0. > >> So the non-zero immediate doubles range from > >> +/- 0x3800,0000,0000,0001 / 5.8774717541114d-39 > >> to +/- 0x47ff,ffff,ffff,ffff / 6.8056473384188d+38 > >> The encoded tagged form has the sign bit moved to the least significant > >> bit, which allows for faster encode/decode because offsetting the exponent > >> can't overflow into the sign bit and because testing for +/- 0 is an > >> unsigned compare for <= 0xf: > >> msb > >> lsb > >> [8 exponent subset bits][52 mantissa bits ][1 sign bit][3 tag bits] > >> So assuming the tag is 5, the tagged non-zero bit patterns are > >> 0x0000,0000,0000,001[d/5] > >> to 0xffff,ffff,ffff,fff[d/5] > >> and +/- 0d is 0x0000,0000,0000,000[d/5] > >> Encode/decode of non-zero values in machine code looks like: > >> msb > >> lsb > >> Decode: [8expsubset][52mantissa][1s][3tags] > >> shift away tags: [ 000 ][8expsubset][52mantissa][1s] > >> add exponent offset: [ 11 exponent ][52mantissa][1s] > >> rot sign: [1s][ 11 exponent > >> ][52mantissa] > >> > >> Encode: [1s][ 11 exponent > >> ][52mantissa] > >> rot sign: [ 11 exponent > >> ][52mantissa][1s] > >> sub exponent offset: [ 000 ][8expsubset][52 mantissa][1s] > >> shift: [8expsubset][52 > >> mantissa][1s][ 000 ] > >> or/add tags: [8expsubset][52mantissa][1s][3tags] > >> but is slower in C because > >> a) there is no rotate, and > >> b) raw conversion between double and quadword must (at least in the > >> source) move bits through memory ( quadword = *(q64 *)&doubleVariable). > >> > >> Issues: > >> How do we avoid the Size4Bit for 64-bits? The format word encodes the > >> number of odd bytes, but currently has only 4 bits and hence only supports > >> odd bytes of 0 - 3. We need odd bytes of 0 - 7. But I don't like the > >> separate Size4Bit. Best to change the VI code and have a 5 bit format? > >> We loose one bit but save two bits (isEphemeron and isWeak (or three, if > >> isPointers)) for a net gain of one (or two) > >> > >> Further, keep Squeak's format idea or go for separate bits? For 64-bits > >> we need a 5 bit format field. This contrasts with isPointers, isWeak, > >> isEphemeron, 3 odd size bits (or byte size).. format field is quite > >> economical. > >> > >> Are class indices in inline caches strong references to classes or weak > >> references? > >> If strong then they must be scanned during GC and the methodZone must be > >> flushed on fullGC to reclaim all classes (this looks to be a bug in the V3 > >> Cogit). > >> If weak then when the class table loses references, PICs containing freed > >> classes must be freed and then sends to freed PICs or containing freed > >> clases must be unlinked. > >> The second approach is faster; the common case is scanning the class > >> table, the uncommon case is freeing classes. The second approach is > >> better; in-line caches do not prevent reclamation of classes. > >> > >> > >> Stef > >> > >> > >> > >> -- > >> best, > >> Eliot > >> > > > > > > > > -- > best, > Eliot > -- Stefan Marr Software Languages Lab Vrije Universiteit Brussel Pleinlaan 2 / B-1050 Brussels / Belgium http://soft.vub.ac.be/~smarr Phone: +32 2 629 2974 Fax: +32 2 629 3525
