On Aug 22, 2016, at 9:30 PM, Mandy Chung <mandy.ch...@oracle.com> wrote:
>
> We need to follow up this issue to understand what the interpreter and
> compiler do for this unused slot and whether it’s always zero out.
These slot pairs are a curse, in the same league as endian-ness. Suppose a 64-bit long x lives in L[0] and L[1]. Now suppose that the interpreter (as well it might) has adjacent 32-bit words for those locals. There are four reasonable conventions for apportioning the bits of x into L[0:1]. Call HI(x) the arithmetically high part of x, and LO(x) the other part. Also, call FST(x) the lower-addressed 32-bit component of x, when stored in memory, and SND(x) the other part. Depending on your machine's endian-ness, HI=FST (big-endian, e.g. SPARC) or HI=SND (little-endian, x86). For portable code there are obviously four ways to pack L[0:1]. I've personally seen them all, sometimes as hard-to-find VM bugs.

We're just getting started, though. Now let the interpreter generously allocate 64 bits to each local. The above four cases are still possible, but now we have four 32-bit storage units to play with. That makes (if you do the math) 4x3=12 more theoretically possible ways to store the bits of x into the 128 bits of L[0:1]. I've not seen all 12, but there are several variations that HotSpot has used over time.

Confused yet? There's more: All current HotSpot implementations grow the stack downward, which means that the address of L[0] is *higher* than that of L[1]. This means that the pair of storage units for L[0:1] can be viewed as a memory buffer, but the bits of L[1] come at the lower address. (Once we had a tagged-stack interpreter in which there were extra tag words between the words of L[0] and L[1], for extra fun. We got tired of that.)

There's one more annoyance: The memory block located at L[0:1] must be at least 64 bits wide, but it need not be 64-bit aligned, if the size of a local slot is 32 bits. So on machines that cannot perform unaligned 64-bit access, the interpreter needs to load and store 64-bit values as 32-bit halves. But we can put that aside for now; that's a separable cost borne by 32-bit RISCs.

How do we simplify this? For one thing, view all references to HI and LO with extreme suspicion. That goes for misleadingly simple terms like "the low half of x". On Intel everybody knows that's also FST (the first memory word of x), and nods in agreement, and then when you port to SPARC (that was my job) the nods turn into glassy-eyed stares.

Next, don't trust L[0] and L[1] to work like array elements. Although the bytecode interpreter refers directly to L[0] and indirectly to L[1] when storing x, realize that you don't know exactly how those guys are laid out in memory. The interpreter will make some local decision to avoid the obvious-in-retrospect bug of storing 64 bits to L[0] on a 32-bit machine. The decision might be to form the address of L[1] and treat *that* as the base address of a memory block. The more subtle and principled thing to do would be to form the address of the *end* of L[0] and treat that as the *end* address of a memory block. The two approaches are equivalent on a 32-bit machine, but on a 64-bit machine the former puts the payload only in L[1] and the latter only in L[0].

Meanwhile, the JIT, with its free-wheeling approach to storage allocation, will probably try its best to ignore and forget stupid L[1], allocating a right-sized register or stack slot for L[0]. Thus the interpreter and the JIT can have independent internal conventions for how they assign storage units to L[0:1] and how they use those units to store a 64-bit value. Those independent schemes have to be reconciled along mode change paths: C2I and I2C adapters, deoptimization, and on-stack replacement (= reoptimization). The vframe_hp code does this.
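To pin the terms down, here is a rough C++ sketch of the layout choices above. This is not HotSpot source; HI, LO, FST, SND, and the two store helpers are invented names that just follow the definitions given earlier, with L0 standing for the address of L[0].

    #include <cstdint>
    #include <cstring>

    // Arithmetic halves of a 64-bit value; independent of memory layout.
    static uint32_t HI(uint64_t x) { return (uint32_t)(x >> 32); }
    static uint32_t LO(uint64_t x) { return (uint32_t)x; }

    // Memory halves: FST(x) is whichever 32-bit component lands at the lower
    // address when x is stored in memory, SND(x) is the other one.  On a
    // little-endian machine (x86) FST == LO; on a big-endian machine (SPARC)
    // FST == HI.
    static uint32_t FST(uint64_t x) { uint32_t h[2]; std::memcpy(h, &x, 8); return h[0]; }
    static uint32_t SND(uint64_t x) { uint32_t h[2]; std::memcpy(h, &x, 8); return h[1]; }

    // Two plausible ways for an interpreter to store x into the pair L[0:1].
    // L0 is the address of L[0]; because the stack grows down, L[1] sits
    // slot_bytes *below* it.  slot_bytes is 4 on a 32-bit VM, 8 on a 64-bit VM.
    static void store_from_base_of_L1(char* L0, int slot_bytes, uint64_t x) {
      std::memcpy(L0 - slot_bytes, &x, 8);      // 64-bit block based at L[1]
    }
    static void store_to_end_of_L0(char* L0, int slot_bytes, uint64_t x) {
      std::memcpy(L0 + slot_bytes - 8, &x, 8);  // 64-bit block ending at the end of L[0]
    }
    // With 4-byte slots both stores write the same 8 bytes (L[1] then L[0]);
    // with 8-byte slots the first leaves the payload entirely in L[1] and the
    // second entirely in L[0], which is the divergence described above.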
A strong global convention would be best, such as always using L[0] and always storing all of x in L[0] if it fits, else SND(x) in L[0] and FST(x) in L[1]. I'm not sure (and I doubt) that we are actually that clean.

Any reasonable high-level API for dealing with this stuff will do like the JIT does, and pretend that, whatever the size of L[0] is physically, it contains the whole value assigned to it, without any need to inspect L[1]. That's the best policy for virtualizing stack frames, because it aligns with the plain meaning of bytecodes like "lload_0", which don't mention L[1]. The role of L[1] is to provide "slop space" for internal storage in a tiny interpreter; it has no external role. The convention used in HotSpot and the JVM verifier is to assign a special type, "Top", to L[1], which means "do not look at me; I contain no bits". A virtualized API which produces a view on such an L[1] needs to return some default value (if pressed), and to indicate that the slot has no payload.

HTH
— John

P.S. If all goes well with Valhalla, we will probably get rid of slot pairs altogether in a future version of the JVM bytecodes. They spoil generics over longs and doubles. The 32-bit implementations of JVM interpreters will have to do extra work, such as using 64-bit slot sizes for methods that work with longs or doubles, but it's worth it.
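Coming back to the virtualized-view policy above: here is a rough C++ sketch of how such a view could behave. The types and names are hypothetical, not any real HotSpot or debugger interface, and 64-bit physical slots are assumed for simplicity; the point is only that the whole value is reported at the pair's first index and the second index is reported as Top with no payload.

    #include <cstdint>

    // Verifier-style slot types; Top means "do not look at me; I contain no bits".
    enum class SlotType { Int, Float, Reference, Long, Double, Top };

    struct LocalView {
      SlotType type;   // Top for the second slot of a long/double pair
      uint64_t bits;   // the whole value; meaningful only when type != Top
    };

    // slots is the frame's physical local array (64-bit slots assumed here) and
    // types is the verifier's view of it.  How the interpreter or JIT really
    // packed a long or double into storage is its own private business; the
    // virtualized view pretends the whole value lives at the pair's first index.
    inline LocalView view_local(const uint64_t* slots, const SlotType* types, int i) {
      if (types[i] == SlotType::Top) {
        return { SlotType::Top, 0 };   // no payload; a default value if pressed
      }
      return { types[i], slots[i] };
    }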