All,

This is a good discussion that has surfaced many topics related to writing inlinable VM helpers in Java/vmmagic. I have left out the earlier email replies to reduce clutter.
Ultimately we will need to solve all the problems that have surfaced, including making changes to GC/JIT/VM interfaces. I suggest that for right now we focus only on demonstrating the benefit of inlining one specific existing API, gc_alloc_fast(). The debate on interface mods can happen later. How about the following steps?

1) Confirm that Mikhail's translation into java/vmmagic is accurate.
2) Get Jitrino.OPT to inline and optimize this code and generate a correct binary image.
3) Show the performance delta for some workloads.

More comments inlined below --

On 10/11/06, Mikhail Fursov <[EMAIL PROTECTED]> wrote:
GC, VM gurus! I need your help implementing our first helper written with magic. I've started with the GCv41 allocation helper for objects. Please review the way I'm going to implement it and correct me if I have misunderstood something, or confirm if everything is OK.

The native fast path:

Managed_Object_Handle gc_alloc_fast(unsigned in_size, Allocation_Handle ah, void *thread_pointer) {
C1.   assert((in_size % GC_OBJECT_ALIGNMENT) == 0);
C2.   assert (ah);
C3.   unsigned char *next;
C4.   GC_Thread_Info *info = (GC_Thread_Info *) thread_pointer;
C5.   Partial_Reveal_VTable *vtable = ah_to_vtable(ah);
C6.   GC_VTable_Info *gcvt = vtable->get_gcvt();
C7.   unsigned char *cleaned = info->tls_current_cleaned;
C8.   unsigned char *res = info->tls_current_free;
C9.   if (res + in_size <= cleaned) {
C10.      if (gcvt->is_finalizible()) return 0;
C11.      info->tls_current_free = res + in_size;
C12.      *(VT32*)res = ah;
C13.      assert(((POINTER_SIZE_INT)res & (GC_OBJECT_ALIGNMENT - 1)) == 0);
C14.      return res;
C15.  }
C16.  if (gcvt->is_finalizible()) return 0;
C17.  unsigned char *ceiling = info->tls_current_ceiling;
C18.  if (res + in_size <= ceiling) {
C19.      info->tls_current_free = next = info->tls_current_free + in_size;
          // cleaning required
C20.      unsigned char *cleaned_new = next + THREAD_LOCAL_CLEANED_AREA_SIZE;
C21.      if (cleaned_new > ceiling) cleaned_new = ceiling;
C22.      info->tls_current_cleaned = cleaned_new;
C23.      memset(cleaned, 0, cleaned_new - cleaned);
C24.      *(VT32*)res = ah;
C25.      assert(((POINTER_SIZE_INT)res & (GC_OBJECT_ALIGNMENT - 1)) == 0);
C26.      return res;
C27.  }
C28.  return 0;
}

The helper's code:

public static Object gc_alloc(int objSize, int allocationHandle) {
J1.   Address tlsAddr = TLS.getGCThreadLocal();
J2.   Address tlsCurrentFreeFieldAddr = tlsAddr.plus (TLS_CURRENT_FREE_OFFSET);
J3.   Address tlsCurrentCleanedFieldAddr = tlsAddr.plus (TLS_CURRENT_CLEANED_OFFSET);
J4.   Address tlsCurrentFreeAddr = tlsCurrentFreeFieldAddr.loadAddress();
J5.   Address tlsCurrentCleanedAddr = tlsCurrentCleanedFieldAddr.loadAddress();
J6.   Address tlsNewFreeAddr = tlsCurrentFreeAddr.plus(objSize);
      // the fast path without cleaning
J7.   if (tlsNewFreeAddr.LE(tlsCurrentCleanedAddr)) {
J8.       tlsCurrentFreeFieldAddr.store(tlsNewFreeAddr);
J9.       tlsCurrentFreeAddr.store(allocationHandle);
J10.      return tlsCurrentFreeAddr;
J11.  }
J12.  Address tlsCurrentCeilingFieldAddr = tlsAddr.plus (TLS_CURRENT_CEILING_OFFSET);
J13.  Address tlsCurrentCeilingAddr = tlsCurrentCeilingFieldAddr.loadAddress();
      // the fast path with cleaning
J14.  if (tlsNewCurrentFreeAddr.LE(tlsCurrentCeilingAddr)) {
J15.      Address tlsNewCleanedAddr = tlsCurrentCeilingAddr;
J16.      if (tlsCurrentCeilingAddr.diff(tlsNewFreeAddr) > THREAD_LOCAL_CLEANED_AREA_SIZE) {
J17.          Address tlsCleanedNew = tlsNewFreeAddr.plus (THREAD_LOCAL_CLEANED_AREA_SIZE);
J18.      }
J19.      int bytesToClean = tlsNewCleanedAddr.diff(tlsNewFreeAddr);
J20.      org.apache.harmony.vmhelper.native.Utils.memset(tlsNewFreeAddr, bytesToClean, 0);
J21.      tlsCurrentCleanedFieldAddr.store(tlsNewCleanedAddr);
J22.      tlsCurrentFreeFieldAddr.store(tlsNewFreeAddr);
J23.      tlsCurrentFreeAddr.store(allocationHandle);
J24.      return tlsCurrentFreeAddr;
      }
      //the slow path
      //this call will be replaced by JIT with direct native call as VM magic
      org.apache.harmony.vmhelper.native.DRLVMHelper.gc_alloc(objSize, allocationHandle);
}

The problems I see:

1) The problem: GC helper must know GC_Thread_Info struct offsets.
If I understand correctly, you are referring to TLS_CURRENT_FREE_OFFSET and TLS_CURRENT_CEILING_OFFSET. Can we leave this as an ugly hack for right now? That is, hardcode the actual offsets. Something like: "static int TLS_CURRENT_FREE_OFFSET = 0x18;"

2) The problem: Where to keep GC magic code? This code is GC specific and must be available to the bootstrap classloader. The JIT can learn which magic code to inline (the helper type, the helper signature) from its properties (see the opt.emconf file, for example).
It's prototype code for now. It's not critical that we identify its final location at this point. In any case, it definitely belongs to the GC developers.

3) The problem: Is the signature of the gc_alloc method, gc_alloc(int objSize, int allocationHandle), universal for all GCs?
Well, gc_alloc(...) is what the GC/VM interface currently supports. After working with MMTk, I now know this API is *not* universal.

I think it's not. But we can extend the JIT with support for different signatures if needed.

This is correct. We need to extend Jitrino.JET with the MMTk allocation signature. Then we need to discuss the impact on GC/VM/JIT interfaces. I will restart this discussion soon.

4) A new magic method is proposed, line J20:
org.apache.harmony.vmhelper.native.Utils.memset(tlsNewFreeAddr, bytesToClean, 0);
I agree with the previous comments that #4 is not needed.

5) The magic code does not contain the 'finalizable' check. The JIT can do this check during compilation and not generate the fast path. This is another option to pass to the JIT from the GC.
#5 is really independent of writing helpers in java/vmmagic. How about addressing #5 at a later time?

I've numbered the lines of code in case you want to comment on them. Please feel free to review the code and to discuss any other problems I missed.
-- Mikhail Fursov
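
For step 1 above, the two fast paths can be exercised outside the VM with a small simulation. The following is a minimal sketch, not DRLVM code: the class name, the ByteBuffer standing in for the thread-local chunk, the int offsets standing in for Address values, the -1 slow-path marker, and the value of THREAD_LOCAL_CLEANED_AREA_SIZE are all assumptions made for illustration. It follows the C routine (C9, C18 to C23) where the Java translation appears to differ, for example by zeroing from the old cleaned boundary, and, like the helper, it leaves out the finalizable check.

```java
import java.nio.ByteBuffer;

// Simulation only: offsets into a ByteBuffer stand in for raw addresses.
class AllocFastPathSketch {
    // Assumed value for illustration; the real constant is GC-specific.
    static final int THREAD_LOCAL_CLEANED_AREA_SIZE = 64;
    // Hypothetical marker meaning "take the slow path" (the C code returns 0).
    static final int SLOW_PATH = -1;

    final ByteBuffer chunk;
    int free;     // models tls_current_free
    int cleaned;  // models tls_current_cleaned
    int ceiling;  // models tls_current_ceiling

    AllocFastPathSketch(int chunkSize, int initiallyCleaned) {
        chunk = ByteBuffer.allocate(chunkSize); // zero-filled on allocation
        free = 0;
        cleaned = initiallyCleaned;
        ceiling = chunkSize;
        // Mark everything past the cleaned boundary as dirty.
        for (int i = cleaned; i < ceiling; i++) {
            chunk.put(i, (byte) 0xFF);
        }
    }

    // Returns the offset of the new object, or SLOW_PATH.
    int allocFast(int objSize, int allocationHandle) {
        int res = free;
        int next = res + objSize;

        // Fast path without cleaning (C9 to C15, J7 to J11).
        if (next <= cleaned) {
            free = next;
            chunk.putInt(res, allocationHandle); // install the header word
            return res;
        }

        // Fast path with cleaning (C18 to C27).
        if (next <= ceiling) {
            int cleanedNew = Math.min(next + THREAD_LOCAL_CLEANED_AREA_SIZE, ceiling);
            // Zero from the OLD cleaned boundary up to the new one (C23).
            for (int i = cleaned; i < cleanedNew; i++) {
                chunk.put(i, (byte) 0);
            }
            free = next;
            cleaned = cleanedNew;
            chunk.putInt(res, allocationHandle);
            return res;
        }

        return SLOW_PATH; // C28: the caller falls back to the native allocator
    }

    public static void main(String[] args) {
        AllocFastPathSketch tls = new AllocFastPathSketch(256, 32);
        System.out.println(tls.allocFast(16, 0x11));  // fits in the cleaned area
        System.out.println(tls.allocFast(32, 0x22));  // needs cleaning
        System.out.println(tls.allocFast(512, 0x33)); // does not fit in the chunk
    }
}
```

Running the same allocation sequence through a model like this and through the C routine is one cheap way to compare the free and cleaned pointers after each call before involving the JIT.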
--
Weldon Washburn
Intel Middleware Products Division