Re: FFI calls: is it possible to allocate a small memory block on a stack?
On 23/04/2010 19:03, Denys Rtveliashvili wrote:
>>     Tue Dec 1 16:03:21 GMT 2009  Simon Marlow <marlo...@gmail.com>
>>       * Make allocatePinned use local storage, and other refactorings
>
> The version I have checked out is 6.12, which is why I haven't seen this
> patch. Are there any plans for including this patch in the next GHC
> release?

It'll be in the next major release (6.14.1).

>> Right, but these are not common cases that need to be optimised. newCAF
>> is only called once per CAF; thereafter it is accessed without locks.
>
> I can't recall off the top of my head, but I think I had a case where
> newCAF was used very actively in a simple piece of code. The code looked
> like this:
>
>     sequence_ $ replicate N $ doSmth
>
> The Cmm code showed that it produced calls to newCAF and something
> related to black holes.

Right, but newCAF should only be called once for any given CAF; thereafter the CAF will have been updated.

> And when I added "return ()" after that line, the black holes and the
> calls to newCAF disappeared. It was on 6.12.1, I believe. I still have
> no idea why it happened and why these black holes were necessary, but
> I'll try to reproduce it one more time and show you an example if it is
> of any interest to you.

If you find a case where newCAF is being called repeatedly, that would be interesting, yes.

>> It may be that we could find benchmarks where access to the block
>> allocator is the performance bottleneck; indeed, in the parallel GC we
>> sometimes see contention for it. If that turns out to be a problem then
>> we may need to think about per-CPU free lists in the block allocator,
>> but I think it would entail a fair bit of complexity and, if we're not
>> careful, extra memory overhead, e.g. where one CPU has all the free
>> blocks in its local free list and the others have none. So I'd like to
>> avoid going down that route unless we absolutely have to. The block
>> allocator is nice and simple right now.
> I suppose I should check out the HEAD then and give it a try, because
> earlier I had performance issues in the threaded runtime (~20% overhead
> and far more noise) in an application which was doing some slicing,
> reshuffling and composing of text via ByteStrings, with a modest amount
> of passing data around via Chans.

I'd be interested in seeing a program that has 20% overhead with -threaded. You should watch out for bound threads, though: with -threaded the main thread is a bound thread, and communication with the main thread is much slower than between unbound threads. See

http://www.haskell.org/ghc/docs/latest/html/libraries/base-4.2.0.1/Control-Concurrent.html#8

> On a slightly different topic: could you please point me to the place
> where stg_upd_frame_info is generated? I can't find it in *.c, *.cmm or
> *.hs, and I guess it is something very special.

rts/Updates.cmm:

    INFO_TABLE_RET( stg_upd_frame, UPDATE_FRAME, UPD_FRAME_PARAMS)
    {
        ...
    }

Cheers,
Simon

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
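Simon's bound-thread caveat can be made concrete. The following is a hedged sketch (the names `work` and `realMain` are illustrative, not from the thread): fork the real work onto an unbound thread and let the bound main thread only wait for the result.

```haskell
module Main where

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- With -threaded the main thread is a *bound* thread, so Chan/MVar traffic
-- involving it is slower than between unbound (green) threads. A common
-- workaround is to fork the real work onto an unbound thread and have the
-- bound main thread only wait for the final result.

work :: Int               -- stand-in for the program's real computation
work = sum [1 .. 100]

realMain :: IO Int
realMain = return work

main :: IO ()
main = do
    done <- newEmptyMVar
    _ <- forkIO (realMain >>= putMVar done)  -- runs on an unbound thread
    takeMVar done >>= print
```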
Re: FFI calls: is it possible to allocate a small memory block on a stack?
On Fri, Apr 23, 2010 at 07:07:29PM +0100, Denys Rtveliashvili wrote:
> It appears that I am on 6.12. Strange, as I thought I had checked out
> the HEAD by following the instructions on the wiki:
>
>     darcs get --partial http://darcs.haskell.org/ghc
>
> The wiki does not say explicitly what will be checked out, so I expected
> it to be HEAD.

That will give you the HEAD. If you check configure.ac, it will include this line:

    AC_INIT([The Glorious Glasgow Haskell Compilation System], [6.13], [glasgow-haskell-b...@haskell.org], [ghc])

Thanks
Ian
Re: FFI calls: is it possible to allocate a small memory block on a stack?
On 23/04/2010 04:39, Denys Rtveliashvili wrote:
> OK, the code I have checked out from the repository contains this in
> rts/sm/Storage.h:
>
>     extern bdescr * pinned_object_block;
>
> and in rts/sm/Storage.c:
>
>     bdescr *pinned_object_block;

Ah, I was looking in the HEAD, where I've already fixed this by moving pinned_object_block into the Capability and hence making it CPU-local. The patch that fixed it was

    Tue Dec 1 16:03:21 GMT 2009  Simon Marlow <marlo...@gmail.com>
      * Make allocatePinned use local storage, and other refactorings

> As for locking, here is one of the examples:
>
>     StgPtr
>     allocatePinned( lnat n )
>     {
>         StgPtr p;
>         bdescr *bd = pinned_object_block;
>
>         // If the request is for a large object, then allocate()
>         // will give us a pinned object anyway.
>         if (n >= LARGE_OBJECT_THRESHOLD/sizeof(W_)) {
>             p = allocate(n);
>             Bdescr(p)->flags |= BF_PINNED;
>             return p;
>         }
>
>         ACQUIRE_SM_LOCK; // [RTVD: here we acquire the lock]
>         TICK_ALLOC_HEAP_NOCTR(n);
>         CCS_ALLOC(CCCS,n);
>
>         // If we don't have a block of pinned objects yet, or the current
>         // one isn't large enough to hold the new object, allocate a new one.
>         if (bd == NULL || (bd->free + n) > (bd->start + BLOCK_SIZE_W)) {
>             pinned_object_block = bd = allocBlock();
>             dbl_link_onto(bd, &g0s0->large_objects);
>             g0s0->n_large_blocks++;
>             bd->gen_no = 0;
>             bd->step   = g0s0;
>             bd->flags  = BF_PINNED | BF_LARGE;
>             bd->free   = bd->start;
>             alloc_blocks++;
>         }
>
>         p = bd->free;
>         bd->free += n;
>         RELEASE_SM_LOCK; // [RTVD: here we release the lock]
>         return p;
>     }

Yes, this was also fixed by the aforementioned patch. Bear in mind that in the vast majority of programs allocatePinned is not in the inner loop, which is why it hasn't been a priority to optimise it until now.

> Of course, TICK_ALLOC_HEAP_NOCTR and CCS_ALLOC may require
> synchronization if they use shared state (which is, again, probably
> unnecessary). However, in case no profiling is going on and
> pinned_object_block is TSO-local, isn't it possible to remove locking
> completely from this code?
> The only case where locking will be necessary is when a fresh block has
> to be allocated, and that can be done within the allocBlock method (or,
> more precisely, by using allocBlock_lock).

TSO-local would be bad: TSOs are lightweight threads and in many cases are smaller than a block. Capability-local is what you want.

> The ACQUIRE_SM_LOCK/RELEASE_SM_LOCK pair is present in other places too,
> but I have not analysed yet whether it is really necessary there. For
> example, things like newCAF and newDynCAF are wrapped in it.

Right, but these are not common cases that need to be optimised. newCAF is only called once per CAF; thereafter it is accessed without locks.

It may be that we could find benchmarks where access to the block allocator is the performance bottleneck; indeed, in the parallel GC we sometimes see contention for it. If that turns out to be a problem then we may need to think about per-CPU free lists in the block allocator, but I think it would entail a fair bit of complexity and, if we're not careful, extra memory overhead, e.g. where one CPU has all the free blocks in its local free list and the others have none. So I'd like to avoid going down that route unless we absolutely have to. The block allocator is nice and simple right now.

Cheers,
Simon
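Since mallocForeignPtrBytes allocates through allocatePinned, a microbenchmark like the following exercises exactly the locked path being discussed. This is a hedged sketch (thread and iteration counts are arbitrary, not from the thread); under -threaded with `+RTS -N` it contends on SM_LOCK in 6.12 and should not in HEAD, where the pinned block is Capability-local.

```haskell
module Main where

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_)
import Foreign.ForeignPtr (mallocForeignPtrBytes, withForeignPtr)
import Foreign.Storable (pokeByteOff)

-- Each mallocForeignPtrBytes call goes through allocatePinned, so several
-- threads doing small pinned allocations hammer the allocator's lock path.
worker :: MVar () -> IO ()
worker done = do
    replicateM_ 10000 $ do
        fp <- mallocForeignPtrBytes 16
        withForeignPtr fp $ \p -> pokeByteOff p 0 (0 :: Int)
    putMVar done ()

nThreads :: Int
nThreads = 4

main :: IO ()
main = do
    dones <- mapM (const newEmptyMVar) [1 .. nThreads]
    forM_ dones (forkIO . worker)
    mapM_ takeMVar dones
    putStrLn "done"
```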
Re: FFI calls: is it possible to allocate a small memory block on a stack?
Hi Simon,

>> OK, the code I have checked out from the repository contains this in
>> rts/sm/Storage.h:
>>
>>     extern bdescr * pinned_object_block;
>>
>> and in rts/sm/Storage.c:
>>
>>     bdescr *pinned_object_block;
>
> Ah, I was looking in the HEAD, where I've already fixed this by moving
> pinned_object_block into the Capability and hence making it CPU-local.
> The patch that fixed it was
>
>     Tue Dec 1 16:03:21 GMT 2009  Simon Marlow <marlo...@gmail.com>
>       * Make allocatePinned use local storage, and other refactorings

The version I have checked out is 6.12, which is why I haven't seen this patch. Are there any plans for including this patch in the next GHC release?

> Yes, this was also fixed by the aforementioned patch. Bear in mind that
> in the vast majority of programs allocatePinned is not in the inner
> loop, which is why it hasn't been a priority to optimise it until now.

I guess code which makes use of ByteStrings (especially when it splits them into many smaller substrings) calls allocatePinned very frequently, even within inner loops.

> TSO-local would be bad: TSOs are lightweight threads and in many cases
> are smaller than a block. Capability-local is what you want.

Ah... Yes, capabilities are a far better choice.

> Right, but these are not common cases that need to be optimised. newCAF
> is only called once per CAF; thereafter it is accessed without locks.

I can't recall off the top of my head, but I think I had a case where newCAF was used very actively in a simple piece of code. The code looked like this:

    sequence_ $ replicate N $ doSmth

The Cmm code showed that it produced calls to newCAF and something related to black holes. And when I added "return ()" after that line, the black holes and the calls to newCAF disappeared. It was on 6.12.1, I believe. I still have no idea why it happened and why these black holes were necessary, but I'll try to reproduce it one more time and show you an example if it is of any interest to you.
> It may be that we could find benchmarks where access to the block
> allocator is the performance bottleneck; indeed, in the parallel GC we
> sometimes see contention for it. If that turns out to be a problem then
> we may need to think about per-CPU free lists in the block allocator,
> but I think it would entail a fair bit of complexity and, if we're not
> careful, extra memory overhead, e.g. where one CPU has all the free
> blocks in its local free list and the others have none. So I'd like to
> avoid going down that route unless we absolutely have to. The block
> allocator is nice and simple right now.

I suppose I should check out the HEAD then and give it a try, because earlier I had performance issues in the threaded runtime (~20% overhead and far more noise) in an application which was doing some slicing, reshuffling and composing of text via ByteStrings, with a modest amount of passing data around via Chans.

On a slightly different topic: could you please point me to the place where stg_upd_frame_info is generated? I can't find it in *.c, *.cmm or *.hs, and I guess it is something very special.

With kind regards,
Denys Rtveliashvili
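For reference, the CAF behaviour debated above can be illustrated with a minimal sketch (the names are illustrative, not from the thread):

```haskell
module Main where

-- 'table' is a CAF (constant applicative form): a top-level value taking no
-- arguments. The RTS calls newCAF the first time it is entered; the closure
-- is then updated in place, so later uses touch neither newCAF nor a lock.
table :: [Int]
table = map (* 2) [1 .. 10]

-- An ordinary function is not a CAF: each call rebuilds its result.
double :: Int -> Int
double x = x * 2

main :: IO ()
main = print (sum table + double 1)   -- table is built once, on first use
```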
Re: FFI calls: is it possible to allocate a small memory block on a stack?
Bertram,

It appears that I am on 6.12. Strange, as I thought I had checked out the HEAD by following the instructions on the wiki:

    darcs get --partial http://darcs.haskell.org/ghc

The wiki does not say explicitly what will be checked out, so I expected it to be HEAD.

With kind regards,
Denys Rtveliashvili

Denys Rtveliashvili wrote:
> OK, the code I have checked out from the repository contains this in
> rts/sm/Storage.h: [global pinned_object_block variable]

Odd. This was changed in GHC HEAD by a patch dating Dec 1st 2009:

    Tue Dec 1 17:03:21 CET 2009  Simon Marlow <marlo...@gmail.com>
      * Make allocatePinned use local storage, and other refactorings

      This is a batch of refactoring to remove some of the GC's global
      state, as we move towards CPU-local GC.
      ...
      - allocatePinned() was still allocating from global storage and
        taking a lock each time, now it uses local storage.
        (mallocForeignPtrBytes should be faster with -threaded).
      ...

which turned pinned_object_block into a per-capability variable. So which version of ghc are you looking at?

regards,
Bertram
Re: FFI calls: is it possible to allocate a small memory block on a stack?
Thank you, Simon

I have identified a number of problems and have created patches for a couple of them. Ticket #4004 was raised in Trac, and I hope that someone will take a look and put the patches into the repository if they look good.

Things I did:
* Inlining for a few functions
* Changed multiplication and division in include/Cmm.h to bit shifts

Things that can be done:
* Optimisations in the threaded RTS. Locking is used frequently, and every lock operation on a normal POSIX mutex costs about 20 nanoseconds on my computer.
* Moving some computations from Cmm code to Haskell. This requires passing information on word size and the like to Haskell code, but the benefit is that some computations can be performed statically, as they depend primarily on the data type we allocate space for.
* A fix/improvement for the Cmm compiler. There is some code in it already which substitutes bit shifts for divisions and multiplications by 2^n, but for some reason it does not work. Also, divisions can be replaced by multiplications with bit shifts in the general case.

---

Also, while looking at this I've got a number of questions. One of them is this: what is the meaning of pinned_object_block in rts/sm/Storage.h, and why is it shared between TSOs? It looks like allocatePinned has to lock on SM_MUTEX every time it is called (in the threaded RTS) because other threads can be accessing it. More than that, this block of memory is assigned to the nursery of one of the TSOs. Why should it be shared with the rest of the world instead of being local to a TSO?

On a side note, is the London HUG still active? The website seems to be down...

With kind regards,
Denys Rtveliashvili

> Adding an INLINE pragma is the right thing for alloca and similar
> functions. alloca is a small overloaded wrapper around
> allocaBytesAligned, and without the INLINE pragma the body of
> allocaBytesAligned gets inlined into alloca itself, making it too big to
> be inlined at the call site (you can work around it with e.g.
> -funfolding-use-threshold=100). This is really a case of manual
> worker/wrapper: we want to tell GHC that alloca is a wrapper, and the
> way to do that is with INLINE. Ideally GHC would manage this itself -
> there's a lot of scope for doing some general code splitting; I don't
> think anyone has explored that yet.
>
> Cheers,
> Simon
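The manual worker/wrapper pattern Simon describes can be sketched as follows. `myAlloca` is a hypothetical stand-in for the patched alloca, and the unsafePerformIO round-trip is only a convenience for checking the result (it has no observable side effects here).

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
module Main where

import Foreign.Marshal.Alloc (allocaBytesAligned)
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable, alignment, peek, poke, sizeOf)
import System.IO.Unsafe (unsafePerformIO)

-- The wrapper is tiny: with INLINE it is copied to every call site, where
-- the Storable dictionary is known statically, while the big worker
-- (allocaBytesAligned) stays out of line.
myAlloca :: forall a b. Storable a => (Ptr a -> IO b) -> IO b
myAlloca = allocaBytesAligned (sizeOf (undefined :: a))
                              (alignment (undefined :: a))
{-# INLINE myAlloca #-}

-- Round-trip check: poke a value into the scratch space and read it back.
roundTrip :: Int -> Int
roundTrip n = unsafePerformIO (myAlloca (\p -> poke p n >> peek p))

main :: IO ()
main = print (roundTrip 42)
```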
Re: FFI calls: is it possible to allocate a small memory block on a stack?
On 22/04/10 21:25, Denys Rtveliashvili wrote:
> Thank you, Simon
>
> I have identified a number of problems and have created patches for a
> couple of them. Ticket #4004 was raised in Trac, and I hope that someone
> will take a look and put the patches into the repository if they look
> good.
>
> Things I did:
> * Inlining for a few functions

Thanks - I already did this for alloca/malloc, I'll add the others from your patch.

> * Changed multiplication and division in include/Cmm.h to bit shifts

This really shouldn't be required, I'll look into why the optimisation isn't working.

> Things that can be done:
> * Optimisations in the threaded RTS. Locking is used frequently, and
>   every lock operation on a normal POSIX mutex costs about 20
>   nanoseconds on my computer.

We go to quite a lot of trouble to avoid locking in the common cases and fast paths - most of our data structures are CPU-local. Where in particular have you encountered locking that could be reduced?

> * Moving some computations from Cmm code to Haskell. This requires
>   passing information on word size and the like to Haskell code, but the
>   benefit is that some computations can be performed statically, as they
>   depend primarily on the data type we allocate space for.
> * A fix/improvement for the Cmm compiler. There is some code in it
>   already which substitutes bit shifts for divisions and multiplications
>   by 2^n, but for some reason it does not work. Also, divisions can be
>   replaced by multiplications with bit shifts in the general case.
>
> ---
>
> Also, while looking at this I've got a number of questions. One of them
> is this: what is the meaning of pinned_object_block in rts/sm/Storage.h,
> and why is it shared between TSOs? It looks like allocatePinned has to
> lock on SM_MUTEX every time it is called (in the threaded RTS) because
> other threads can be accessing it. More than that, this block of memory
> is assigned to the nursery of one of the TSOs. Why should it be shared
> with the rest of the world instead of being local to a TSO?
The pinned_object_block is CPU-local; usually no locking is required. Only when the block is full do we have to get a new block from the block allocator, and that requires a lock, but it's a rare case.

Cheers,
Simon

> On a side note, is the London HUG still active? The website seems to be
> down...
>
> With kind regards,
> Denys Rtveliashvili

>> Adding an INLINE pragma is the right thing for alloca and similar
>> functions. alloca is a small overloaded wrapper around
>> allocaBytesAligned, and without the INLINE pragma the body of
>> allocaBytesAligned gets inlined into alloca itself, making it too big
>> to be inlined at the call site (you can work around it with e.g.
>> -funfolding-use-threshold=100). This is really a case of manual
>> worker/wrapper: we want to tell GHC that alloca is a wrapper, and the
>> way to do that is with INLINE. Ideally GHC would manage this itself -
>> there's a lot of scope for doing some general code splitting; I don't
>> think anyone has explored that yet.
>>
>> Cheers,
>> Simon
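The strength-reduction discussed for the Cmm optimiser can be stated in plain Haskell. This is only a sketch of the arithmetic identity (the real rewrite happens on Cmm, not Haskell, and the helper names are illustrative):

```haskell
module Main where

import Data.Bits (shiftL, shiftR)

-- For non-negative values, division and multiplication by a power of two
-- are right and left shifts; this is the rewrite the Cmm optimiser is
-- supposed to perform on sizes and offsets.
divPow2, mulPow2 :: Int -> Int -> Int
divPow2 x n = x `shiftR` n   -- x `div` (2^n), for x >= 0
mulPow2 x n = x `shiftL` n   -- x * (2^n)

main :: IO ()
main = print (divPow2 40 3, mulPow2 5 4)
```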
Re: FFI calls: is it possible to allocate a small memory block on a stack?
Hi Simon,

> Thanks - I already did this for alloca/malloc, I'll add the others from
> your patch.

Thank you.

> We go to quite a lot of trouble to avoid locking in the common cases and
> fast paths - most of our data structures are CPU-local. Where in
> particular have you encountered locking that could be reduced?
>
> The pinned_object_block is CPU-local; usually no locking is required.
> Only when the block is full do we have to get a new block from the block
> allocator, and that requires a lock, but it's a rare case.

OK, the code I have checked out from the repository contains this in rts/sm/Storage.h:

    extern bdescr * pinned_object_block;

and in rts/sm/Storage.c:

    bdescr *pinned_object_block;

My C might be rusty, but I see no way for pinned_object_block to be CPU-local. If it is truly CPU-local, then what makes it so?

As for locking, here is one of the examples:

    StgPtr
    allocatePinned( lnat n )
    {
        StgPtr p;
        bdescr *bd = pinned_object_block;

        // If the request is for a large object, then allocate()
        // will give us a pinned object anyway.
        if (n >= LARGE_OBJECT_THRESHOLD/sizeof(W_)) {
            p = allocate(n);
            Bdescr(p)->flags |= BF_PINNED;
            return p;
        }

        ACQUIRE_SM_LOCK; // [RTVD: here we acquire the lock]
        TICK_ALLOC_HEAP_NOCTR(n);
        CCS_ALLOC(CCCS,n);

        // If we don't have a block of pinned objects yet, or the current
        // one isn't large enough to hold the new object, allocate a new one.
        if (bd == NULL || (bd->free + n) > (bd->start + BLOCK_SIZE_W)) {
            pinned_object_block = bd = allocBlock();
            dbl_link_onto(bd, &g0s0->large_objects);
            g0s0->n_large_blocks++;
            bd->gen_no = 0;
            bd->step   = g0s0;
            bd->flags  = BF_PINNED | BF_LARGE;
            bd->free   = bd->start;
            alloc_blocks++;
        }

        p = bd->free;
        bd->free += n;
        RELEASE_SM_LOCK; // [RTVD: here we release the lock]
        return p;
    }

Of course, TICK_ALLOC_HEAP_NOCTR and CCS_ALLOC may require synchronization if they use shared state (which is, again, probably unnecessary).
However, in case no profiling is going on and pinned_object_block is TSO-local, isn't it possible to remove locking completely from this code? The only case where locking will be necessary is when a fresh block has to be allocated, and that can be done within the allocBlock method (or, more precisely, by using allocBlock_lock).

The ACQUIRE_SM_LOCK/RELEASE_SM_LOCK pair is present in other places too, but I have not analysed yet whether it is really necessary there. For example, things like newCAF and newDynCAF are wrapped in it.

With kind regards,
Denys Rtveliashvili
Re: FFI calls: is it possible to allocate a small memory block on a stack?
On 18/04/2010 10:28, Denys Rtveliashvili wrote:
>> While alloca is not as cheap as, say, C's alloca, you should find that
>> it is much quicker than C's malloc. I'm sure there's room for
>> optimisation if it's critical for you. There may well be low-hanging
>> fruit: take a look at the Core for alloca.
>
> Thank you, Simon. Indeed, there is low-hanging fruit. alloca's type is
>
>     Storable a => (Ptr a -> IO b) -> IO b
>
> and it is not inlined even though the function is small. Calls to
> functions of such a signature are expensive (I suppose that's because of
> the lookup in the typeclass dictionary). However, when I added an INLINE
> pragma for the function in Foreign.Marshal.Alloc, the time of execution
> dropped from 40 to 20 nanoseconds. I guess the same effect will take
> place if other similar functions get marked with INLINE.
>
> Is there a reason why we would not want small FFI-related functions with
> typeclass arguments to be marked with an INLINE pragma, gaining a
> performance improvement? The only reason that comes to my mind is the
> size of code, but actually the resulting code looks very small and neat.

Adding an INLINE pragma is the right thing for alloca and similar functions. alloca is a small overloaded wrapper around allocaBytesAligned, and without the INLINE pragma the body of allocaBytesAligned gets inlined into alloca itself, making it too big to be inlined at the call site (you can work around it with e.g. -funfolding-use-threshold=100). This is really a case of manual worker/wrapper: we want to tell GHC that alloca is a wrapper, and the way to do that is with INLINE. Ideally GHC would manage this itself - there's a lot of scope for doing some general code splitting; I don't think anyone has explored that yet.

Cheers,
Simon
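A caller-side variant of the same idea, for the dictionary-lookup cost Denys measured: pin the Storable instance down with a monomorphic wrapper. The name `allocaInt` is hypothetical; the thread's actual fix was the INLINE pragma on alloca itself. The unsafePerformIO round-trip is only a convenience for checking the result.

```haskell
module Main where

import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek, poke)
import System.IO.Unsafe (unsafePerformIO)

-- A monomorphic, INLINE wrapper: the Int dictionary is resolved once,
-- here, instead of being passed at every polymorphic call to alloca.
allocaInt :: (Ptr Int -> IO b) -> IO b
allocaInt = alloca
{-# INLINE allocaInt #-}

-- Round-trip check (harmless unsafePerformIO: no visible effects).
roundTrip :: Int -> Int
roundTrip n = unsafePerformIO (allocaInt (\p -> poke p n >> peek p))

main :: IO ()
main = print (roundTrip 7)
```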
Re: FFI calls: is it possible to allocate a small memory block on a stack?
> While alloca is not as cheap as, say, C's alloca, you should find that
> it is much quicker than C's malloc. I'm sure there's room for
> optimisation if it's critical for you. There may well be low-hanging
> fruit: take a look at the Core for alloca.

Thank you, Simon. Indeed, there is low-hanging fruit. alloca's type is

    Storable a => (Ptr a -> IO b) -> IO b

and it is not inlined even though the function is small. Calls to functions of such a signature are expensive (I suppose that's because of the lookup in the typeclass dictionary). However, when I added an INLINE pragma for the function in Foreign.Marshal.Alloc, the time of execution dropped from 40 to 20 nanoseconds. I guess the same effect will take place if other similar functions get marked with INLINE.

Is there a reason why we would not want small FFI-related functions with typeclass arguments to be marked with an INLINE pragma, gaining a performance improvement? The only reason that comes to my mind is the size of code, but actually the resulting code looks very small and neat.

With kind regards,
Denys Rtveliashvili
Re: FFI calls: is it possible to allocate a small memory block on a stack?
On Apr 15, 2010, at 15:34 , Denys Rtveliashvili wrote:
> As for the performance of alloca, I thought it would be faster than
> malloc. However, in a simple test I have just written it is actually
> slower. The test allocates 16-byte arrays and immediately de-allocates
> them. This operation is repeated 10 times. On my computer the C program
> takes 27 seconds to complete while the Haskell version takes about 41.

Your C program is doing nothing but adjusting linked lists after the first time the block is allocated. free() puts the block at the head of a linked list of free blocks, and malloc() will find it there, unlink and return it.

-- 
brandon s. allbery  [solaris,freebsd,perl,pugs,haskell]  allb...@kf8nh.com
system administrator  [openafs,heimdal,too many hats]  allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university  KF8NH
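Brandon's point can be sketched from the Haskell side: mallocBytes/free call straight into C malloc/free, so this loop measures the same free-list churn as the C benchmark. The iteration count is arbitrary (chosen here, not taken from the thread).

```haskell
module Main where

import Control.Monad (replicateM_)
import Foreign.Marshal.Alloc (free, mallocBytes)

-- After the first iteration, free() puts the 16-byte block on the
-- allocator's free list and the next mallocBytes typically gets the same
-- block back: the loop measures list manipulation, not real allocation.
iterations :: Int
iterations = 100000

main :: IO ()
main = do
    replicateM_ iterations (mallocBytes 16 >>= free)
    putStrLn "done"
```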
Re: FFI calls: is it possible to allocate a small memory block on a stack?
> While alloca is not as cheap as, say, C's alloca, you should find that
> it is much quicker than C's malloc. I'm sure there's room for
> optimisation if it's critical for you. There may well be low-hanging
> fruit: take a look at the Core for alloca.
>
> The problem with using the stack is that alloca needs to allocate
> non-movable memory, and in GHC thread stacks are movable.
>
> Cheers,
> Simon

Thank you for the reply. I think I have had a few wrong assumptions. One of them is that the stack is non-movable. Of course, for this purpose I need a non-movable region, and a pinned array on the heap is probably the only choice. Also, I was hoping it would be possible to use the low-level stack (the one which is used when instructions such as push and pop are executed), but I guess that is not possible in the case of GHC-generated code.

As for the performance of alloca, I thought it would be faster than malloc. However, in a simple test I have just written it is actually slower. The test allocates 16-byte arrays and immediately de-allocates them. This operation is repeated 10 times. On my computer the C program takes 27 seconds to complete while the Haskell version takes about 41.

    #include <stdlib.h>

    int main (int argc, char **argv)
    {
        for (long i = 0; i < 10; i++) {
            free(malloc(16));
        }
    }

    module Main where

    import Control.Monad
    import Foreign.Storable
    import Foreign.Marshal.Alloc
    import Foreign.Ptr

    data Data = Data

    instance Storable Data where
        sizeOf    _   = 16
        alignment _   = 16
        peek      _   = return Data
        poke      _ _ = return ()

    main = sequence_ $ replicate 10 $
        alloca $ \ptr ->
            if (nullPtr :: Ptr Data) == ptr
                then fail "Can't be"
                else return ()

I would gladly take a look at the Core of alloca. But frankly, I am not sure how to tell GHC to show me that. With the help of -ddump-simpl and -fext-core I can make it show me the Core, but it does not have the body of alloca itself, just a call to it. And when I look at the C-- source with the help of -ddump-cmm, the source is transformed too much already to tell where alloca is.
Re: FFI calls: is it possible to allocate a small memory block on a stack?
On 14/04/10 06:02, Denys Rtveliashvili wrote:
> Good morning,
>
> Yesterday I did a few tests to measure the performance of FFI calls and
> found that the calls themselves are very quick (1-2 nanoseconds).
> However, there is a kind of FFI call where one has to allocate a
> temporary memory block (for a struct, or a temporary buffer). One
> example is a call to gettimeofday or clock_gettime. Unfortunately, the
> usual way of doing it (using the alloca function) is quite slow (~40
> nanoseconds).
>
> I was wondering, is there any way to allocate a small chunk of data on a
> thread's stack? That should be cheap, as by and large it is just a shift
> of a pointer and, perhaps, a few other trivial operations.
>
> I understand that it is not safe to allocate large blocks on the stack.
> But we could possibly have a function similar to alloca which would
> allocate small blocks on the stack and use alloca for big ones.

While alloca is not as cheap as, say, C's alloca, you should find that it is much quicker than C's malloc. I'm sure there's room for optimisation if it's critical for you. There may well be low-hanging fruit: take a look at the Core for alloca.

The problem with using the stack is that alloca needs to allocate non-movable memory, and in GHC thread stacks are movable.

Cheers,
Simon
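The non-movable options that do exist can be sketched as follows (a hedged example, not from the thread): besides alloca's bracket-scoped pinned memory, a pinned ForeignPtr gives non-movable heap memory whose lifetime the GC manages. The unsafePerformIO round-trip is only a convenience for checking the result.

```haskell
module Main where

import Foreign.ForeignPtr (mallocForeignPtrBytes, withForeignPtr)
import Foreign.Storable (peekByteOff, pokeByteOff)
import System.IO.Unsafe (unsafePerformIO)

-- GHC thread stacks are movable heap objects, so stack allocation is out;
-- a pinned ForeignPtr is the non-movable heap alternative to alloca.
roundTrip :: Int -> Int
roundTrip n = unsafePerformIO $ do
    fp <- mallocForeignPtrBytes 16            -- pinned; freed by the GC
    withForeignPtr fp $ \p -> do
        pokeByteOff p 0 n
        peekByteOff p 0

main :: IO ()
main = print (roundTrip 123)
```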
FFI calls: is it possible to allocate a small memory block on a stack?
Good morning,

Yesterday I did a few tests to measure the performance of FFI calls and found that the calls themselves are very quick (1-2 nanoseconds). However, there is a kind of FFI call where one has to allocate a temporary memory block (for a struct, or a temporary buffer). One example is a call to gettimeofday or clock_gettime. Unfortunately, the usual way of doing it (using the alloca function) is quite slow (~40 nanoseconds).

I was wondering, is there any way to allocate a small chunk of data on a thread's stack? That should be cheap, as by and large it is just a shift of a pointer and, perhaps, a few other trivial operations.

I understand that it is not safe to allocate large blocks on the stack. But we could possibly have a function similar to alloca which would allocate small blocks on the stack and use alloca for big ones.

With kind regards,
Denys Rtveliashvili
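The motivating example above, sketched with alloca-style allocation. Assumptions (labelled here because they are not in the thread): a 64-bit Linux layout where struct timeval is two 8-byte longs; real code should get the size and offsets from hsc2hs rather than hard-coding them.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main where

import Foreign.C.Types (CInt (..), CLong)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr, nullPtr)
import Foreign.Storable (peekByteOff)
import System.IO.Unsafe (unsafePerformIO)

-- gettimeofday needs a temporary struct timeval; this scratch space is
-- exactly the short-lived allocation the thread is about.
foreign import ccall unsafe "gettimeofday"
    c_gettimeofday :: Ptr () -> Ptr () -> IO CInt

getTimeOfDay :: IO (CLong, CLong)
getTimeOfDay = allocaBytes 16 $ \tv -> do     -- assumed sizeof(struct timeval)
    _ <- c_gettimeofday tv nullPtr
    sec  <- peekByteOff tv 0                  -- tv_sec  at assumed offset 0
    usec <- peekByteOff tv 8                  -- tv_usec at assumed offset 8
    return (sec, usec)

-- Sanity check: seconds since the epoch must be positive.
sanityCheck :: Bool
sanityCheck = unsafePerformIO (fmap ((> 0) . fst) getTimeOfDay)

main :: IO ()
main = getTimeOfDay >>= print
```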