Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-27 Thread Simon Marlow

On 23/04/2010 19:03, Denys Rtveliashvili wrote:


Tue Dec  1 16:03:21 GMT 2009  Simon Marlow marlo...@gmail.com
  * Make allocatePinned use local storage, and other refactorings


The version I have checked out is 6.12 and that's why I haven't seen
this patch.
Are there any plans for including this patch in the next GHC release?


It'll be in the next major release (6.14.1).


Right, but these are not common cases that need to be optimised.  newCAF
is only called once per CAF, thereafter it is accessed without locks.

Can't recall off the top of my head, but I think I had a case where
newCAF was used very actively in a simple piece of code. The code looked
like this:

sequence_ $ replicate N $ doSmth

The Cmm code showed that it produced calls to newCAF and something
related to black holes.


Right, but newCAF should only be called once for any given CAF, 
thereafter the CAF will have been updated.



And when I added return () after that line,
the black holes and new calls to newCAF disappeared. It was on
6.12.1, I believe. I still have no idea why it happened and why these
black holes were necessary, but I'll try to reproduce it one more time
and show you an example if it is of any interest to you.


If you find a case where newCAF is being called repeatedly, that would 
be interesting, yes.



It may be that we could find benchmarks where access to the block
allocator is the performance bottleneck, indeed in the parallel GC we
sometimes see contention for it.  If that turns out to be a problem then
we may need to think about per-CPU free lists in the block allocator,
but I think it would entail a fair bit of complexity and if we're not
careful extra memory overhead, e.g. where one CPU has all the free
blocks in its local free list and the others have none.  So I'd like to
avoid going down that route unless we absolutely have to.  The block
allocator is nice and simple right now.


I suppose I should check out the HEAD then and give it a try, because
earlier I had performance issues in the threaded runtime (~20% of
overhead and far more noise) in an application which was doing some
slicing, reshuffling and composing text via ByteStrings with a modest
amount of passing data around via Chans.


I'd be interested in seeing a program that has 20% overhead with 
-threaded.  You should watch out for bound threads though: with 
-threaded the main thread is a bound thread, and communication with the 
main thread is much slower than between unbound threads. See


http://www.haskell.org/ghc/docs/latest/html/libraries/base-4.2.0.1/Control-Concurrent.html#8


On a slightly different topic: please could you point me to a place
where stg_upd_frame_info is generated? I can't find it in *.c, *.cmm or
*.hs and guess it is something very special.


rts/Updates.cmm:

INFO_TABLE_RET( stg_upd_frame, UPDATE_FRAME, UPD_FRAME_PARAMS)
{
...
}


Cheers,
Simon

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-24 Thread Ian Lynagh
On Fri, Apr 23, 2010 at 07:07:29PM +0100, Denys Rtveliashvili wrote:
 
 It appears that I am on 6.12.
 Strange, as I thought I had checked out the HEAD by following the
 instructions on the wiki:
 
 darcs get --partial http://darcs.haskell.org/ghc
 
 The wiki does not tell explicitly what will be checked out, so I
 expected it to be HEAD.

That will give you the HEAD. If you check configure.ac, it will include
this line:
AC_INIT([The Glorious Glasgow Haskell Compilation System], [6.13], 
[glasgow-haskell-b...@haskell.org], [ghc])


Thanks
Ian



Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-23 Thread Simon Marlow

On 23/04/2010 04:39, Denys Rtveliashvili wrote:


OK, the code I have checked out from the repository contains this in
rts/sm/Storage.h:

extern bdescr * pinned_object_block;


And in rts/sm/Storage.c:

bdescr *pinned_object_block;


Ah, I was looking in the HEAD, where I've already fixed this by moving 
pinned_object_block into the Capability and hence making it CPU-local. 
The patch that fixed it was


Tue Dec  1 16:03:21 GMT 2009  Simon Marlow marlo...@gmail.com
  * Make allocatePinned use local storage, and other refactorings


As for locking, here is one of the examples:

StgPtr
allocatePinned( lnat n )
{
    StgPtr p;
    bdescr *bd = pinned_object_block;

    // If the request is for a large object, then allocate()
    // will give us a pinned object anyway.
    if (n >= LARGE_OBJECT_THRESHOLD/sizeof(W_)) {
        p = allocate(n);
        Bdescr(p)->flags |= BF_PINNED;
        return p;
    }

    ACQUIRE_SM_LOCK; // [RTVD: here we acquire the lock]

    TICK_ALLOC_HEAP_NOCTR(n);
    CCS_ALLOC(CCCS,n);

    // If we don't have a block of pinned objects yet, or the current
    // one isn't large enough to hold the new object, allocate a new one.
    if (bd == NULL || (bd->free + n) > (bd->start + BLOCK_SIZE_W)) {
        pinned_object_block = bd = allocBlock();
        dbl_link_onto(bd, &g0s0->large_objects);
        g0s0->n_large_blocks++;
        bd->gen_no = 0;
        bd->step   = g0s0;
        bd->flags  = BF_PINNED | BF_LARGE;
        bd->free   = bd->start;
        alloc_blocks++;
    }

    p = bd->free;
    bd->free += n;
    RELEASE_SM_LOCK; // [RTVD: here we release the lock]
    return p;
}


Yes, this was also fixed by the aforementioned patch.

Bear in mind that in the vast majority of programs allocatePinned is not 
in the inner loop, which is why it hasn't been a priority to optimise it 
until now.



Of course, TICK_ALLOC_HEAP_NOCTR and CCS_ALLOC may require
synchronization if they use shared state (which is, again, probably
unnecessary). However, in case no profiling goes on and
pinned_object_block is TSO-local, isn't it possible to remove
locking completely from this code? The only case when locking will
be necessary is when a fresh block has to be allocated, and that can
be done within the allocBlock method (or, more precisely, by using
allocBlock_lock).


TSO-local would be bad: TSOs are lightweight threads and in many cases 
are smaller than a block.  Capability-local is what you want.



ACQUIRE_SM_LOCK/RELEASE_SM_LOCK pair is present in other places too,
but I have not analysed yet if it is really necessary there. For
example, things like newCAF and newDynCAF are wrapped into it.


Right, but these are not common cases that need to be optimised.  newCAF 
is only called once per CAF, thereafter it is accessed without locks.


It may be that we could find benchmarks where access to the block 
allocator is the performance bottleneck, indeed in the parallel GC we 
sometimes see contention for it.  If that turns out to be a problem then 
we may need to think about per-CPU free lists in the block allocator, 
but I think it would entail a fair bit of complexity and if we're not 
careful extra memory overhead, e.g. where one CPU has all the free 
blocks in its local free list and the others have none.  So I'd like to 
avoid going down that route unless we absolutely have to.  The block 
allocator is nice and simple right now.


Cheers,
Simon


Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-23 Thread Denys Rtveliashvili
Hi Simon,

  OK, the code I have checked out from the repository contains this in
  rts/sm/Storage.h:
 
  extern bdescr * pinned_object_block;
 
 
  And in rts/sm/Storage.c:
 
  bdescr *pinned_object_block;
 
 Ah, I was looking in the HEAD, where I've already fixed this by moving 
 pinned_object_block into the Capability and hence making it CPU-local. 
 The patch that fixed it was
 
 Tue Dec  1 16:03:21 GMT 2009  Simon Marlow marlo...@gmail.com
* Make allocatePinned use local storage, and other refactorings


The version I have checked out is 6.12 and that's why I haven't seen
this patch.
Are there any plans for including this patch in the next GHC release?


 Yes, this was also fixed by the aforementioned patch.
 
 Bear in mind that in the vast majority of programs allocatePinned is not 
 in the inner loop, which is why it hasn't been a priority to optimise it 
 until now.

I guess code which makes use of ByteStrings (especially when it
splits them into many smaller substrings) calls allocatePinned very
frequently, even within inner loops.


 TSO-local would be bad: TSOs are lightweight threads and in many cases 
 are smaller than a block.  Capability-local is what you want.

Ah... Yes, capabilities are a far better choice.


 Right, but these are not common cases that need to be optimised.  newCAF 
 is only called once per CAF, thereafter it is accessed without locks.

Can't recall off the top of my head, but I think I had a case where
newCAF was used very actively in a simple piece of code. The code looked
like this:

sequence_ $ replicate N $ doSmth

The Cmm code showed that it produced calls to newCAF and something
related to black holes. And when I added return () after that line,
the black holes and new calls to newCAF disappeared. It was on
6.12.1, I believe. I still have no idea why it happened and why these
black holes were necessary, but I'll try to reproduce it one more time
and show you an example if it is of any interest to you.


 It may be that we could find benchmarks where access to the block 
 allocator is the performance bottleneck, indeed in the parallel GC we 
 sometimes see contention for it.  If that turns out to be a problem then 
 we may need to think about per-CPU free lists in the block allocator, 
 but I think it would entail a fair bit of complexity and if we're not 
 careful extra memory overhead, e.g. where one CPU has all the free 
 blocks in its local free list and the others have none.  So I'd like to 
 avoid going down that route unless we absolutely have to.  The block 
 allocator is nice and simple right now.


I suppose I should check out the HEAD then and give it a try, because
earlier I had performance issues in the threaded runtime (~20%
overhead and far more noise) in an application which was doing some
slicing, reshuffling and composing text via ByteStrings with a modest
amount of passing data around via Chans.

On a slightly different topic: please could you point me to a place
where stg_upd_frame_info is generated? I can't find it in *.c, *.cmm or
*.hs and guess it is something very special.

With kind regards,
Denys Rtveliashvili


Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-23 Thread Denys Rtveliashvili
Bertram,

It appears that I am on 6.12.
Strange, as I thought I had checked out the HEAD by following the
instructions on the wiki:


darcs get --partial http://darcs.haskell.org/ghc


The wiki does not tell explicitly what will be checked out, so I
expected it to be HEAD.

With kind regards,
Denys Rtveliashvili


 Denys Rtveliashvili wrote:
  OK, the code I have checked out from the repository contains this in
  rts/sm/Storage.h:
 [global pinned_object_block variable]
 
 Odd. This was changed in ghc head by a patch dating Dec 1st 2009:
 
 Tue Dec  1 17:03:21 CET 2009  Simon Marlow marlo...@gmail.com
   * Make allocatePinned use local storage, and other refactorings
   
   This is a batch of refactoring to remove some of the GC's global
   state, as we move towards CPU-local GC.  
 ...
 - allocatePinned() was still allocating from global storage and
   taking a lock each time, now it uses local storage. 
   (mallocForeignPtrBytes should be faster with -threaded).
 ...
 
 which turned pinned_object_block into a per-capability variable.
 
 So which version of ghc are you looking at?
 
 regards,
 
 Bertram


Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-22 Thread Denys Rtveliashvili
Thank you, Simon

I have identified a number of problems and have created patches for a
couple of them. Ticket #4004 was raised in Trac and I hope that
someone will take a look and put the patches into the repository if they
look good.

Things I did:
* Inlining for a few functions
* changed multiplication and division in include/Cmm.h to bit shifts

Things that can be done:
* optimizations in the threaded RTS. Locking is used frequently, and
each lock/unlock of a normal mutex in POSIX threads costs about 20
nanoseconds on my computer.
* moving some computations from Cmm code to Haskell. This requires
passing information on word size and things like that to Haskell
code, but the benefit is that some computations can be performed
statically as they depend primarily on the data type we allocate space
for.
* fix/improvement for the Cmm compiler. There is some code in it already
which replaces divisions and multiplications by 2^n with bit shifts,
but for some reason it does not work. Also, divisions can be replaced by
multiplications with bit shifts in the general case.

---

Also, while looking at this thing I've got a number of questions. One of
them is this:

What is the meaning of pinned_object_block in rts/sm/Storage.h and why
is it shared between TSOs? It looks like allocatePinned has to lock on
SM_MUTEX every time it is called (in threaded RTS) because other threads
can be accessing it. More than that, this block of memory is assigned to
a nursery of one of the TSOs. Why should it be shared with the rest of
the world then instead of being local to TSO?


On the side note, is London HUG still active? The website seems to be
down...


With kind regards,
Denys Rtveliashvili


 Adding an INLINE pragma is the right thing for alloca and similar functions.
 
 alloca is a small overloaded wrapper around allocaBytesAligned, and 
 without the INLINE pragma the body of allocaBytesAligned gets inlined 
 into alloca itself, making it too big to be inlined at the call site 
 (you can work around it with e.g. -funfolding-use-threshold=100).  This 
 is really a case of manual worker/wrapper: we want to tell GHC that 
 alloca is a wrapper, and the way to do that is with INLINE.  Ideally GHC 
 would manage this itself - there's a lot of scope for doing some general 
 code splitting, I don't think anyone has explored that yet.
 
 Cheers,
   Simon


Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-22 Thread Simon Marlow

On 22/04/10 21:25, Denys Rtveliashvili wrote:

Thank you, Simon

I have identified a number of problems and have created patches for a
couple of them. Ticket #4004 was raised in Trac and I hope that
someone will take a look and put the patches into the repository if they
look good.

Things I did:
* Inlining for a few functions


Thanks - I already did this for alloca/malloc, I'll add the others from 
your patch.



* changed multiplication and division in include/Cmm.h to bit shifts


This really shouldn't be required, I'll look into why the optimisation 
isn't working.



Things that can be done:
* optimizations in the threaded RTS. Locking is used frequently, and
each lock/unlock of a normal mutex in POSIX threads costs about 20
nanoseconds on my computer.


We go to quite a lot of trouble to avoid locking in the common cases and 
fast paths - most of our data structures are CPU-local.  Where in 
particular have you encountered locking that could be reduced?



* moving some computations from Cmm code to Haskell. This requires
passing information on word size and things like that to Haskell
code, but the benefit is that some computations can be performed
statically as they depend primarily on the data type we allocate space for.
* fix/improvement for the Cmm compiler. There is some code in it already
which replaces divisions and multiplications by 2^n with bit shifts,
but for some reason it does not work. Also, divisions can be replaced by
multiplications with bit shifts in the general case.

---

Also, while looking at this thing I've got a number of questions. One of
them is this:

What is the meaning of pinned_object_block in rts/sm/Storage.h and why
is it shared between TSOs? It looks like allocatePinned has to lock on
SM_MUTEX every time it is called (in threaded RTS) because other threads
can be accessing it. More than that, this block of memory is assigned to
a nursery of one of the TSOs. Why should it be shared with the rest of
the world then instead of being local to TSO?


The pinned_object_block is CPU-local, usually no locking is required. 
Only when the block is full do we have to get a new block from the block 
allocator, and that requires a lock, but it's a rare case.


Cheers,
Simon


On the side note, is London HUG still active? The website seems to be
down...


With kind regards,
Denys Rtveliashvili


Adding an INLINE pragma is the right thing for alloca and similar functions.

alloca is a small overloaded wrapper around allocaBytesAligned, and
without the INLINE pragma the body of allocaBytesAligned gets inlined
into alloca itself, making it too big to be inlined at the call site
(you can work around it with e.g. -funfolding-use-threshold=100).  This
is really a case of manual worker/wrapper: we want to tell GHC that
alloca is a wrapper, and the way to do that is with INLINE.  Ideally GHC
would manage this itself - there's a lot of scope for doing some general
code splitting, I don't think anyone has explored that yet.

Cheers,
Simon




Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-22 Thread Denys Rtveliashvili
Hi Simon,


 Thanks - I already did this for alloca/malloc, I'll add the others from 
 your patch.


Thank you.


 We go to quite a lot of trouble to avoid locking in the common cases and 
 fast paths - most of our data structures are CPU-local.  Where in 
 particular have you encountered locking that could be reduced?



 The pinned_object_block is CPU-local, usually no locking is required. 
 Only when the block is full do we have to get a new block from the block 
 allocator, and that requires a lock, but it's a rare case.


OK, the code I have checked out from the repository contains this in
rts/sm/Storage.h:


extern bdescr * pinned_object_block;


And in rts/sm/Storage.c:


bdescr *pinned_object_block;


My C might be rusty, but I see no way for pinned_object_block to be CPU
local. If it is truly CPU local, then what makes it so?

As for locking, here is one of the examples:


StgPtr
allocatePinned( lnat n )
{
    StgPtr p;
    bdescr *bd = pinned_object_block;

    // If the request is for a large object, then allocate()
    // will give us a pinned object anyway.
    if (n >= LARGE_OBJECT_THRESHOLD/sizeof(W_)) {
        p = allocate(n);
        Bdescr(p)->flags |= BF_PINNED;
        return p;
    }

    ACQUIRE_SM_LOCK; // [RTVD: here we acquire the lock]

    TICK_ALLOC_HEAP_NOCTR(n);
    CCS_ALLOC(CCCS,n);

    // If we don't have a block of pinned objects yet, or the current
    // one isn't large enough to hold the new object, allocate a new one.
    if (bd == NULL || (bd->free + n) > (bd->start + BLOCK_SIZE_W)) {
        pinned_object_block = bd = allocBlock();
        dbl_link_onto(bd, &g0s0->large_objects);
        g0s0->n_large_blocks++;
        bd->gen_no = 0;
        bd->step   = g0s0;
        bd->flags  = BF_PINNED | BF_LARGE;
        bd->free   = bd->start;
        alloc_blocks++;
    }

    p = bd->free;
    bd->free += n;
    RELEASE_SM_LOCK; // [RTVD: here we release the lock]
    return p;
}

Of course, TICK_ALLOC_HEAP_NOCTR and CCS_ALLOC may require
synchronization if they use shared state (which is, again, probably
unnecessary). However, in case no profiling goes on and
pinned_object_block is TSO-local, isn't it possible to remove locking
completely from this code? The only case when locking will be necessary
is when a fresh block has to be allocated, and that can be done within
the allocBlock method (or, more precisely, by using allocBlock_lock).

ACQUIRE_SM_LOCK/RELEASE_SM_LOCK pair is present in other places too,
but I have not analysed yet if it is really necessary there. For
example, things like newCAF and newDynCAF are wrapped into it.

With kind regards,
Denys Rtveliashvili


Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-19 Thread Simon Marlow

On 18/04/2010 10:28, Denys Rtveliashvili wrote:



While alloca is not as cheap as, say, C's alloca, you should find that
it is much quicker than C's malloc.  I'm sure there's room for
optimisation if it's critical for you.  There may well be low-hanging
fruit: take a look at the Core for alloca.

Thank you, Simon.

Indeed, there is a low-hanging fruit.

alloca's type is Storable a => (Ptr a -> IO b) -> IO b and it is not
inlined even though the function is small. And calls to functions with
such a signature are expensive (I suppose that's because of the look-up
in the typeclass dictionary). However, when I added an INLINE pragma for
the function in Foreign.Marshal.Alloc the time of execution dropped from
40 to 20 nanoseconds. I guess the same effect will take place if other
similar functions get marked with INLINE.

Is there a reason why we do not want small FFI-related functions with
typeclass arguments to be marked with an INLINE pragma, gaining a
performance improvement?
The only reason that comes to my mind is the size of code, but actually
the resulting code looks very small and neat.


Adding an INLINE pragma is the right thing for alloca and similar functions.

alloca is a small overloaded wrapper around allocaBytesAligned, and 
without the INLINE pragma the body of allocaBytesAligned gets inlined 
into alloca itself, making it too big to be inlined at the call site 
(you can work around it with e.g. -funfolding-use-threshold=100).  This 
is really a case of manual worker/wrapper: we want to tell GHC that 
alloca is a wrapper, and the way to do that is with INLINE.  Ideally GHC 
would manage this itself - there's a lot of scope for doing some general 
code splitting, I don't think anyone has explored that yet.


Cheers,
Simon


Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-18 Thread Denys Rtveliashvili


 While alloca is not as cheap as, say, C's alloca, you should find that 
 it is much quicker than C's malloc.  I'm sure there's room for 
 optimisation if it's critical for you.  There may well be low-hanging 
 fruit: take a look at the Core for alloca.

Thank you, Simon.

Indeed, there is a low-hanging fruit.

alloca's type is Storable a => (Ptr a -> IO b) -> IO b and it is not
inlined even though the function is small. And calls to functions with
such a signature are expensive (I suppose that's because of the look-up
in the typeclass dictionary). However, when I added an INLINE pragma for
the function in Foreign.Marshal.Alloc the time of execution dropped from
40 to 20 nanoseconds. I guess the same effect will take place if other
similar functions get marked with INLINE.

Is there a reason why we do not want small FFI-related functions with
typeclass arguments to be marked with an INLINE pragma, gaining a
performance improvement?
The only reason that comes to my mind is the size of code, but actually
the resulting code looks very small and neat.

With kind regards,
Denys Rtveliashvili


Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-17 Thread Brandon S. Allbery KF8NH

On Apr 15, 2010, at 15:34 , Denys Rtveliashvili wrote:
As for the performance of alloca, I thought it would be faster than
malloc. However, in a simple test I have just written it is
actually slower. The test allocates 16-byte arrays and immediately
de-allocates them. This operation is repeated 10 times. On
my computer the C program takes 27 seconds to complete while the Haskell
version takes about 41.



Your C program is doing nothing but adjusting linked lists after the  
first time the block is allocated.  free() puts the block at the head  
of a linked list of free blocks, and malloc() will find it there,  
unlink and return it.


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university   KF8NH






Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-15 Thread Denys Rtveliashvili
 While alloca is not as cheap as, say, C's alloca, you should find that 
 it is much quicker than C's malloc.  I'm sure there's room for 
 optimisation if it's critical for you.  There may well be low-hanging 
 fruit: take a look at the Core for alloca.
 
 The problem with using the stack is that alloca needs to allocate 
 non-movable memory, and in GHC thread stacks are movable.
 
 Cheers,
   Simon


Thank you for reply.

I think I have had a few wrong assumptions. One of them is that the stack
is non-movable. Of course, for this purpose I need a non-movable region,
and a pinned array on the heap is probably the only choice.
Also, I was hoping it was possible to use the low-level stack (the one
which is used when instructions such as push and pop are executed), but I
guess that is not possible in the case of GHC-generated code.

As for the performance of alloca, I thought it would be faster than
malloc. However, in a simple test I have just written it is actually
slower. The test allocates 16-byte arrays and immediately de-allocates
them. This operation is repeated 10 times. On my computer the C
program takes 27 seconds to complete while the Haskell version takes
about 41.


#include <stdlib.h>

int main (int argc, char **argv) {
    for(long i = 0; i < 10; i ++) {
        free(malloc(16));
    }
}

module Main where

import Control.Monad
import Foreign.Storable
import Foreign.Marshal.Alloc
import Foreign.Ptr

data Data = Data
instance Storable Data where
  sizeOf _ = 16
  alignment _ = 16
  peek _ = return Data
  poke _ _ = return ()

main = sequence_ $ replicate 10 $ alloca $ \ptr ->
  if (nullPtr::Ptr Data) == ptr then fail "Can't be" else return ()


I would gladly take a look at the Core of alloca, but frankly I am
not sure how to tell GHC to show me that. With the help of -ddump-simpl
and -fext-core I can make it show me the Core, but it does not have the
body of alloca itself, just a call to it. And when I look at the C--
source with the help of -ddump-cmm, the source is already transformed
too much to tell where alloca is.


Re: FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-14 Thread Simon Marlow

On 14/04/10 06:02, Denys Rtveliashvili wrote:

Good morning,

Yesterday I did a few tests to measure the performance of FFI calls and
found that the calls themselves are very quick (1-2 nanoseconds).
However, there is a kind of FFI call where one has to allocate a
temporary memory block (for a struct, or a temporary buffer). One
example is a call to gettimeofday or clock_gettime. Unfortunately,
the usual way of doing it (using the alloca function) is quite slow
(~40 nanoseconds).

I was wondering, is there any way to allocate a small chunk of data on a
thread's stack?
That should be cheap, as by and large it is just a shift of a pointer
and, perhaps, a few other trivial operations.
I understand that it is not safe to allocate large blocks on the stack.
But we could possibly have a function similar to alloca which would
allocate small blocks on the stack and would use the existing alloca for
big ones.


While alloca is not as cheap as, say, C's alloca, you should find that 
it is much quicker than C's malloc.  I'm sure there's room for 
optimisation if it's critical for you.  There may well be low-hanging 
fruit: take a look at the Core for alloca.


The problem with using the stack is that alloca needs to allocate 
non-movable memory, and in GHC thread stacks are movable.


Cheers,
Simon


FFI calls: is it possible to allocate a small memory block on a stack?

2010-04-13 Thread Denys Rtveliashvili
Good morning,

Yesterday I did a few tests to measure the performance of FFI calls and
found that the calls themselves are very quick (1-2 nanoseconds).
However, there is a kind of FFI call where one has to allocate a
temporary memory block (for a struct, or a temporary buffer). One
example is a call to gettimeofday or clock_gettime. Unfortunately,
the usual way of doing it (using the alloca function) is quite slow
(~40 nanoseconds).

I was wondering, is there any way to allocate a small chunk of data on a
thread's stack?
That should be cheap, as by and large it is just a shift of a pointer
and, perhaps, a few other trivial operations.
I understand that it is not safe to allocate large blocks on the stack.
But we could possibly have a function similar to alloca which would
allocate small blocks on the stack and would use the existing alloca for
big ones.

With kind regards,
Denys Rtveliashvili