On 22/04/10 21:25, Denys Rtveliashvili wrote:
Thank you, Simon

I have identified a number of problems and have created patches for a
couple of them. I raised ticket #4004 in Trac, and I hope that someone
will take a look and commit the patches to the repository if they look
good.

Things I did:
* Inlining for a few functions

Thanks - I already did this for alloca/malloc, I'll add the others from your patch.

* changed multiplication and division in include/Cmm.h to bit shifts

This really shouldn't be required, I'll look into why the optimisation isn't working.
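
For reference, the equivalence the patch relies on can be sketched in plain C. This is only an illustration for unsigned 64-bit words; mul_pow2 and div_pow2 are hypothetical names and are not taken from include/Cmm.h. For signed values the division case differs, because an arithmetic right shift rounds towards negative infinity while C division truncates towards zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of include/Cmm.h. */

/* x * 2^n for an unsigned word (wraps modulo 2^64, like the multiply). */
static inline uint64_t mul_pow2(uint64_t x, unsigned n)
{
    assert(n < 64);   /* shifting by >= the word width is undefined in C */
    return x << n;
}

/* x / 2^n for an unsigned word. */
static inline uint64_t div_pow2(uint64_t x, unsigned n)
{
    assert(n < 64);
    return x >> n;
}
```

An optimising C compiler normally performs this rewrite by itself when the multiplier or divisor is a compile-time constant, which is the behaviour expected from the Cmm optimisation as well.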

Things that can be done:
* optimizations in the threaded RTS. Locking is used frequently, and
each lock operation on a normal mutex in "POSIX threads" costs about 20
nanoseconds on my computer.

We go to quite a lot of trouble to avoid locking in the common cases and fast paths - most of our data structures are CPU-local. Where in particular have you encountered locking that could be reduced?

* moving some computations from Cmm code to Haskell. This requires
passing information about the word size and similar details to Haskell
code, but the benefit is that some computations can be performed
statically, as they depend primarily on the data type we allocate space for.
* a fix/improvement for the Cmm compiler. It already contains some code
that replaces divisions and multiplications by 2^n with bit shifts,
but for some reason it does not work. Also, divisions can be replaced by
multiplications combined with bit shifts in the general case.
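
The last point, replacing a division with a multiplication and a shift, can be sketched in C for one well-known case: unsigned 32-bit division by the constant 10. The function name div10 is hypothetical, and this is not what the Cmm compiler does today; it only illustrates the technique. The multiplier 0xCCCCCCCD is ceil(2^35 / 10), and the 64-bit product shifted right by 35 equals x / 10 for every 32-bit x.

```c
#include <stdint.h>

/* Hypothetical illustration: x / 10 without a division instruction.
 * 0xCCCCCCCD == ceil(2^35 / 10); the product fits in 64 bits. */
static inline uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

Other divisors need their own multiplier and shift amount, and some need an extra correction step, so a compiler has to compute these constants per divisor.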

---

Also, while looking into this I came up with a number of questions. One
of them is this:

What is the meaning of "pinned_object_block" in rts/sm/Storage.h and why
is it shared between TSOs? It looks like "allocatePinned" has to lock
SM_MUTEX every time it is called (in the threaded RTS) because other
threads can be accessing it. Moreover, this block of memory is assigned
to the nursery of one of the TSOs. Why should it be shared with the rest
of the world instead of being local to the TSO?

The pinned_object_block is CPU-local, so usually no locking is required. Only when the block is full do we have to get a new block from the block allocator, and that requires a lock, but it's a rare case.
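
The general shape of that fast path / slow path split can be sketched in C as follows. This is not the RTS code: LocalAllocator, alloc_local, get_block_locked, sm_mutex and BLOCK_SIZE are all hypothetical stand-ins, malloc plays the role of the block allocator, and alignment and reclaiming of full blocks are ignored.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 4096   /* assumed block size, for illustration only */

/* Stand-in for the shared block allocator, guarded by a single mutex. */
static pthread_mutex_t sm_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned long slow_path_count = 0;   /* how often the lock was taken */

static char *get_block_locked(void)
{
    pthread_mutex_lock(&sm_mutex);
    char *b = malloc(BLOCK_SIZE);
    slow_path_count++;
    pthread_mutex_unlock(&sm_mutex);
    return b;
}

/* Per-CPU state: only the owning thread touches it, so no lock is needed. */
typedef struct {
    char  *block;   /* current block, or NULL before the first allocation */
    size_t used;    /* bytes already handed out from the current block */
} LocalAllocator;

static void *alloc_local(LocalAllocator *a, size_t n)
{
    assert(n <= BLOCK_SIZE);
    if (a->block == NULL || a->used + n > BLOCK_SIZE) {
        /* Slow path (rare): the block is full, fetch a new one under the
         * lock. The old block is simply dropped in this sketch. */
        a->block = get_block_locked();
        if (a->block == NULL) return NULL;
        a->used = 0;
    }
    /* Fast path (common): bump the local pointer, no locking at all. */
    void *p = a->block + a->used;
    a->used += n;
    return p;
}
```

With 64-byte requests and a 4096-byte block, the lock is taken once per 64 allocations in this sketch; every other call only touches CPU-local state.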

Cheers,
        Simon

On a side note, is the London HUG still active? The website seems to be
down...


With kind regards,
Denys Rtveliashvili

Adding an INLINE pragma is the right thing for alloca and similar functions.

alloca is a small overloaded wrapper around allocaBytesAligned, and
without the INLINE pragma the body of allocaBytesAligned gets inlined
into alloca itself, making it too big to be inlined at the call site
(you can work around it with e.g. -funfolding-use-threshold=100).  This
is really a case of manual worker/wrapper: we want to tell GHC that
alloca is a wrapper, and the way to do that is with INLINE.  Ideally GHC
would manage this itself - there's a lot of scope for doing some general
code splitting, and I don't think anyone has explored that yet.

Cheers,
        Simon
