Hi,

In the thread 
http://archives.postgresql.org/message-id/4da69d60.4000...@2ndquadrant.com /
"Single client performance on trivial SELECTs" I wondered whether it might be
possible to increase the performance of the current allocator by using some
slab allocator like concept.

While avoiding doing my tax returns and having a long and delayed train ride
I started to play around.

The profile I used in this case was:

pgbench -h /tmp/ -p5433 -s 30 pgbench -S -T 20

I hacked together a patch which records every allocation to a file. That patch
obviously slows everything down quite a bit so that I only get about a third
of the normal tps.

The current profile for that looks like that (hope thats recognizable). I
guess the only unclear column is "compile_time". Thats the result of
__builtin_constant_p(size) for the size argument of the various allocation
operations which is true for the case where size is known at compile time.


           action          |       file       |                 func            
     | line  | size  | compile_time | count  
 
--------------------------+------------------+--------------------------------------+-------+-------+--------------+--------
  chunk_alloc              | list.c           | new_list                        
     |    68 |    16 | t            | 940753
  chunk_alloc              | list.c           | new_list                        
     |    72 |    24 | t            | 940753
  chunk_alloc              | list.c           | new_tail_cell                   
     |   112 |    16 | t            | 202908
  chunk_alloc_zero         | bitmapset.c      | bms_make_singleton              
     |   189 |     8 | f            | 129122
  chunk_alloc_zero_aligned | value.c          | makeString                      
     |    55 |    16 | t            | 129122
  chunk_alloc_zero_aligned | makefuncs.c      | makeVar                         
     |    73 |    40 | t            | 110676
  chunk_alloc_zero_aligned | makefuncs.c      | makeTargetEntry                 
     |   216 |    48 | t            |  92230
  chunk_alloc              | stringinfo.c     | initStringInfo                  
     |    50 |  1024 | t            |  73799
  chunk_alloc              | bitmapset.c      | bms_copy                        
     |   119 |     8 | f            |  73784
  chunk_alloc              | lock.c           | LockAcquireExtended             
     |   559 |   128 | t            |  55441
  chunk_alloc              | nodeFuncs.c      | expression_tree_mutator         
     |  2036 |    40 | t            |  55338
  chunk_alloc              | snapmgr.c        | PushActiveSnapshot              
     |   282 |    24 | t            |  55338
  chunk_alloc              | nbtsearch.c      | _bt_search                      
     |   121 |    24 | t            |  36908
  chunk_alloc              | resowner.c       | 
ResourceOwnerEnlargeCatCacheRefs     |   612 |   128 | t            |  36893
  chunk_alloc              | resowner.c       | 
ResourceOwnerEnlargeRelationRefs     |   754 |   128 | t            |  36893
  chunk_alloc_zero         | resowner.c       | ResourceOwnerCreate             
     |   135 |   168 | t            |  36893
  chunk_alloc              | nodeFuncs.c      | expression_tree_mutator         
     |  2027 |    40 | t            |  36892
  chunk_alloc              | snapmgr.c        | CopySnapshot                    
     |   217 |    72 | f            |  36892
  chunk_alloc_zero_aligned | copyfuncs.c      | _copyVar                        
     |  1041 |    40 | t            |  36892
  chunk_alloc_zero_aligned | equivclass.c     | add_eq_member                   
     |   452 |    32 | t            |  36892
  chunk_alloc_zero_aligned | execQual.c       | ExecInitExpr                    
     |  4231 |    24 | t            |  36892
  chunk_alloc_zero_aligned | execTuples.c     | MakeTupleTableSlot              
     |   118 |    96 | t            |  36892
  chunk_alloc_zero_aligned | gram.y           | makeColumnRef                   
     | 12286 |    24 | t            |  36892
  chunk_alloc              | list.c           | new_head_cell                   
     |    93 |    16 | t            |  18529
  chunk_alloc              | genam.c          | RelationGetIndexScan            
     |    77 |   104 | t            |  18489
  chunk_alloc              | nbtree.c         | btbeginscan                     
     |   385 |  6608 | t            |  18489
  chunk_alloc              | tupdesc.c        | CreateTemplateTupleDesc         
     |    63 |   160 | f            |  18479
  chunk_alloc              | genam.c          | RelationGetIndexScan            
     |    89 |    72 | f            |  18471
  chunk_alloc              | nbtree.c         | btbeginscan                     
     |   388 |    72 | f            |  18471
  chunk_alloc              | resowner.c       | ResourceOwnerEnlargeBuffers     
     |   524 |    64 | t            |  18448
  chunk_alloc              | trigger.c        | AfterTriggerBeginXact           
     |  3607 |   112 | t            |  18447
  chunk_alloc              | trigger.c        | AfterTriggerBeginXact           
     |  3619 |   192 | t            |  18447

splitted to contexts:

                   self                   |          action          |       
file       |                 func                 | line  | size  | 
compile_time | count  
 
-----------------------------------------+--------------------------+------------------+--------------------------------------+-------+-------+--------------+--------
  MessageContext                          | chunk_alloc              | list.c   
        | new_list                             |    68 |    16 | t            | 
848520
  MessageContext                          | chunk_alloc              | list.c   
        | new_list                             |    72 |    24 | t            | 
848520
  MessageContext                          | chunk_alloc              | list.c   
        | new_tail_cell                        |   112 |    16 | t            | 
166016
  MessageContext                          | chunk_alloc_zero         | 
bitmapset.c      | bms_make_singleton                   |   189 |     8 | f     
       | 129122
  MessageContext                          | chunk_alloc_zero_aligned | value.c  
        | makeString                           |    55 |    16 | t            | 
129122
  MessageContext                          | chunk_alloc_zero_aligned | 
makefuncs.c      | makeVar                              |    73 |    40 | t     
       | 110676
  ExecutorState                           | chunk_alloc              | list.c   
        | new_list                             |    68 |    16 | t            | 
 92230
  ExecutorState                           | chunk_alloc              | list.c   
        | new_list                             |    72 |    24 | t            | 
 92230
  MessageContext                          | chunk_alloc_zero_aligned | 
makefuncs.c      | makeTargetEntry                      |   216 |    48 | t     
       |  92230
  MessageContext                          | chunk_alloc              | 
bitmapset.c      | bms_copy                             |   119 |     8 | f     
       |  73784
  TopMemoryContext                        | chunk_alloc              | lock.c   
        | LockAcquireExtended                  |   559 |   128 | t            | 
 55441
  MessageContext                          | chunk_alloc              | 
nodeFuncs.c      | expression_tree_mutator              |  2036 |    40 | t     
       |  55338
  TopTransactionContext                   | chunk_alloc              | 
snapmgr.c        | PushActiveSnapshot                   |   282 |    24 | t     
       |  55338
  MessageContext                          | chunk_alloc              | 
stringinfo.c     | initStringInfo                       |    50 |  1024 | t     
       |  36894
  TopMemoryContext                        | chunk_alloc              | 
resowner.c       | ResourceOwnerEnlargeCatCacheRefs     |   612 |   128 | t     
       |  36893
  TopMemoryContext                        | chunk_alloc              | 
resowner.c       | ResourceOwnerEnlargeRelationRefs     |   754 |   128 | t     
       |  36893
  TopMemoryContext                        | chunk_alloc_zero         | 
resowner.c       | ResourceOwnerCreate                  |   135 |   168 | t     
       |  36893
  ExecutorState                           | chunk_alloc              | list.c   
        | new_tail_cell                        |   112 |    16 | t            | 
 36892
  ExecutorState                           | chunk_alloc              | 
nbtsearch.c      | _bt_search                           |   121 |    24 | t     
       |  36892
  ExecutorState                           | chunk_alloc              | 
stringinfo.c     | initStringInfo                       |    50 |  1024 | t     
       |  36892
  ExecutorState                           | chunk_alloc_zero_aligned | 
execQual.c       | ExecInitExpr                         |  4231 |    24 | t     
       |  36892
  ExecutorState                           | chunk_alloc_zero_aligned | 
execTuples.c     | MakeTupleTableSlot                   |   118 |    96 | t     
       |  36892
  MessageContext                          | chunk_alloc              | 
nodeFuncs.c      | expression_tree_mutator              |  2027 |    40 | t     
       |  36892
  MessageContext                          | chunk_alloc_zero_aligned | 
copyfuncs.c      | _copyVar                             |  1041 |    40 | t     
       |  36892
  MessageContext                          | chunk_alloc_zero_aligned | 
equivclass.c     | add_eq_member                        |   452 |    32 | t     
       |  36892
  MessageContext                          | chunk_alloc_zero_aligned | gram.y   
        | makeColumnRef                        | 12286 |    24 | t            | 
 36892
  TopTransactionContext                   | chunk_alloc              | 
snapmgr.c        | CopySnapshot                         |   217 |    72 | f     
       |  36892
  TopMemoryContext                        | chunk_alloc              | 
resowner.c       | ResourceOwnerEnlargeBuffers          |   524 |    64 | t     
       |  18448
  ExecutorState                           | chunk_alloc              | genam.c  
        | RelationGetIndexScan                 |    77 |   104 | t            | 
 18447
  ExecutorState                           | chunk_alloc              | genam.c  
        | RelationGetIndexScan                 |    89 |    72 | f            | 
 18447
  ExecutorState                           | chunk_alloc              | nbtree.c 
        | btbeginscan                          |   385 |  6608 | t            | 
 18447
  ExecutorState                           | chunk_alloc              | nbtree.c 
        | btbeginscan                          |   388 |    72 | f            | 
 18447
  MessageContext                          | chunk_alloc              | list.c   
        | new_head_cell                        |    93 |    16 | t            | 
 18447
  TopTransactionContext                   | chunk_alloc              | 
trigger.c        | AfterTriggerBeginXact                |  3607 |   112 | t     
       |  18447
  TopTransactionContext                   | chunk_alloc              | 
trigger.c        | AfterTriggerBeginXact                |  3619 |   192 | t     
       |  18447

1:
For that allocation profile I found that one significant part of the costs is
AllocSetFreeIndex (which is inlined in optimized mode, but still).

If we would move more stuff to the headers those very common cases with
compile time known sizes could eliminate the AllocSetFreeIndex computation
at runtime as its only dependent on the size.

The problem with that is obviously that it would violate the whole mctx.c
abstraction as it has to be known at compile time which memory manager is
used.

Are we willing to reduce that abstraction?

2:
aset.c fragments horribly for longliving contexts. I haven't generated
sensibly accessible data for that as I currently just store it in an sql
database and sql (at least my queries :P) isn't really that efficient
for figuring out when how much was allocated from that:
                             Table "public.allocations"
     Column    |  Type   |                        Modifiers                     
    
 
--------------+---------+----------------------------------------------------------
  id           | integer | not null default 
nextval('allocations_id_seq'::regclass)
  action       | text    | 
  parent       | text    | 
  self         | text    | 
  size         | bigint  | 
  addr         | bigint  | 
  file         | text    | 
  func         | text    | 
  line         | integer | 
  compile_time | boolean | 

schema when the data is sensibly large. A sensible benchmark easily produces
three digit million rows...

I guess some custom implementation for evaluating that data should be way much
better at that.

That fragmentation causes two main problems. 1 space wastage and 2. (more 
importantly imo) cache misses.

3:
Given the frequentness of very small allocations the current space overhead
of at least 16 bytes on 64bit Platforms seems quite high.

I don't think its too hard to devise a scheme (actually I have started that)
where the overhead is only sizeof(*void). But again that would reduce the 
the abstraction because it would make it harder to implement something like
pfree/repalloc/GetMemoryChunkContext which need to figure out the context
from a pointer.

If we had a separate api for small allocations we could even devise a schema
where a single chunk wouldn't have any memory overhead, but that seems to
complicated for now.


So after all this my question basically is: How important do we think the
mctx.c abstraction is?
I personally found the speed of AllocSetAlloc being the most limiting factor
in several workloads so I am willing to sacrifice some abstraction to gain
speed here. Especially as I hope its possible to write a single allocator
which is "good enough" for everything.

I currently started working on two pieces:
* A new memory manager implementation. For that the equivalent of
 AllocSetFreeIdx seems to be the biggest limit so far. Its not really ready
 for anything yet though.

* recording and replaying MemoryContext* to get a somewhat reasonable
 and reproducable micro benchmark. That unfortunately doesn't model the
 effects of cache misses that well but I got no better idea so far.

Any opinions, input, ideas?

Thanks for reading,

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to