Hello,

We (David and I) recently observed that a Zedstore table can become considerably
bloated when we load data into it with concurrent copies. We also found that
concurrent insert performance was less than desirable. This is a detailed
analysis of the extent of the problem, its cause, and how we fixed it. It has
input from much of our team: Alex, Ashwin, David, Heikki, Melanie, Taylor and
myself.

An example of the bloat that we observed:
TPC-DS scale = 270:
Table        heap   zed(serial)   zed(16 parallel COPYs)
web_sales    39G    19G           39G

We found that it was caused by inefficient page splits resulting from
out-of-tid-order inserts into full (or nearly full) attribute tree leaf pages.
The degree of under-utilization was significant: attribute tree leaves from a
serial data load held 6-8x more datums than the attribute tree leaves resulting
from a parallel load with 16 sessions.

Consider the scenario below:

Assumptions:
1. Let us consider two concurrent COPY commands executing (sessions S1 and S2).
2. The table has only one (fixed-length, for the sake of argument) attribute 'a'.
3. For attribute 'a', a full attribute tree leaf page can accommodate 1500 datums.

TID allocations:
S1: 1-1000
S2: 1001-2000, 2001-3000

Order of operations:

1. S2 writes datums for tids 1001-2000, 2001-3000.
The resulting leaves are:
L1:
lokey    = 1        hikey   = 2500
firsttid = 1001     lasttid = 2500
L2:
lokey    = 2501     hikey   = MaxZSTid
firsttid = 2501     lasttid = 3000

2. S1 now writes datums for tids 1-1000.
We have to split L1 into L1' and L1''.
L1':
lokey    = 1        hikey   = 1500
firsttid = 1        lasttid = 1500
L1'': [under-utilized page]
lokey    = 1501     hikey   = 2000
firsttid = 1501     lasttid = 2000
L2:
lokey    = 2501     hikey   = MaxZSTid
firsttid = 2501     lasttid = 3000

Note: The lokeys/hikeys reflect the ranges of what CAN be inserted into a leaf,
whereas firsttid and lasttid reflect what actually has been inserted.

L1'' will be an under-utilized page that is not going to be filled again,
because it inherits the tight hikey from L1. In this example, the space wastage
in L1'' is 66% (it holds only 500 datums out of a capacity of 1500), but it
could very easily be close to 100%, especially under concurrent workloads that
mix single and multi-inserts, or even unequally sized multi-inserts.

Solution (kudos to Ashwin!):

For every multi-insert (and only multi-inserts, not singleton inserts), allocate
N times more tids than requested. Each session keeps the extra tids in a buffer,
and subsequent calls to multi-insert consume these buffered tids. If at any time
a tid allocation request cannot be met by the remaining buffered tids, a new
batch of N times the number of tids requested is allocated again.
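
To make the scheme concrete, here is a minimal sketch of the idea in C. The
names (TidBuffer, multi_insert_get_tids, reserve_tids_from_tid_tree) and the
toy allocator are stand-ins invented for illustration, not the actual zedstore
code:

#include <stdint.h>

typedef uint64_t zstid;     /* stand-in for zedstore's tid type */

/* N: how many times more tids we reserve than a multi-insert asks for */
#define MULTI_INSERT_TID_RESERVATION_FACTOR 100

typedef struct TidBuffer
{
    zstid   next_tid;       /* next unused tid in the reserved range */
    zstid   end_tid;        /* one past the last reserved tid */
} TidBuffer;

/*
 * Toy stand-in for the real allocator: extend the tid tree by a contiguous
 * range of ntids tids and return the first tid of that range.
 */
static zstid
reserve_tids_from_tid_tree(int ntids)
{
    static zstid next_free = 1;
    zstid   first = next_free;

    next_free += ntids;
    return first;
}

/*
 * Hand out the first tid of a contiguous range of ntids tids for a
 * multi-insert.  When the session's buffered range cannot satisfy the
 * request, reserve N times the requested amount in a single tid tree
 * access and buffer the surplus for later multi-inserts.
 */
static zstid
multi_insert_get_tids(TidBuffer *buf, int ntids)
{
    zstid   first;

    if (buf->next_tid + ntids > buf->end_tid)
    {
        int     nreserve = ntids * MULTI_INSERT_TID_RESERVATION_FACTOR;

        buf->next_tid = reserve_tids_from_tid_tree(nreserve);
        buf->end_tid = buf->next_tid + nreserve;
    }

    first = buf->next_tid;
    buf->next_tid += ntids;
    return first;
}

Singleton inserts bypass this buffer entirely; whatever is left in a session's
buffer at the end of its load is what has to be killed, which we cost out at
the end of this mail.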

If we take the same example above and say that we allocated N = 5 times the
number of tids upon the first request for 1000 tids:

TID allocations:
S1: 1-5000
S2: 5001-10000

Order of operations:

1. S2 writes datums for tids 5001-6000, 6001-7000.
The resulting leaves are:
L1:
lokey    = 1        hikey   = 6500
firsttid = 5001     lasttid = 6500
L2:
lokey    = 6501     hikey   = MaxZSTid
firsttid = 6501     lasttid = 7000

2. S1 writes datums for tids 1-1000.
L1 will be split into L1' and L1''.

L1':
lokey    = 1        hikey   = 5500
firsttid = 1        lasttid = 1000
L1'': [under-utilized page]
lokey    = 5501     hikey   = 6500
firsttid = 5501     lasttid = 6500
L2:
lokey    = 6501     hikey   = MaxZSTid
firsttid = 6501     lasttid = 7000

Subsequent inserts by S1 will land on L1' whose hikey isn't restrictive.

However, we do end up with the inefficient page L1''. With a high enough value
of N, we reduce the frequency of such pages. We could further reduce this
wastage by incorporating a special left split: since L1 was already full, we
don't change it at all - we simply update its lokey, so that L1 becomes L1'',
and we fork off a new leaf to its left: L1'. This would look like:

L1':
lokey    = 1        hikey   = 5000
firsttid = 1        lasttid = 1000

L1'':
lokey    = 5001     hikey   = 6500
firsttid = 5001     lasttid = 6500

We found that with a high enough value of N, we did not get significant space
benefits from the left split. Thus, we decided to only incorporate N.

Results: [TPC-DS scale = 50, 16 conc copies]

Table              zed     N=10    N=100    N=1000    heap    zed(serial)
catalog_sales      15G     9.1G    7.7G     7.5G      15G     8.0G
catalog_returns    1.5G    0.9G    0.7G     0.7G      1.2G    0.8G
store_returns      2.1G    1.2G    1.1G     1.1G      1.9G    1.2G
store_sales        17G     11G     10.1G    10.1G     21G     10G

Load time:
N=10      30min
N=100     10min
N=1000    7min
zed       100min
heap      8min

'zed' refers to the zedstore branch without our fix. We see that with N = 10,
the table sizes get close to what we get with serial inserts. For N = 100, we
even beat serial insert. We can attribute the differences in runtime to the
fact that by lowering the number of tid range requests, we reduce the
contention on the tid tree, which is a bottleneck for concurrent loads. A
significant win!

How N relates to the other parameters in play:

Let S be the number of concurrent sessions.
Let T be the average number of rows that a session wants to write, in
multi-insert batches of size t.
Let A be the number of attributes.
Number of times a session multi-inserts into the tid tree without buffered
allocation = T/t
Number of times a session multi-inserts into the tid tree with buffered
allocation = T/(Nt)
Total number of multi-inserts into the tid tree = Mt = ST/(Nt)
Also, the total number of adverse insert cases where we could have bloat ∝ Mt,
so bloat ∝ Mt.
Run-time of a parallel data load ∝ Mt * A
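
As a concrete illustration (the numbers here are made up for illustration, not
taken from our tests): with S = 16 sessions, T = 10 million rows per session
and t = 1000 rows per multi-insert batch, unbuffered allocation touches the tid
tree S * T/t = 16 * 10,000 = 160,000 times in total, whereas buffered
allocation with N = 100 touches it only S * T/(Nt) = 16 * 100 = 1,600 times.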
So the guidance would be to increase N with an increase in S or in T (t will be
relatively constant for a given table - it is constrained by the size of a row
and the COPY buffer), and also if the table is significantly wide. We can see
that it is difficult to provide a default for N; it really should be a GUC.
Also, SINGLE_INSERT_TID_RESERVATION_THRESHOLD and
SINGLE_INSERT_TID_RESERVATION_SIZE should be turned into GUCs. In our
implementation, we treat MULTI_INSERT_TID_RESERVATION_FACTOR = N. We leave the
GUC implementation for later.
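
Purely for illustration, a GUC for N could look roughly like the sketch below.
This is not code from the zedstore branch - it uses the extension-facing
DefineCustomIntVariable() API and hypothetical names, whereas an in-core
implementation would instead add an entry to the guc tables:

#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

/* the N discussed above; 100 is an arbitrary placeholder default */
static int multi_insert_tid_reservation_factor = 100;

void _PG_init(void);

void
_PG_init(void)
{
    DefineCustomIntVariable("zedstore.multi_insert_tid_reservation_factor",
                            "Multiplier applied to tid reservations made for multi-inserts.",
                            NULL,
                            &multi_insert_tid_reservation_factor,
                            100,            /* boot value */
                            1,              /* min */
                            1000000,        /* max */
                            PGC_USERSET,
                            0,
                            NULL, NULL, NULL);
}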

Cost of killing the extra unused tids not consumed by multi-inserts:

The maximum number of tids that can be wasted, W, is capped at (tN - 1) * S.
This is the worst case: the last tid allocation request in every session used
only 1 tid out of the tN tids it received and buffered. So the average case is
~ (tN/2) * S wasted tids. The number of times the tid tree has to be accessed
to delete these (tN/2) * S tids is S (one pass per session). So, taking tid
wastage into account, the average number of accesses to the tid tree is
Mt + S = ST/(Nt) + S. This additional cost of S is small, and thus the cost of
tid killing is not really significant.
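
Continuing the illustrative numbers from above (t = 1000, N = 100, S = 16): in
the worst case W = (1000 * 100 - 1) * 16, i.e. roughly 1.6 million wasted tids,
and ~800,000 on average, but killing them adds only S = 16 extra accesses to
the tid tree on top of the Mt = 1,600 accesses the load already makes - about
a 1% overhead.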

Regards,
Soumyadeep & David
