Re: [HACKERS] GIN improvements part 1: additional information

Tomas Vondra Tue, 21 Jan 2014 17:40:35 -0800

On 21.1.2014 22:21, Heikki Linnakangas wrote:
> On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:
>> On 01/21/2014 04:02 AM, Tomas Vondra wrote:
>>> On 20.1.2014 19:30, Heikki Linnakangas wrote:
>>>>
>>>> Attached is a yet another version, with more bugs fixed and more
>>>> comments added and updated. I would appreciate some heavy-testing of
>>>> this patch now. If you could re-run the tests you've been using,
>>>> that could be great. I've tested the WAL replay by replicating GIN
>>>> operations over streaming replication. That doesn't guarantee it's
>>>> correct, but it's a good smoke test.
>>>
>>> I gave it a try - the OOM error seems to be gone, but now get this
>>>
>>>      PANIC:  cannot insert duplicate items to GIN index page
>>>
>>> This only happens when building the index incrementally (i.e. using a
>>> sequence of INSERT statements into a table with GIN index). When I
>>> create a new index on a table (already containing the same dataset) it
>>> works just fine.
>>>
>>> Also, I tried to reproduce the issue by running a simple plpgsql loop
>>> (instead of a complex python script):
>>>
>>> DO LANGUAGE plpgsql $$
>>> DECLARE
>>>       r tsvector;
>>> BEGIN
>>>       FOR r IN SELECT body_tsvector FROM data_table LOOP
>>>           INSERT INTO idx_table (body_tsvector) VALUES (r);
>>>       END LOOP;
>>> END$$;
>>>
>>> where data_table is the table with imported data (the same data I
>>> mentioned in the post about OOM errors), and index_table is an empty
>>> table with a GIN index. And indeed it fails, but only if I run the block
>>> in multiple sessions in parallel.
>>
>> Oh, I see what's going on. I had assumed that there cannot be duplicate
>> insertions into the posting tree, but that's dead wrong. The fast
>> insertion mechanism depends on a duplicate insertion to do nothing.
> 
> Ok, this turned out to be a much bigger change than I thought...
> 
> In principle, it's not difficult to eliminate duplicates. However, to
> detect a duplicate, you have to check if the item is present in the
> uncompressed items array, or in the compressed lists. That requires
> decoding the segment where it should be.
> 
> But if we decode the segment, what's the purpose of the uncompressed
> items array? Its purpose was to speed up insertions, by buffering them
> so that we don't need to decode and re-encode the page/segment on every
> inserted item. But if we're paying the price of decoding it anyway, we
> might as well include the new item and re-encode the segment. The
> uncompressed area saves some effort in WAL-logging, as the record of
> inserting an entry into the uncompressed area is much smaller than that
> of re-encoding part of the page, but if that really is a concern, we
> could track more carefully which parts of the page is modified, and only
> WAL-log the required parts. And hopefully, the fast-update lists buffer
> inserts so that you insert many items at a time to the posting tree, and
> the price of re-encoding is only paid once.
> 
> So, now that I think about this once more, I don't think the
> uncompressed area in every leaf page is a good idea.
> 
> I refactored the way the recompression routine works again. It is now
> more clearly a multi-step process. First, the existing page is
> "disassembled" into an in-memory struct, which is a linked list of all
> the segments. In-memory each segment can be represented as an array of
> item pointers, or in the compressed format. In the next phase, all the
> new items are added to the segments where they belong. This naturally
> can lead to overly large segments; in the third phase, the items are
> redistributed among the segments, and compressed back to the compressed
> format. Finally, the finished segments are written back to the page, or
> pages if it had to be split.
> 
> The same subroutines to disassemble and recompress are also used in vacuum.
> 
> Attached is what I've got now. This is again quite heavily-changed from
> the previous version - sorry for repeatedly rewriting this. I'll
> continue testing and re-reviewing this tomorrow.


I've repeated the tests, and it actually finishes OK with this patch.
The incremental load works OK, does not fail because of OOM or PANIC
errors, or any other issues. Subsequent querying seems to work fine too.

The sizes before / after applying the patch look like this:

  object  | raw size (old) | raw size (new) |   diff
----------|------------------------------------------
  table   |           6093 |           6093 |  00.0%
  index A |           2288 |           2035 | -11.0%
  index B |             96 |             96 |  00.0%

and after running VACUUM FULL

  object  | vacuumed (old) | vacuumed (new) |   diff
----------|------------------------------------------
  table   |           6416 |           6416 |  00.0%
  index A |            678 |            415 | -38.8%
  index B |             42 |             20 | -52.4%

There results are slightly different than on my previous runs - for
example in my post from 2013/10/12 I've posted this (after vacuuming)

                   |   HEAD   |  patched |
    ============================================
    subject index  |   42 MB  |    15 MB | -75%
    body index     |  624 MB  |   328 MB | -47%

It was collected with slightly different test code / data, but I believe
the ratios should be about the same. However the differences are rather
significant. Is that because of the recent changes in encoding? Could
that be improved somehow?

What annoys me a bit is the huge size difference between the index
updated incrementally (by a sequence of INSERT commands), and the index
rebuilt from scratch using VACUUM FULL. It's a bit better with the patch
(2288 vs. 2035 MB), but is there a chance to improve this?

Anyway, is it safe to apply the second patch on top of this one?

regards
Tomas


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GIN improvements part 1: additional information

Reply via email to