Re: [HACKERS] GIN improvements part 1: additional information

Oleg Bartunov Wed, 18 Dec 2013 22:38:34 -0800

Guys,

before digging deep into the art of the comp/decomp world, I'd like to
know whether you are familiar with the results of the
http://wwwconference.org/www2008/papers/pdf/p387-zhangA.pdf paper and
some newer research. Do we agree on what we really want? Basically,
there are three main features: size, compression speed, and
decompression speed - we should take two :)


Should we design a sort of plugin that supports independent storage on
disk, so users can apply different techniques depending on their
data?

What I want to say is that we certainly can play with this very
challenging task, but we have limited time before 9.4, and we should
think in a positive direction.

Oleg

On Wed, Dec 18, 2013 at 6:50 PM, Heikki Linnakangas
<[email protected]> wrote:
> On 12/18/2013 01:45 PM, Alexander Korotkov wrote:
>>
>> On Tue, Dec 17, 2013 at 2:49 AM, Heikki Linnakangas
>> <[email protected]> wrote:
>>
>>> On 12/17/2013 12:22 AM, Alexander Korotkov wrote:
>>>>
>>>> 2) Storage would be easily extendable to hold additional information as
>>>> well.
>>>> Better compression shouldn't block more serious improvements.
>>>>
>>>
>>> I'm not sure I agree with that. For all the cases where you don't care
>>> about additional information - which covers all existing users, for
>>> example - reducing disk size is pretty important. How are you planning
>>> to store the additional information, and how does using another
>>> encoding get in the way of that?
>>
>>
>> I was planning to store additional information datums between
>> varbyte-encoded tids. I expected it would be hard to do with PFOR.
>> However, I don't see significant problems in your implementation of
>> Simple9 encoding. I'm going to dig deeper into your version of the patch.
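
For readers unfamiliar with the varbyte encoding mentioned above, here is a
minimal sketch of the generic technique, not code from the patch. It stores
an integer in 7-bit groups, low bits first, with the high bit of each byte
meaning "more bytes follow". Because every value ends on a byte boundary,
extra data could plausibly be placed between values, which is presumably why
it suits per-item additional information. The function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode val as a varbyte sequence into out[].
 * Returns the number of bytes written (at most 10 for a 64-bit value).
 * Hypothetical helper for illustration only.
 */
static size_t
varbyte_encode(uint64_t val, unsigned char *out)
{
    size_t n = 0;

    while (val >= 0x80)
    {
        /* low 7 bits of val, with the continuation bit set */
        out[n++] = (unsigned char) (val | 0x80);
        val >>= 7;
    }
    out[n++] = (unsigned char) val;   /* final byte, continuation bit clear */
    return n;
}

/*
 * Decode one varbyte sequence starting at in[].
 * Stores the value in *val and returns the number of bytes consumed.
 */
static size_t
varbyte_decode(const unsigned char *in, uint64_t *val)
{
    uint64_t      result = 0;
    int           shift = 0;
    size_t        n = 0;
    unsigned char b;

    do
    {
        assert(shift < 64);           /* malformed input guard */
        b = in[n++];
        result |= (uint64_t) (b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);

    *val = result;
    return n;
}
```

For example, the value 300 encodes to the two bytes 0xAC 0x02, while any
value below 128 takes a single byte. In a posting list one would normally
encode the gaps between consecutive tids rather than the tids themselves, so
most values stay small.
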
>
>
> Ok, thanks.
>
> I had another idea about the page format this morning. Instead of having the
> item-indexes at the end of the page, it would be more flexible to store a
> bunch of self-contained posting list "segments" on the page. So I propose
> that we get rid of the item-indexes, and instead store a bunch of
> independent posting lists on the page:
>
> typedef struct
> {
>     ItemPointerData first;   /* first item in this segment (unpacked) */
>     uint16      nwords;      /* number of words that follow */
>     uint64      words[1];    /* var length */
> } PostingListSegment;
>
> Each segment can be encoded and decoded independently. When searching for a
> particular item (like on insertion), you skip over segments where 'first' >
> the item you're searching for.
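
A minimal sketch of one reading of the lookup described above: walk the
segments in order until one whose 'first' is greater than the target, and
take the segment just before it as the only one that can hold the target.
The types are simplified stand-ins (ItemPtr for ItemPointerData,
SegmentHeader for the proposed struct without its words), and the segment
headers are kept in a plain array here, whereas on a real page one would step
over 'nwords' words to reach the next segment. The names find_segment and
item_cmp are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for ItemPointerData: block number plus offset. */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} ItemPtr;

/* Simplified segment header; the encoded words are omitted in this sketch. */
typedef struct
{
    ItemPtr  first;    /* first item in this segment, stored unpacked */
    uint16_t nwords;   /* number of encoded words that would follow */
} SegmentHeader;

/* Three-way comparison: negative, zero, or positive like strcmp(). */
static int
item_cmp(ItemPtr a, ItemPtr b)
{
    if (a.block != b.block)
        return a.block < b.block ? -1 : 1;
    if (a.offset != b.offset)
        return a.offset < b.offset ? -1 : 1;
    return 0;
}

/*
 * Return the index of the last segment whose 'first' <= target, or -1 if
 * the target sorts before every segment. Only that segment needs decoding.
 */
static int
find_segment(const SegmentHeader *segs, int nsegs, ItemPtr target)
{
    int found = -1;

    assert(nsegs >= 0);
    for (int i = 0; i < nsegs; i++)
    {
        if (item_cmp(segs[i].first, target) > 0)
            break;              /* this and all later segments start too late */
        found = i;
    }
    return found;
}
```

The point of the design is visible in the loop: only the unpacked 'first'
fields are compared, so no segment is decoded until the right one has been
found.
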
>
> This format offers a lot more flexibility compared to the separate item
> indexes. First, we don't need to have another fixed sized area on the page,
> which simplifies the page format. Second, we can more easily re-encode only
> one segment on the page, on insertion or vacuum. The latter is particularly
> important with the Simple-9 encoding, which operates one word at a time
> rather than one item at a time; removing or inserting an item in the middle
> can require a complete re-encoding of everything that follows. Third, when a
> page is being inserted into and contains only uncompressed items, you don't
> waste any space for unused item indexes.
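
To illustrate why Simple-9 "operates one word at a time", here is a minimal
sketch of the classic Simple-9 scheme from the literature: each 32-bit word
carries a 4-bit selector and 28 payload bits, split into one of nine
fixed layouts. This is an assumption-laden illustration, not the patch's
code; the struct quoted above uses uint64 words, so the patch evidently uses
a wider variant. The names s9_modes, s9_encode and s9_decode are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The nine (count, bits-per-value) layouts of the 28-bit payload. */
static const struct { int count; int bits; } s9_modes[9] = {
    {28, 1}, {14, 2}, {9, 3}, {7, 4}, {5, 5}, {4, 7}, {3, 9}, {2, 14}, {1, 28}
};

/*
 * Greedily pack n values (each must be < 2^28) into out[].
 * out needs room for n words in the worst case.
 * Returns the number of words written.
 */
static size_t
s9_encode(const uint32_t *in, size_t n, uint32_t *out)
{
    size_t pos = 0, nwords = 0;

    while (pos < n)
    {
        int m;

        /* Try the densest layout first; take the first one that fits. */
        for (m = 0; m < 9; m++)
        {
            int count = s9_modes[m].count;
            int bits = s9_modes[m].bits;
            int fits = (n - pos >= (size_t) count);

            for (int i = 0; fits && i < count; i++)
                fits = (in[pos + i] >> bits) == 0;
            if (!fits)
                continue;

            uint32_t word = (uint32_t) m << 28;   /* selector in top 4 bits */
            for (int i = 0; i < count; i++)
                word |= in[pos + i] << (i * bits);
            out[nwords++] = word;
            pos += count;
            break;
        }
        assert(m < 9);   /* fails only if a value needs more than 28 bits */
    }
    return nwords;
}

/* Unpack nwords words into out[]; returns the number of values decoded. */
static size_t
s9_decode(const uint32_t *in, size_t nwords, uint32_t *out)
{
    size_t n = 0;

    for (size_t w = 0; w < nwords; w++)
    {
        int m = (int) (in[w] >> 28);

        assert(m < 9);
        int      count = s9_modes[m].count;
        int      bits = s9_modes[m].bits;
        uint32_t mask = (UINT32_C(1) << bits) - 1;

        for (int i = 0; i < count; i++)
            out[n++] = (in[w] >> (i * bits)) & mask;
    }
    return n;
}
```

Since the layout of each word depends on how many following values fit,
adding or removing one value in the middle shifts the word boundaries of
everything after it. That is the re-encoding cost mentioned above, and
keeping segments independent limits it to a single segment.
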
>
> While we're at it, I think we should use the above struct in the inline
> posting lists stored directly in entry tuples. That wastes a few bytes
> compared to the current approach in the patch (more alignment, and 'words'
> is redundant with the number of items stored on the tuple header), but it
> simplifies the functions handling these lists.
>
>
> - Heikki
>
>
> --
> Sent via pgsql-hackers mailing list ([email protected])
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


