This is a design proposal for a matrix data type which can be larger
than 1GB. It needs not only new data type support but also a platform
enhancement, because the existing varlena format has a hard limit (1GB).
We discussed this topic at the developer unconference in
Tokyo/Akihabara, the day before PGconf.ASIA. The design proposal
below stands on the overall consensus of that discussion.

The varlena format has either a short (1-byte) or a long (4-byte) header.
The long header is used for the in-memory structure referenced by
VARSIZE()/VARDATA(), and for the on-disk structure when the datum is
larger than 126 bytes but below the TOAST threshold. Otherwise, the
short format is used, i.e. when the varlena is less than 126 bytes or
is stored externally in the toast relation. No varlena representation
supports a datum larger than 1GB.
On the other hand, some use cases which handle (relatively) big data
in the database are interested in variable-length datums larger than 1GB.

One example is what I presented at PGconf.SV.
A PL/CUDA function takes two arguments (2D-arrays of int4 instead of
matrices), then returns the top-N combinations of chemical compounds
according to their similarity, like this:

    SELECT knn_gpu_similarity(5, Q.matrix,
                                 D.matrix)
      FROM (SELECT array_matrix(bitmap) matrix
              FROM finger_print_query
             WHERE tag LIKE '%abc%') Q,
           (SELECT array_matrix(bitmap) matrix
              FROM finger_print_10m) D;

array_matrix() is an aggregate function that generates a 2D-array
containing the entire input relation stream.
It works fine as long as the data size is less than 1GB. Once it
exceeds that boundary, the user has to split the 2D-array manually,
even though it is not uncommon for recent GPU models to have more
than 10GB of RAM.

This is a real problem when the mathematical problem cannot be split
into several portions appropriately, and it is uncomfortable for users
that they cannot use the full capability of the GPU device.

Our problem is that the varlena format does not permit a variable-
length datum larger than 1GB, even though our "matrix" type wants to
carry a bunch of data larger than 1GB.
So, we need to solve this restriction of the varlena format prior to
the matrix type implementation.

At the developer unconference, people discussed three ideas for this
problem. The overall consensus was the idea of a special data type
which can contain multiple indirect references to other data chunks.
Both the main part and the referenced data chunks are each less than
1GB, but the total amount of data they can represent is more than 1GB.

For example, even if we have a large matrix of around 8GB, its
sub-matrices separated into 9 portions (3x3) are individually less
than 1GB.
The problem arises when we try to save a matrix which contains
indirect references to its sub-matrices, because toast_save_datum()
writes out the flat portion just after the varlena header onto a tuple
or a toast relation as-is.
If the main portion of the matrix contains raw pointers (indirect
references), that is obviously broken. We need an infrastructure to
serialize the indirect references prior to saving.

BTW, the other, less-acknowledged ideas were a 64-bit varlena header
and utilization of large objects. The former breaks the current data
format, thus would have unexpected side-effects on the existing code
of the PostgreSQL core and of extensions. The latter requires users to
construct a large object beforehand. That makes it impossible to use
the interim result of a sub-query, and leads to unnecessary I/O for
preparation.

I'd like to propose a new optional type handler, 'typserialize', to
serialize an in-memory varlena structure (which can have indirect
references) to the on-disk format.
If provided, it shall be invoked at the head of
toast_insert_or_update(), so that indirect references are transformed
into something else which is safe to save. (My expectation is that
the 'typserialize' handler first saves the indirect chunks to the
toast relation, then puts toast pointers in their place.)

On the other hand, it is uncertain whether we need a symmetric
'typdeserialize' handler. Because all the functions/operators which
support such a special data type should know its internal format, it
is possible to load the indirectly referenced data chunks on demand.
That will be beneficial from the performance perspective if functions/
operators touch only a part of the large structure, because the rest
of the portions need not be loaded into memory.

One thing I'm not certain about is whether we are allowed to update
the datum supplied as an argument to functions/operators.
Let me assume the data structure below:

  struct {
      int32   varlena_head;   /* long (4-byte) varlena header */
      Oid     element_id;     /* element data type of the matrix */
      int     matrix_type;
      int     blocksz_x;      /* horizontal size of each block */
      int     blocksz_y;      /* vertical size of each block */
      int     gridsz_x;       /* horizontal # of blocks */
      int     gridsz_y;       /* vertical # of blocks */
      struct {
          Oid     va_valueid;
          Oid     va_toastrelid;
          void   *ptr_block;
      } blocks[FLEXIBLE_ARRAY_MEMBER];
  };

If and when this structure is fetched from a tuple, each @ptr_block
is initialized to NULL. Once the datum is supplied to a function which
references a part of the blocks, type-specific code can load the
sub-matrix from the toast relation, then update @ptr_block so the
sub-matrix is not loaded from the toast relation multiple times.
I'm not certain whether this is acceptable behavior/manner.

If it is OK, it seems to me the direction to support matrices larger
than 1GB is all green.

Your comments are welcome.

* Beyond the 1GB limitation of varlena

* PGconf.SV 2016 and PL/CUDA

* PL/CUDA slides at PGconf.ASIA (English)

Best regards,
KaiGai Kohei <>

Sent via pgsql-hackers mailing list