This is a design proposal for a matrix data type which can be larger
than 1GB. It needs not only new data type support but also a platform
enhancement, because the existing varlena format has a hard limit (1GB).
We discussed this topic at the developer unconference in
Tokyo/Akihabara, the day before PGconf.ASIA. The design proposal
below stands on the overall consensus of that discussion.

The varlena format has either a short (1-byte) or a long (4-byte) header.
The long header is used for the in-memory structure referenced by
VARSIZE()/VARDATA(), and for the on-disk structure when the datum is
larger than 126 bytes but below the TOAST threshold. Otherwise, the
short format is used, i.e. when the varlena is less than 126 bytes or
is stored externally in the toast relation. No varlena representation
supports a datum larger than 1GB.
On the other hand, some use cases which handle (relatively) big data
in the database are interested in variable-length datums larger than 1GB.

One example is what I presented at PGconf.SV.
A PL/CUDA function takes two arguments (2D-arrays of int4 instead of
matrices), then returns the top-N combinations of chemical compounds
according to their similarity, like this:

    SELECT knn_gpu_similarity(5, Q.matrix,
                                 D.matrix)
      FROM (SELECT array_matrix(bitmap) matrix
              FROM finger_print_query
             WHERE tag LIKE '%abc%') Q,
           (SELECT array_matrix(bitmap) matrix
              FROM finger_print_10m) D;

array_matrix() is an aggregate function that generates a 2D-array
containing the entire input relation stream.
It works fine as long as the data size is less than 1GB. Once it
exceeds that boundary, the user has to split the 2D-array manually,
even though it is not uncommon for recent GPU models to have more
than 10GB of RAM.

This is a real problem when the mathematical problem cannot be split
into several portions appropriately, and it is uncomfortable for users
that they cannot use the full capability of the GPU device.

Our problem is that the varlena format does not permit a variable-
length datum larger than 1GB, even though our "matrix" type wants to
carry a bunch of data larger than 1GB.
So, we need to solve this restriction of the varlena format prior to
the matrix type implementation.

At the developer unconference, people discussed three ideas for this
problem. The overall consensus was the idea of a special data type
which can contain multiple indirect references to other data chunks.
Both the main part and the referenced data chunks are each less than
1GB, but the total amount of data they can represent is more than 1GB.

For example, even if we have a large matrix of around 8GB, its
sub-matrices separated into 9 portions (3x3) are individually less
than 1GB.
The problem arises when we try to save a matrix which contains
indirect references to its sub-matrices, because toast_save_datum()
writes out the flat portion just after the varlena header onto a tuple
or a toast relation as-is.
If the main portion of the matrix contains raw pointers (indirect
references), that is obviously broken. We need an infrastructure to
serialize the indirect references prior to saving.

BTW, the other, less-acknowledged ideas were a 64-bit varlena header
and utilization of large objects. The former breaks the current data
format, thus would have unexpected side-effects on the existing code
of the PostgreSQL core and of extensions. The latter requires users to
construct a large object beforehand. That makes it impossible to use
the interim result of a sub-query, and leads to unnecessary I/O for
preparation.

I'd like to propose a new optional type handler, 'typserialize', to
serialize an in-memory varlena structure (which can have indirect
references) to the on-disk format.
If provided, it shall be invoked at the head of
toast_insert_or_update(), so that indirect references are transformed
into something else which is safe to save. (My expectation is that
the 'typserialize' handler first saves the indirect chunks to the
toast relation, then puts toast pointers in their place.)

On the other hand, it is uncertain whether we need a symmetric
'typdeserialize' handler. Because all the functions/operators which
support such a special data type should know its internal format, it
is possible to load the indirectly referenced data chunks on demand.
That will be beneficial from the performance perspective if functions/
operators touch only a part of the large structure, because the rest
of the portions need not be loaded into memory.

One thing I'm not certain about is whether we are allowed to update
the datum supplied as an argument to functions/operators.
Let me assume the data structure below:

  struct {
      int32   varlena_head;   /* long (4-byte) varlena header */
      Oid     element_id;     /* element data type of the matrix */
      int     matrix_type;
      int     blocksz_x;      /* horizontal size of each block */
      int     blocksz_y;      /* vertical size of each block */
      int     gridsz_x;       /* horizontal # of blocks */
      int     gridsz_y;       /* vertical # of blocks */
      struct {
          Oid     va_valueid;
          Oid     va_toastrelid;
          void   *ptr_block;
      } blocks[FLEXIBLE_ARRAY_MEMBER];
  };

If and when this structure is fetched from a tuple, each @ptr_block
is initialized to NULL. Once the datum is supplied to a function which
references a part of the blocks, type-specific code can load the
sub-matrix from the toast relation, then update @ptr_block so the
sub-matrix is not loaded from the toast relation multiple times.
I'm not certain whether this is acceptable behavior/manner.

If it is OK, it seems to me the direction to support matrices larger
than 1GB is all green.

Your comments are welcome.

* Beyond the 1GB limitation of varlena

* PGconf.SV 2016 and PL/CUDA

* PL/CUDA slides at PGconf.ASIA (English)

Best regards,
KaiGai Kohei <>

Sent via pgsql-hackers mailing list