At 2020-05-21 15:04:55, "Fabien COELHO" <[email protected]> wrote:
> Hello,
>
> My 0.02, some of which may just show some misunderstanding on my part:
>
> - Could this be proposed as some kind of extension, provided that enough
>   hooks are available? ISTM that foreign tables and/or alternative storage
>   engines (aka ACCESS METHOD) provide convenient APIs which could fit the
>   need for these? Or are they not appropriate? You seem to suggest that
>   they are not.
>
>   If not, what could be done to improve the API to allow what you are
>   seeking to do? Maybe you need a somehow lower-level programmable API
>   which does not exist already, or at least is not exported already, but
>   could be specified and implemented with limited effort? Basically you
>   would like to read/write pg pages to somewhere, and then there is the
>   syncing issue to consider. Maybe such a "page storage" API could provide
>   benefit for some specialized hardware, eg persistent memory stores, so
>   there would be more reason to define it anyway? I think it might be
>   valuable to give it some thought.

Thank you for giving so many comments.

In my opinion, developing a foreign table interface or a new storage engine
would, in addition to the compression itself, require a lot of extra work. A
similar explanation was given in Nikolay P's email.

The "page storage" API may be a good choice, and I will consider it, but I
have not yet figured out how to implement it.
> - Could you maybe elaborate on how your plan differs from [4] and [5]?

My solution is similar to CFS: it is also embedded in the file access layer
(fd.c, md.c) and implements the mapping from a block number to the file and
location where the compressed data is stored. The most important difference
is that I hope to avoid the need for GC through the design of the page
layout.

https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net
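To illustrate the kind of mapping I have in mind, here is a rough C sketch
(all names and constants are hypothetical, not taken from an actual patch):
each page owns a fixed-size entry in the address file, so locating its
compressed data is plain arithmetic rather than a search, which is what
should make GC unnecessary.

#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

#define COMPRESS_BLCKSZ 4096      /* hypothetical compressed block size */

/* Hypothetical fixed-size entry in the compress address file. */
typedef struct CompressAddrEntry
{
    uint16_t compressed_size;     /* 0 => the page is stored uncompressed */
    uint32_t data_blocks[2];      /* compressed block numbers in the data file */
} CompressAddrEntry;

/* Direct addressing: the entry for a page never moves, so updating a page
 * rewrites its entry in place and no garbage collection is needed. */
static off_t
addr_entry_offset(uint32_t blkno)
{
    return (off_t) blkno * sizeof(CompressAddrEntry);
}

int
main(void)
{
    printf("entry for block 42 at offset %lld\n",
           (long long) addr_entry_offset(42));
    return 0;
}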
>> The most difficult thing in CFS development is certainly
>> defragmentation. In CFS it is done using background garbage collection,
>> by one or more GC worker processes. The main challenges were to minimize
>> its interaction with normal work of the system, make it fault tolerant
>> and prevent unlimited growth of data segments.
>>
>> CFS is not introducing its own storage manager, it is mostly embedded in
>> the existing Postgres file access layer (fd.c, md.c). It allows reusing
>> the code responsible for mapping relations and the file descriptor cache.
>> As was recently discussed in hackers, it may be a good idea to separate
>> the questions "how to map blocks to filenames and offsets" and "how to
>> actually perform IO". Then it would be easier to implement a compressed
>> storage manager.
> - Have you considered keeping page headers and compressing tuple data
>   only?

In that case, we must add some information to the page header to identify
whether this is a compressed page or an uncompressed page. When a compressed
page becomes an uncompressed one, or vice versa, the original page header
would have to be modified. This is unacceptable, because it requires
modifying the page in the shared buffer and recalculating the checksum.

However, it should be feasible to put this flag in the compress address file
instead. The problem with this is that even if a page only occupies a single
compressed block, the address file still needs to be read, going from 1 IO
to 2 IOs. Then again, since the address file is very small and reading it is
basically a memory access, this cost may not be as large as I had imagined.
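To make the 1-IO-vs-2-IO point concrete, here is a hypothetical read path
(struct layout, names, and constants are all made up for illustration): the
compressed/uncompressed flag lives entirely in the address entry, so the
page image in shared buffers and its checksum never change when the storage
form changes.

#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>

#define BLCKSZ          8192    /* PostgreSQL page size */
#define COMPRESS_BLCKSZ 4096    /* hypothetical compressed block size */

typedef struct CompressAddrEntry     /* same hypothetical layout as above */
{
    uint16_t compressed_size;        /* 0 => page stored uncompressed */
    uint32_t data_blocks[2];
} CompressAddrEntry;

/*
 * Hypothetical read path.  IO #1 fetches the entry from the address file
 * (often just a memory access, since that file is tiny and cached); IO #2
 * fetches the page data.  Nothing in the page image itself marks it as
 * compressed.
 */
static ssize_t
read_page(int addr_fd, int data_fd, uint32_t blkno, char *buf /* BLCKSZ */)
{
    CompressAddrEntry entry;
    off_t addr_off = (off_t) blkno * sizeof(CompressAddrEntry);
    off_t data_off;

    if (pread(addr_fd, &entry, sizeof(entry), addr_off) != sizeof(entry))
        return -1;                                        /* IO #1 */

    data_off = (off_t) entry.data_blocks[0] * COMPRESS_BLCKSZ;

    if (entry.compressed_size == 0)
        return pread(data_fd, buf, BLCKSZ, data_off);     /* IO #2 */

    /* compressed: read compressed_size bytes, then decompress into buf
     * (the decompression call is omitted in this sketch) */
    return pread(data_fd, buf, entry.compressed_size, data_off);
}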
> - I'm not sure there is a point in going below the underlying file
>   system blocksize, quite often 4 KiB? Or maybe yes? Or is there
>   a benefit to aim at 1/4 even if most pages overflow?

My solution is mainly optimized for the scenario where the original page can
be compressed into a single compressed block. Storing an original page
across multiple compressed blocks suits workloads that are not particularly
performance sensitive but care more about the compression ratio, such as
cold data. In addition, users can also choose to compile PostgreSQL with a
16KB or 32KB BLCKSZ.
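As a concrete illustration of the tradeoff (hypothetical numbers, assuming a
4KB compressed block size): with the default 8KB BLCKSZ a page must compress
to half its size or better to hit the single-block fast path, while anything
above that spills into a second block.

#include <stdio.h>

#define COMPRESS_BLCKSZ 4096   /* hypothetical compressed block size */

/* Number of compressed blocks needed for a given compressed size. */
static unsigned
blocks_needed(unsigned compressed_size)
{
    return (compressed_size + COMPRESS_BLCKSZ - 1) / COMPRESS_BLCKSZ;
}

int
main(void)
{
    /* An 8K page compressed to 3.5K fits one block (the fast path);
     * the same page compressed to 5K needs two blocks (the slow path). */
    printf("%u block(s) for 3584 bytes\n", blocks_needed(3584));
    printf("%u block(s) for 5120 bytes\n", blocks_needed(5120));
    return 0;
}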
> - ISTM that your approach entails 3 "files". Could it be done with 2?
>   I'd suggest that the possible overflow pointers (coa) could be part of
>   the headers so that when reading the 3.1 page, then the header would
>   tell where to find the overflow 3.2, without requiring an additional
>   independent structure with very small data in it, most of it zeros.
>   Possibly this is not possible, because it would require some available
>   space in standard headers when the page is not compressible, and
>   there is not enough. Maybe creating a little room for that in
>   existing headers (4 bytes could be enough?) would be a good compromise.
>   Hmmm. Maybe the approach I suggest would only work for 1/2 compression,
>   but not for other target ratios, but I think it could be made to work
>   if the pointer can entail several blocks in the overflow table.
My solution is optimized for the scenario where the original page can be
compressed into a single compressed block. In this scenario only 1 IO is
required for reading or writing, and there is no need to access an
additional overflow address file or overflow data file.

Your suggestion made me realize that the performance difference may not be
as big as I thought (testing and comparison are required). If I give up the
pursuit of "only one IO", the file layout can be simplified, for example to
the following form with only two files (the examples below use a compressed
block size of 4KB).

# Page storage(Plan B)

Use the compress address file to store the compressed block pointers, and
the compress data file to store the compressed block data.

compress address file:

             0       1       2       3
     +=======+=======+=======+=======+=======+
     | head  |   1   |   2   |  3,4  |   5   |
     +=======+=======+=======+=======+=======+

The compress address file saves the following information for each page:

- the compressed size (when the size is 0, the page is stored uncompressed)
- the block number(s) occupied in the compress data file

By the way, I want to access the compress address file through mmap, just
like snapfs does (see the sketch after the data file layout below):
https://github.com/postgrespro/snapfs/blob/pg_snap/src/backend/storage/file/snapfs.c
Compress data file:

          0         1          2          3         4
     +=========+=========+==========+=========+=========+
     |  data1  |  data2  | data3_1  | data3_2 |  data4  |
     +=========+=========+==========+=========+=========+
     |   4K    |
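Here is a minimal sketch of the mmap idea mentioned above (file geometry and
names are hypothetical): because the address file's size is fixed and known
in advance, it can be mapped once and then treated as a plain in-memory
array, similar to what snapfs does with its map files.

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGES_PER_SEGMENT 131072   /* hypothetical: 1GB segment at 8KB BLCKSZ */

typedef struct CompressAddrEntry   /* same hypothetical layout as above */
{
    uint16_t compressed_size;      /* 0 => page stored uncompressed */
    uint32_t data_blocks[2];
} CompressAddrEntry;

/* Map the whole fixed-size address file; afterwards an address lookup
 * costs no system call, just a memory access. */
static CompressAddrEntry *
map_address_file(const char *path)
{
    size_t len = sizeof(CompressAddrEntry) * PAGES_PER_SEGMENT;
    int    fd  = open(path, O_RDWR | O_CREAT, 0600);
    void  *map;

    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t) len) < 0)
    {
        close(fd);
        return NULL;
    }
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping survives the close */
    return (map == MAP_FAILED) ? NULL : (CompressAddrEntry *) map;
}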
# Page storage(Plan C)

Further, since the size of the compress address file is fixed, the above
address file and data file can also be combined into one file:

         0       1       2        123071      0         1         2
     +=======+=======+=======+   +=======+=========+=========+
     | head  |   1   |   2   |...|       |  data1  |  data2  |  ...
     +=======+=======+=======+   +=======+=========+=========+
     | head  |         address          |          data

If the difference in performance is indeed negligible, maybe Plan C is the
better solution. (Are there any other problems with it?)
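Under Plan C, both lookups become simple arithmetic because the address area
never grows. A hypothetical sketch (header size, entry size, and segment
geometry are made-up values, not measured ones):

#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

#define COMPRESS_BLCKSZ   4096    /* hypothetical compressed block size */
#define PAGES_PER_SEGMENT 131072  /* hypothetical: 1GB segment at 8KB BLCKSZ */
#define HEAD_SIZE         4096    /* hypothetical fixed file header */
#define ENTRY_SIZE        16      /* hypothetical fixed address entry size */

/* Offset of the address entry for a given page: right after the header. */
static off_t
addr_offset(uint32_t blkno)
{
    return HEAD_SIZE + (off_t) blkno * ENTRY_SIZE;
}

/* Offset of a compressed data block: the data area starts at a fixed
 * position because the address area's size never changes. */
static off_t
data_offset(uint32_t compressed_blkno)
{
    return HEAD_SIZE + (off_t) PAGES_PER_SEGMENT * ENTRY_SIZE
         + (off_t) compressed_blkno * COMPRESS_BLCKSZ;
}

int
main(void)
{
    printf("entry for page 2 at %lld, data block 1 at %lld\n",
           (long long) addr_offset(2), (long long) data_offset(1));
    return 0;
}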
> - Maybe the compressed and overflow table could become bloated somehow,
>   which would require a vacuuming implementation and add to the
>   complexity of the implementation?

Vacuuming is what I try to avoid. As I explained in the first email, even
without vacuum, bloat should not become a serious problem:

>> However, fragmentation will only appear in scenarios where the
>> compressed size of the same block changes greatly and frequently.
>> ...
>> And no matter how severe the fragmentation, the total space occupied by
>> the compressed table cannot be larger than that of the original table
>> before compression.

Best Regards
Chen Huajun