On 01/29/2014 02:21 PM, Amit Kapila wrote:
On Wed, Jan 29, 2014 at 3:41 PM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
For example, if the new tuple is an exact copy of the old tuple,
except for one additional byte in the beginning, the algorithm would fail to
recognize that. It would be good to add a test case like that in the test
suite.

You can skip bytes when building the history table, or when finding matches,
but not both. Or you could skip N bytes, and then check for matches for the
next four bytes, then skip again and so forth, as long as you always check
four consecutive bytes (because the hash function is calculated from four
bytes).

Can we do something like:
Build Phase
a. Calculate the hash and add an entry to the history table every 4 bytes.

Match Phase
a. Calculate the hash in a rolling fashion and try to find a match at every byte.

Sure, that'll work. However, I believe it's cheaper to add entries to the history table at every byte, and check for a match every 4 bytes. I think you'll get more or less the same level of compression either way, but adding to the history table is cheaper than checking for matches, and we'd rather do the cheap thing more often than the expensive thing.
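
Roughly, the division of work could look like the sketch below. This is an illustration only, not code from the patch; hist_add(), find_match(), emit_match() and emit_literals() are hypothetical stand-ins for the real history-table and output routines:

/* Hypothetical helpers, standing in for the patch's history/output code. */
static void hist_add(const char *p);			/* hash 4 bytes at p, insert */
static int	find_match(const char *p);			/* match length, 0 if none */
static void emit_match(const char *p, int len);
static void emit_literals(const char *p, int n);

static void
delta_encode(const char *history, int hlen, const char *source, int slen)
{
	const char *hp;
	const char *sp = source;
	const char *send = source + slen;

	/* Cheap step, done at every byte: add a history entry. */
	for (hp = history; hp + 4 <= history + hlen; hp++)
		hist_add(hp);

	/* Expensive step, done every 4 bytes: probe for a match. */
	while (sp + 4 <= send)
	{
		int			matchlen = find_match(sp);

		if (matchlen > 0)
		{
			emit_match(sp, matchlen);
			sp += matchlen;
		}
		else
		{
			emit_literals(sp, 4);
			sp += 4;
		}
	}
	if (sp < send)
		emit_literals(sp, (int) (send - sp));	/* trailing < 4 bytes */
}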

b. When a match is found, then skip only in chunks, something like what I was
     doing in the find-match function:
+
+		/* consider only complete chunk matches. */
+		if (history_chunk_size == 0)
+			thislen += PGRB_MIN_CHUNK_SIZE;
+	}

Will this address the concern?

Hmm, so when checking whether a match is truly a match, you compare the strings four bytes at a time rather than byte by byte? That might work, but I don't think that's a hot spot currently. In the profiling I did, with a "nothing matches" test case, all the cycles were spent building the history table and finding matches. Finding out how long a match is was negligible. Of course, it might be a different story with input where the encoding helps and you have matches, but I think we were doing pretty well in those cases already.
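
For what it's worth, counting only complete four-byte chunks when measuring a match could look something like this (an illustrative sketch, not code from the patch; match_length() is a made-up name):

#include <string.h>

static int
match_length(const char *a, const char *b, int maxlen)
{
	int			len = 0;

	/* count only complete 4-byte chunks, as in the snippet quoted above */
	while (len + 4 <= maxlen && memcmp(a + len, b + len, 4) == 0)
		len += 4;

	return len;
}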

The main reason to process in chunks as much as possible is to save
CPU cycles. For example, if we build the hash table byte by byte, then
even in the best case, where most of the tuple has a match, there will
be noticeable overhead from building the hash table.

Hmm. One very simple optimization we could do is to just compare the two strings byte by byte, before doing anything else, to find any common prefix they might have. Then output a tag for the common prefix, and run the normal algorithm on the rest of the strings. In many real-world tables, the first 1-2 columns are a key that never changes, so that might work pretty well in practice. Maybe it would also be worthwhile to do the same for any common suffix the tuples might have.
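
A minimal sketch of that idea (names are made up for illustration; the real WAL record would carry the prefix and suffix lengths as dedicated fields):

static void
find_common_affixes(const char *oldp, int oldlen,
					const char *newp, int newlen,
					int *prefixlen, int *suffixlen)
{
	int			minlen = (oldlen < newlen) ? oldlen : newlen;
	int			pre = 0;
	int			suf = 0;

	/* common prefix, e.g. leading key columns that never change */
	while (pre < minlen && oldp[pre] == newp[pre])
		pre++;

	/* common suffix, not overlapping the prefix */
	while (suf < minlen - pre &&
		   oldp[oldlen - 1 - suf] == newp[newlen - 1 - suf])
		suf++;

	*prefixlen = pre;
	*suffixlen = suf;
}

The delta algorithm would then only need to run on the middle parts, oldp + prefixlen .. oldlen - suffixlen versus newp + prefixlen .. newlen - suffixlen.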

That would fail to find matches where you e.g. update the last column to have the same value as the first column, and change nothing else, but that's ok. We're not aiming for the best possible compression, just trying to avoid WAL-logging data that wasn't changed.

Here, during the match phase, I think we can avoid copying literal bytes
until a match is found; that can save cycles in cases where the old and
new tuples are mostly different.

I think the extra ifs in the loop will actually cost you more cycles than you save. You could perhaps have two copies of the main match-finding loop, though. First, loop without outputting anything, until you find the first match. Then, output everything up to that point as literals. Then fall into the second loop, which outputs any non-matches byte by byte.
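
Something along these lines (illustration only; find_match() and emit_literals()/emit_match() are the same hypothetical stand-ins as in the earlier sketch, and real code would carry the match length over instead of recomputing it at the boundary):

static void
encode(const char *source, int slen)
{
	const char *sp = source;
	const char *send = source + slen;

	/*
	 * First loop: scan without emitting anything until the first match.
	 * find_match() is assumed to return 0 near the end of the input.
	 */
	while (sp < send && find_match(sp) == 0)
		sp++;

	/* Everything before the first match becomes one run of literals. */
	emit_literals(source, (int) (sp - source));

	/* Second loop: emit matches as tags, non-matches byte by byte. */
	while (sp < send)
	{
		int			matchlen = find_match(sp);

		if (matchlen > 0)
		{
			emit_match(sp, matchlen);
			sp += matchlen;
		}
		else
		{
			emit_literals(sp, 1);
			sp++;
		}
	}
}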

- Heikki

