On 13 June 2013 10:35, Dean Rasheed <dean.a.rash...@gmail.com> wrote: > Hi, > > Attached is a patch implementing a new aggregate function md5_agg() to > compute the aggregate MD5 sum across a number of rows. This is > something I've wished for a number of times. I think the primary use > case is to do a quick check that 2 tables, possibly on different > servers, contain the same data, using a query like > > SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo; > > or > > SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;
There seem to be 2 separate directions that this could go, which really meet different requirements: 1). Produce an unordered sum for SQL to compare 2 tables regardless of the order in which they are scanned. A possible approach to this might be something like an aggregate md5_total(text/bytea) returns text that returns the sum of the md5 values of each input value, treating each md5 value as an unsigned 128-bit integer, and then producing the hexadecimal representation of the final sum. This should out-perform a solution based on numeric addition, and in typical cases, the result wouldn't be much longer than a regular md5 sum, and so would be easy to eyeball for differences. 2). Produce an ordered MD5 sum compatible with COPY, whose result would match that of running unix md5sum on the COPY output. Given all the possible COPY options that would affect the result (format, delimiters, headers, quoting, escaping, ...), I think that such a thing would only reasonably be possible as an extension to the COPY command itself. I guess in its simplest form this would just be a new option "MD5" to COPY that would cause it to pipe its output to the md5 aggregator and then send the final sum to the COPY destination at the end (e.g., "COPY foo TO STDOUT MD5" would produce the ordered MD5 sum of the data in foo). I still think my original md5_agg() has its uses, since what it produces is comparable with external md5 sums, and is directly available to SQL, but (1) is probably the most useful for quickly comparing 2 tables. I'm much less convinced about the value of (2), but on the face of it, it doesn't seem like it would be hard to implement. Thoughts? Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers