Thomas,
I only meant the storage requirements.
Datatype 'text'
http://www.postgresql.org/docs/8.0/interactive/datatype-character.html:
"Long strings are compressed by the system automatically, so the
physical requirement on disk may be less."
Datatype 'binary'
http://www.postgresql.org/docs/8.0/interactive/datatype-binary.html says
nothing about compression, probably because the size-reduction/CPU-cost
trade-off is much worse than for text fields.
Unfortunately Moore's Law doesn't apply to disc I/O; throughput improves
only slightly every year, so I/O becomes more and more of a bottleneck.
That's why it's a good idea to spend more CPU to lower I/O (compression).
Both "text" and "bytea" are so-called "TOAST" types in PostgreSQL.
TOAST types have some flexibility in storage strategy; have a look at
http://www.postgresql.org/docs/8.0/interactive/storage-toast.html
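In isolation, switching a column's strategy looks like this (a minimal
sketch using the same PyGreSQL interface as the attached script; the
database, table and column names are hypothetical):

    import pg

    # hypothetical names: database "dbmail", table "messages", column "body"
    conn = pg.connect('dbmail')
    # EXTENDED (the default for text/bytea) allows compression;
    # EXTERNAL keeps long values out of line but uncompressed.
    conn.query("ALTER TABLE messages ALTER COLUMN body SET STORAGE EXTERNAL")
    conn.close()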
I just did a little experiment to try it out: inserting long,
compressible strings into sql_ascii and unicode databases, using the
storage strategies "extended" (i.e. compressed) and "external" (not
compressed):
    encoding    column  storage   BEFORE  AFTER
    sql_ascii   bytea   extended    4600   4616
    sql_ascii   bytea   external    4600   5660
    sql_ascii   text    extended    4600   4616
    sql_ascii   text    external    4600   5660
    unicode     bytea   extended    4600   4616
    unicode     bytea   external    4600   5660
    unicode     text    extended    4600   4616
    unicode     text    external    4600   5660
Numbers are PostgreSQL data directory sizes in KB, measured before and
after the inserts. These are for PostgreSQL 7.4; the 8.0 numbers are
similar. The script is attached for reference.
I thought about fulltext search engines like tsearch2; they only work on
text fields.
A normal SELECT on a mailbox would be a really bad idea, because the
database would have to read all message bodies from disc. My mailbox is
10 GB, and that's a lot of I/O :)
Haven't used tsearch before, but it could be a factor. If converting to
UTF-8, another consideration would be whether dbmail will be able to
unambiguously round-trip the text back to the original encoding,
byte-for-byte, e.g. to preserve digital signatures. Pathological
example: the original message is in UTF-8, with
"Content-Transfer-Encoding: 8bit", but contains an invalid UTF-8
sequence. Not sure what MTAs typically do in those cases, if anything.
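To make the round-trip problem concrete, here is a minimal sketch in
Python 2 (the same dialect as the attached script; the byte string is a
made-up example):

    # '\xfc' is a latin-1 u-umlaut; this byte sequence is not valid UTF-8.
    raw = 'f\xfcr'
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        # any lossy repair changes the bytes, so a byte-for-byte round
        # trip (and any signature over the body) is lost
        repaired = raw.decode('utf-8', 'replace').encode('utf-8')
        print repaired == raw  # prints False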
Robert
import pg, os

pgVer = '7.4'

def createAndPopulateDb(storageMechanismName, encoding, columnType):
    print ' ', storageMechanismName
    dbName = "compression_test_" + storageMechanismName + "_" + encoding + "_" + columnType
    conn.query('CREATE DATABASE ' + dbName + " WITH ENCODING '" + encoding + "'")
    dbOid = conn.query("SELECT oid FROM pg_database WHERE datname='" + dbName + "'").getresult()[0][0]
    conn2 = pg.connect(dbName)
    conn2.query("CREATE TABLE foo (bar " + columnType + ")")
    # each database lives in a subdirectory of base/ named after its OID
    dbDirName = '/var/lib/postgresql/' + pgVer + '/main/base/' + str(dbOid)
    print ' BEFORE:', os.popen('du -ks ' + dbDirName).read().split()[0]
    conn2.query("ALTER TABLE foo ALTER COLUMN bar SET storage " + storageMechanismName)
    # insert ten highly compressible 100 kB strings
    for i in range(10):
        conn2.query("INSERT INTO foo VALUES ('" + ' '*100000 + "')")
    print ' AFTER :', os.popen('du -ks ' + dbDirName).read().split()[0]
    conn2.close()

conn = pg.connect('template1')
for encoding in ['sql_ascii', 'unicode']:
    print encoding
    for columnType in ['bytea', 'text']:
        print ' ', columnType
        createAndPopulateDb('extended', encoding, columnType)  # compression
        createAndPopulateDb('external', encoding, columnType)  # no compression
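(Note: the script measures the database's on-disk size directly with du,
so it needs filesystem read access to the PostgreSQL data directory; the
Debian-style path is hard-coded in dbDirName.)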