Thomas,
I only meant the storage requirements.
Datatype 'text'
http://www.postgresql.org/docs/8.0/interactive/datatype-character.html:
"Long strings are compressed by the system automatically, so the
physical requirement on disk may be less."
Datatype 'binary'
http://www.postgresql.org/docs/8.0/interactive/datatype-binary.html says
nothing about compression, probably because the size-reduction/CPU-cost
trade-off is much worse than for text fields.
Unfortunately Moore's Law doesn't apply to disc I/O; throughput improves
only slightly every year, so I/O becomes more and more of a bottleneck.
That's why it's a good idea to spend more CPU to lower I/O (compression).
Both "text" and "bytea" are so-called "TOAST" types in PostgreSQL.
TOAST types have some flexibility in storage strategy; have a look at
http://www.postgresql.org/docs/8.0/interactive/storage-toast.html
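In isolation, switching a column's strategy looks like this (a minimal
sketch using the same PyGreSQL interface as the attached script; the
database, table and column names are hypothetical):

    import pg

    # hypothetical names: database "dbmail", table "messages", column "body"
    conn = pg.connect('dbmail')
    # EXTENDED (the default for text/bytea) allows compression;
    # EXTERNAL keeps long values out of line but uncompressed.
    conn.query("ALTER TABLE messages ALTER COLUMN body SET STORAGE EXTERNAL")
    conn.close()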
I just did a little experiment to try it out: inserting long,
compressible strings into sql_ascii and unicode databases, using the
storage strategies "extended" (i.e. compressed) and "external" (not
compressed):
    encoding    column  storage   BEFORE  AFTER
    sql_ascii   bytea   extended    4600   4616
    sql_ascii   bytea   external    4600   5660
    sql_ascii   text    extended    4600   4616
    sql_ascii   text    external    4600   5660
    unicode     bytea   extended    4600   4616
    unicode     bytea   external    4600   5660
    unicode     text    extended    4600   4616
    unicode     text    external    4600   5660
Numbers are PostgreSQL data directory sizes in KB, measured before and
after the inserts. These are for PostgreSQL 7.4; the 8.0 numbers are
similar. The script is attached for reference.
I thought about fulltext search engines like tsearch2; they only work on
text fields.
A normal SELECT on a mailbox would be a really bad idea, because the
database would have to read all message bodies from disc. My mailbox is
10 GB, and that's a lot of I/O :)
Haven't used tsearch before, but it could be a factor. If converting to
UTF-8, another consideration would be whether dbmail will be able to
unambiguously round-trip the text back to the original encoding,
byte-for-byte, e.g. to preserve digital signatures. Pathological
example: the original message is in UTF-8, with
"Content-Transfer-Encoding: 8bit", but contains an invalid UTF-8
sequence. Not sure what MTAs typically do in those cases, if anything.
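To make the round-trip problem concrete, here is a minimal sketch in
Python 2 (the same dialect as the attached script; the byte string is a
made-up example):

    # '\xfc' is a latin-1 u-umlaut; this byte sequence is not valid UTF-8.
    raw = 'f\xfcr'
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        # any lossy repair changes the bytes, so a byte-for-byte round
        # trip (and any signature over the body) is lost
        repaired = raw.decode('utf-8', 'replace').encode('utf-8')
        print repaired == raw  # prints False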
Robert
import pg, os

pgVer = '7.4'

def createAndPopulateDb(storageMechanismName, encoding, columnType):
    print ' ', storageMechanismName
    dbName = "compression_test_" + storageMechanismName + "_" + encoding + "_" + columnType
    conn.query('CREATE DATABASE ' + dbName + " WITH ENCODING '" + encoding + "'")
    dbOid = conn.query("SELECT oid FROM pg_database WHERE datname='" + dbName + "'").getresult()[0][0]
    conn2 = pg.connect(dbName)
    conn2.query("CREATE TABLE foo (bar " + columnType + ")")
    # each database lives in a subdirectory of base/ named after its OID
    dbDirName = '/var/lib/postgresql/' + pgVer + '/main/base/' + str(dbOid)
    print ' BEFORE:', os.popen('du -ks ' + dbDirName).read().split()[0]
    conn2.query("ALTER TABLE foo ALTER COLUMN bar SET storage " + storageMechanismName)
    # insert ten highly compressible 100 kB strings
    for i in range(10):
        conn2.query("INSERT INTO foo VALUES ('" + ' '*100000 + "')")
    print ' AFTER :', os.popen('du -ks ' + dbDirName).read().split()[0]
    conn2.close()

conn = pg.connect('template1')
for encoding in ['sql_ascii', 'unicode']:
    print encoding
    for columnType in ['bytea', 'text']:
        print ' ', columnType
        createAndPopulateDb('extended', encoding, columnType)  # compression
        createAndPopulateDb('external', encoding, columnType)  # no compression
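(Note: the script measures the database's on-disk size directly with du,
so it needs filesystem read access to the PostgreSQL data directory; the
Debian-style path is hard-coded in dbDirName.)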