[Dbmail-dev] Re: [DBMail 0000754]: single instance storage for headervalues

Michael Monnerie Wed, 04 Feb 2009 17:34:32 -0800

Hello, now that I'm subscribed to the -dev list also I want to greet 
everybody, and resend my message to Niki and Paul:

On Dienstag 03 Februar 2009 Niki Guldbrand wrote:
> /me waves his hand

Hi Niki! Sorry I'm not on -dev, so didn't know you.

> git clone git://git.dbmail.eu/paul/dbmail

I did that now, and have some questions:
dbmail_23=# \d dbmail_headervalue
                              Tabelle »public.dbmail_headervalue«
   Spalte    |      Typ       |                            Attribute
-------------+----------------+-----------------------------------------------------------------
 id          | bigint         | not null default 
nextval('dbmail_headervalue_id_seq'::regclass)
 hash        | character(256) | not null
 headervalue | text           | not null default ''::text
Indexe:
    »dbmail_headervalue_pkey« PRIMARY KEY, btree (id)
    »dbmail_headervalue_1« btree ("substring"(headervalue, 0, 255))
    »dbmail_headervalue_2« btree (hash)

Is "hash" really a hash of the value? I guess that's for performance 
only, so I immediately investigated and think I have another proposal, 
that should be faster and save disk space. But, on the downside, it's 
not completely normalized anymore. Still, I think it could be worth it, 
depending on how often it is used in the code:

First, I tested my "distribution" of headervalue lengths:

OVERALL:
select count(1) from dbmail_headervalue;
 2832434

select count(1) from dbmail_headervalue where length(headervalue)>200 
AND length(headervalue)<256; (skipping other SELECTs later)
LEN 200..255: 11831
LEN 100..200: 74519
LEN 51..100: 438533
LEN 20..50: 1929089
LEN 0..20: 319328

LEN >255: 20202

So most headervalues are <100 characters long I would say, and only 
0,71% (!) of all headervalues are >255 chars. So the table could be 
changed to:

 id          | bigint         | not null default 
nextval('dbmail_headervalue_id_seq'::regclass)
 headervalue | character varying (255) | default null
 headervalue_full | text           | default null
Indexe:
    »dbmail_headervalue_pkey« PRIMARY KEY, btree (id)
    »dbmail_headervalue_1« btree (headervalue)

And then fill the real headervalue into "headervalue" if it's <=255 
chars, and only to headervalue_full if it's >255 chars long.

What changes?
*) no more hash calculation (CPU save) for each headervalue_entry on 
insert
*) save 256 hash bytes per headervalue entry (for me with 2.8 mio. 
entries thats about 700MB on disk).
*) saves one complete index (over "hash"), roughly 700MB again, maybe 
more
*) current SELECT change:
a) no more ...where substring(headervalue,0,255) ... needed
b) SELECT ... WHERE headervalue like '%search%' becomes
SELECT ... WHERE headervalue like '%search%' or headervalue_full like 
'%search%'
*) current UPDATE change:
if (strlen(hval)) > 255 {
UPDATE ... SET headervalue_long=hval,headervalue=NULL
} else {
UPDATE ... SET headervalue=hval,headervalue_long=NULL
}
Like that, always only one of headervalue or headervalue_long is set, 
saving disk space and using the index whenever possible, and for those 
0,7% of all entries a full scan is needed (but that's needed now anyway 
if you search for %search%, as you can't use an index then).

Sorry that I have this idea only now, but it could really save lots of 
disk space and CPU time, and improve speed even further. I don't know 
the amount of work that would mean for you, but I can imagine it's 
relatively straightforward.

mfg zmi
-- 
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0660 / 415 65 31                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net                  Key-ID: 1C1209B4

signature.asc
Description: This is a digitally signed message part.

_______________________________________________
Dbmail-dev mailing list
Dbmail-dev@dbmail.org
http://twister.fastxs.net/mailman/listinfo/dbmail-dev

[Dbmail-dev] Re: [DBMail 0000754]: single instance storage for headervalues

Reply via email to