Hi Paul,

it works great in PostgreSQL if you convert the bytea datatype to text
before comparing; then it is possible to use ILIKE:
SELECT m.message_idnr, k.messageblk
FROM dbmail_messageblks k
JOIN dbmail_physmessage p ON k.physmessage_id = p.id
JOIN dbmail_messages m ON p.id = m.physmessage_id
WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
GROUP BY m.message_idnr, k.messageblk
HAVING encode(k.messageblk::bytea,'escape') ILIKE '%Kündigung%'
The performance cost of this extra operation is drastic but okay (still
better than the "old" solution). Here are some statistics from my
test system (nearly 2500 real mails for testing; every statement was run
10 times and I calculated the average):
*case sensitive:*
SELECT m.message_idnr, k.messageblk
FROM dbmail_messageblks k
JOIN dbmail_physmessage p ON k.physmessage_id = p.id
JOIN dbmail_messages m ON p.id = m.physmessage_id
WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
GROUP BY m.message_idnr, k.messageblk
HAVING k.messageblk LIKE '%Kündigung%'
Result: 73 records, 488.979 ms
*case sensitive (LIKE, but with encoding, to see the encoding cost):*
SELECT m.message_idnr, k.messageblk
FROM dbmail_messageblks k
JOIN dbmail_physmessage p ON k.physmessage_id = p.id
JOIN dbmail_messages m ON p.id = m.physmessage_id
WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
GROUP BY m.message_idnr, k.messageblk
HAVING encode(k.messageblk::bytea,'escape') LIKE '%Kündigung%'
Result: 73 records, 770.314 ms
*case insensitive (ILIKE + encoding):*
SELECT m.message_idnr, k.messageblk
FROM dbmail_messageblks k
JOIN dbmail_physmessage p ON k.physmessage_id = p.id
JOIN dbmail_messages m ON p.id = m.physmessage_id
WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
GROUP BY m.message_idnr, k.messageblk
HAVING encode(k.messageblk::bytea,'escape') ILIKE '%Kündigung%'
Result: 81 records, 883.023 ms
(You don't need the ::bytea cast; it's fine if you simply use
encode(k.messageblk,'escape').)
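For reference, a minimal sketch of the same case-insensitive statement in
that simplified form (identical to the ILIKE query above, only the encode()
call changes):

SELECT m.message_idnr, k.messageblk
FROM dbmail_messageblks k
JOIN dbmail_physmessage p ON k.physmessage_id = p.id
JOIN dbmail_messages m ON p.id = m.physmessage_id
WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
GROUP BY m.message_idnr, k.messageblk
HAVING encode(k.messageblk,'escape') ILIKE '%Kündigung%'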
I tested my changes in the source and now get the case-insensitive
results back in my Thunderbird in less than 1 second; yesterday (with
the old method) it took 32 seconds!
Kind regards,
Jochen
> Hi Paul,
>
> the idea looks good, but PostgreSQL is a little bit more fussy than
> MySQL, so your SQL failed:
> 1. for your SELECT, PostgreSQL needs a GROUP BY
> 2. at the moment there is no case-insensitive operator for the bytea
> datatype (but I'm searching for a workaround)
>
> this statement works (case-sensitive):
> SELECT m.message_idnr, k.messageblk
> FROM dbmail_messageblks k
> JOIN dbmail_physmessage p ON k.physmessage_id = p.id
> JOIN dbmail_messages m ON p.id = m.physmessage_id
> WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
> GROUP BY m.message_idnr, k.messageblk
> HAVING k.messageblk LIKE '%Kündigung%'
>
> The speedup is very good!
>
> Kind regards,
>
> Jochen
>
>> Jochen, Aaron, and all sql gurus:
>>
>> I've done this (tiny, trivial, minor) change and have pushed it onto
>> svn-trunk. Dbmail now does the body search directly in the database
>> using HAVING SQL_INSENSITIVE_LIKE '%somestring%' against the messageblks.
>>
>> A massive improvement to body searches, both in terms of performance
>> and compliance, I believe.
>>
>>
>>
>> shoot me. I need a break.
>>
>> enjoy.
>>
>>
>>
>> Paul J Stevens wrote:
>>
>>
>>> Jochen,
>>>
>>> Just to keep you up to date though. I've been playing a little with the
>>> non-indexed HAVING queries. Looks like a massive improvement over the
>>> current situation (surprise! :-\ not!)
>>>
>>> something like
>>>
>>> select m.message_idnr,k.messageblk
>>> from dbmail_messageblks k
>>> join dbmail_physmessage p on k.physmessage_id = p.id
>>> join dbmail_messages m on p.id = m.physmessage_id
>>> join dbmail_mailboxes b on m.mailbox_idnr = b.mailbox_idnr
>>> join dbmail_users u on b.owner_idnr = u.user_idnr
>>> where u.userid='pablo' and b.name = 'list/dbmail'
>>> having k.messageblk like '%Kündigung%'
>>>
>>> where of course the query will be tweaked to match the engine. This
>>> example is for MySQL.
>>>
>>> Major speedup, very simple solution. Now on to code.
>>>
>>>
>>> Paul J Stevens wrote:
>>>
>>>
>>>> Jochen Schroer wrote:
>>>>
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> these multiple conversions decrease the performance of body search
>>>>> again; because at the moment the search process has to load every
>>>>> message part of every message, the performance is not really good.
>>>>> I'm thinking a little bit about a universal solution, but I have no
>>>>> idea up to now.
>>>>> (For me, with PostgreSQL, I can add a new messagetext table, fill it
>>>>> with a trigger, and search with simple SELECTs on this table, but that
>>>>> is no universal solution for PostgreSQL, MySQL and SQLite.)
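>>>>> (A minimal sketch of that PostgreSQL-only idea, with hypothetical
>>>>> table, function and trigger names, just to illustrate; not tested
>>>>> against the real schema:)
>>>>>
>>>>> -- plain-text copy of every block, maintained by a trigger
>>>>> CREATE TABLE dbmail_messagetext (
>>>>>     physmessage_id BIGINT NOT NULL,
>>>>>     messagetext    TEXT   NOT NULL
>>>>> );
>>>>>
>>>>> CREATE FUNCTION copy_messageblk_text() RETURNS trigger AS $$
>>>>> BEGIN
>>>>>     INSERT INTO dbmail_messagetext (physmessage_id, messagetext)
>>>>>     VALUES (NEW.physmessage_id, encode(NEW.messageblk, 'escape'));
>>>>>     RETURN NEW;
>>>>> END;
>>>>> $$ LANGUAGE plpgsql;
>>>>>
>>>>> CREATE TRIGGER messageblk_to_text
>>>>>     AFTER INSERT ON dbmail_messageblks
>>>>>     FOR EACH ROW EXECUTE PROCEDURE copy_messageblk_text();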
>>>>>
>>>>> Have you thought about a solution where you save the message twice
>>>>> when a mail is inserted into the system? Once in the existing format,
>>>>> and additionally in another table with converted (UTF-8?) content
>>>>> that you can use for search?
>>>>>
>>>>>
>>>> I've designed a solution to address this problem. Check the wiki:
>>>>
>>>> http://www.dbmail.org/dokuwiki/doku.php?id=bodysearch
>>>>
>>>> Basically, you can't use indexed searches on blobs. So the idea is to
>>>> store only the /words/ from each attachment in separate records. Once
>>>> that is done, those records /can/ be searched. Even non-indexed searches
>>>> directly on the database (i.e. using HAVING) would probably be an
>>>> improvement over the current situation.
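>>>> (A rough sketch of that idea, with a hypothetical table name and
>>>> columns rather than the actual schema on the wiki page, just to
>>>> illustrate:)
>>>>
>>>> -- one row per word per message, searchable as plain text
>>>> CREATE TABLE dbmail_messagewords (
>>>>     physmessage_id BIGINT       NOT NULL,
>>>>     word           VARCHAR(255) NOT NULL
>>>> );
>>>>
>>>> -- search the words instead of the bytea blobs
>>>> SELECT DISTINCT m.message_idnr
>>>> FROM dbmail_messagewords w
>>>> JOIN dbmail_messages m ON w.physmessage_id = m.physmessage_id
>>>> WHERE m.mailbox_idnr = '1'
>>>>   AND w.word LIKE '%Kündigung%';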
>>>>
>>>> If you have a better idea, please share.
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
--
ECS-Solution GmbH
Raiffeisenstr. 26
67817 Imsbach
Germany
Web: http://www.ecs-solution.de
Tel: +49 (6302) 609 71-0
Fax: +49 (6302) 609 71-1