Hi Paul,

it works great in PostgreSQL if you convert the bytea datatype to text
before comparing; then it is possible to use ILIKE:
SELECT m.message_idnr, k.messageblk
FROM dbmail_messageblks k
JOIN dbmail_physmessage p ON k.physmessage_id = p.id
JOIN dbmail_messages m ON p.id = m.physmessage_id
WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
GROUP BY m.message_idnr, k.messageblk
HAVING encode(k.messageblk::bytea,'escape') ILIKE '%Kündigung%'
The performance cost of this extra operation is drastic but okay (still
better than the "old" solution). Here are some statistics from my
test system (nearly 2500 real mails for testing; every statement was run
10 times and I calculated the average):
*case sensitive:*
SELECT m.message_idnr, k.messageblk
FROM dbmail_messageblks k
JOIN dbmail_physmessage p ON k.physmessage_id = p.id
JOIN dbmail_messages m ON p.id = m.physmessage_id
WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
GROUP BY m.message_idnr, k.messageblk
HAVING k.messageblk LIKE '%Kündigung%'
Result: 73 records, 488.979 ms
*case sensitive (LIKE, but with encoding, to see the encoding cost):*
SELECT m.message_idnr, k.messageblk
FROM dbmail_messageblks k
JOIN dbmail_physmessage p ON k.physmessage_id = p.id
JOIN dbmail_messages m ON p.id = m.physmessage_id
WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
GROUP BY m.message_idnr, k.messageblk
HAVING encode(k.messageblk::bytea,'escape') LIKE '%Kündigung%'
Result: 73 records, 770.314 ms
*case insensitive (ILIKE + encoding):*
SELECT m.message_idnr, k.messageblk
FROM dbmail_messageblks k
JOIN dbmail_physmessage p ON k.physmessage_id = p.id
JOIN dbmail_messages m ON p.id = m.physmessage_id
WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
GROUP BY m.message_idnr, k.messageblk
HAVING encode(k.messageblk::bytea,'escape') ILIKE '%Kündigung%'
Result: 81 records, 883.023 ms
(You don't need the ::bytea cast; it's fine if you simply use
encode(k.messageblk,'escape').)
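For reference, a minimal sketch of the same case-insensitive statement in
that simplified form (identical to the ILIKE query above, only the encode()
call changes):

SELECT m.message_idnr, k.messageblk
FROM dbmail_messageblks k
JOIN dbmail_physmessage p ON k.physmessage_id = p.id
JOIN dbmail_messages m ON p.id = m.physmessage_id
WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
GROUP BY m.message_idnr, k.messageblk
HAVING encode(k.messageblk,'escape') ILIKE '%Kündigung%'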
I tested my changes in the source and now get the case-insensitive
results back in my Thunderbird in less than 1 second; yesterday (with
the old method) it took 32 seconds!
Kind regards,
Jochen
> Hi Paul,
>
> the idea looks good, but PostgreSQL is a little bit more fussy than
> MySQL, so your SQL failed:
> 1. for your SELECT, PostgreSQL needs a GROUP BY
> 2. at the moment there is no case-insensitive operator for the bytea
> datatype (but I'm searching for a workaround)
>
> this statement works (case-sensitive):
> SELECT m.message_idnr, k.messageblk
> FROM dbmail_messageblks k
> JOIN dbmail_physmessage p ON k.physmessage_id = p.id
> JOIN dbmail_messages m ON p.id = m.physmessage_id
> WHERE mailbox_idnr = '1' AND status IN ('0','1') AND k.is_header = '0'
> GROUP BY m.message_idnr, k.messageblk
> HAVING k.messageblk LIKE '%Kündigung%'
>
> The speedup is very good!
>
> Kind regards,
>
> Jochen
>
>> Jochen, Aaron, and all sql gurus:
>>
>> I've done this (tiny, trivial, minor) change and have pushed it onto
>> svn-trunk. Dbmail now does the body search directly in the database
>> using HAVING SQL_INSENSITIVE_LIKE '%somestring%' against the messageblks.
>>
>> A massive improvement to body searches, both in terms of performance
>> and compliance, I believe.
>>
>>
>>
>> shoot me. I need a break.
>>
>> enjoy.
>>
>>
>>
>> Paul J Stevens wrote:
>>
>>
>>> Jochen,
>>>
>>> Just to keep you up to date though. I've been playing a little with the
>>> non-indexed HAVING queries. Looks like a massive improvement over the
>>> current situation (surprise! :-\ not!)
>>>
>>> something like
>>>
>>> select m.message_idnr,k.messageblk
>>> from dbmail_messageblks k
>>> join dbmail_physmessage p on k.physmessage_id = p.id
>>> join dbmail_messages m on p.id = m.physmessage_id
>>> join dbmail_mailboxes b on m.mailbox_idnr = b.mailbox_idnr
>>> join dbmail_users u on b.owner_idnr = u.user_idnr
>>> where u.userid='pablo' and b.name = 'list/dbmail'
>>> having k.messageblk like '%Kündigung%'
>>>
>>> where of course the query will be tweaked to match the engine. This
>>> example is for MySQL.
>>>
>>> Major speedup, very simple solution. Now on to code.
>>>
>>>
>>> Paul J Stevens wrote:
>>>
>>>
>>>> Jochen Schroer wrote:
>>>>
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> these multiple conversions decrease the performance of body search
>>>>> again; because at the moment the search process has to load every
>>>>> message part of every message, the performance is not really good.
>>>>> I'm thinking a little bit about a universal solution, but I have no
>>>>> idea up to now.
>>>>> (For me, with PostgreSQL, I can add a new messagetext table, fill it
>>>>> with a trigger, and search with simple SELECTs on this table, but that
>>>>> is no universal solution for PostgreSQL, MySQL and SQLite.)
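>>>>> (A minimal sketch of that PostgreSQL-only idea, with hypothetical
>>>>> table, function and trigger names, just to illustrate; not tested
>>>>> against the real schema:)
>>>>>
>>>>> -- plain-text copy of every block, maintained by a trigger
>>>>> CREATE TABLE dbmail_messagetext (
>>>>>     physmessage_id BIGINT NOT NULL,
>>>>>     messagetext    TEXT   NOT NULL
>>>>> );
>>>>>
>>>>> CREATE FUNCTION copy_messageblk_text() RETURNS trigger AS $$
>>>>> BEGIN
>>>>>     INSERT INTO dbmail_messagetext (physmessage_id, messagetext)
>>>>>     VALUES (NEW.physmessage_id, encode(NEW.messageblk, 'escape'));
>>>>>     RETURN NEW;
>>>>> END;
>>>>> $$ LANGUAGE plpgsql;
>>>>>
>>>>> CREATE TRIGGER messageblk_to_text
>>>>>     AFTER INSERT ON dbmail_messageblks
>>>>>     FOR EACH ROW EXECUTE PROCEDURE copy_messageblk_text();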
>>>>>
>>>>> Have you thought about a solution where you save the message twice
>>>>> when a mail is inserted into the system? Once in the existing format,
>>>>> and additionally in another table with converted (UTF-8?) content
>>>>> that you can use for search?
>>>>>
>>>>>
>>>> I've designed a solution to address this problem. Check the wiki:
>>>>
>>>> http://www.dbmail.org/dokuwiki/doku.php?id=bodysearch
>>>>
>>>> Basically, you can't use indexed searches on blobs. So the idea is to
>>>> store only the /words/ from each attachment in separate records. Once
>>>> that is done, those records /can/ be searched. Even non-indexed searches
>>>> directly on the database (i.e. using HAVING) would probably be an
>>>> improvement over the current situation.
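>>>> (A rough sketch of that idea, with a hypothetical table name and
>>>> columns rather than the actual schema on the wiki page, just to
>>>> illustrate:)
>>>>
>>>> -- one row per word per message, searchable as plain text
>>>> CREATE TABLE dbmail_messagewords (
>>>>     physmessage_id BIGINT       NOT NULL,
>>>>     word           VARCHAR(255) NOT NULL
>>>> );
>>>>
>>>> -- search the words instead of the bytea blobs
>>>> SELECT DISTINCT m.message_idnr
>>>> FROM dbmail_messagewords w
>>>> JOIN dbmail_messages m ON w.physmessage_id = m.physmessage_id
>>>> WHERE m.mailbox_idnr = '1'
>>>>   AND w.word LIKE '%Kündigung%';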
>>>>
>>>> If you have a better idea, please share.
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
--
ECS-Solution GmbH
Raiffeisenstr. 26
67817 Imsbach
Germany
Web: http://www.ecs-solution.de
Tel: +49 (6302) 609 71-0
Fax: +49 (6302) 609 71-1