Re: [pmacct-discussion] Questions about the MySQL query strategy

Paolo Lucente Thu, 1 Dec 2005 21:16:07 +0100

Hello Inge,
at first, thank you for your appreciation to the project. Read my comments 
inline !

On Thu, Dec 01, 2005 at 05:34:51PM +0100, Inge Bjørnvall Arnesen wrote:

> 1: It seems like the current approach on cache flush is roughly (in 
> meta-language by creative unfolding of functions from MY_cache_purge):
> 
> LOCK TABLE
> for all cache entries
> do
>       // Maybe the row exists, try update
>       UPDATE TABLE
>       // No it didn't, insert the row instead
>       if update_failed then
>               INSERT TABLE
>       endif
> done
> UNLOCK TABLE
> 
> There is support for the sql_dont_try_update flag, which would be nice if it 
> was a way to
> synchronize with my sql_history periods as it would prevent as manye UPDATE 
> attempts as
> there are INSERT attempts. I can't see how that can be done with NetFlow, 
> however, as flow
> packets may arrive later with more information for the same SQL history 
> epoch. Are there
> ways to do this with Netflow?

'sql_dont_try_update' is meant to do exactly what you've described. By setting 
it accordingly
with 'sql_history' lets you to avoid bunchs of UPDATE queries which are the 
most expensives.
You are right about NetFlow: packets that arrive at a later time might carry 
informations of
some previous timeperiod, thus requiring an UPDATE query; this is because in 
this configuration
you also need the companion directive 'nfacctd_time_new' which basically makes 
'nfacctd' to
generate new timestamps ignoring those enclosed into the NetFlow datagram.

While introducing such time skew to certain (either not-closed or long-lived) 
flows would be
quite invasive, for example, in a forensics analysis context, it is not 
supposed to be vital
into a traffic accounting and aggregation context. 

> 2: In INTERNALS Paolo writes: "Because we, at this moment, don't known if 
> INSERT queries
> would create duplicates, an UPDATE query is launched first and only if no 
> rows are affected,
> then an INSERT query is trapped."
> 
> This is a bit cryptic to me. What is meant by this? A plain INSERT will yield 
> the normal
> duplicate error if the row already exists (given default primary key). A 4.1+ 
> compatible
> INSERT... ON DUPLICATE KEY UPDATE... seems to me to do all that's required 
> and doing so

You are right on this. But that solution introduces just the benefit of 
condensing all in a
single SQL query (no speedups) while breaks backward compatibility with MySQL 
3.23 / 4.0
which are still widely deployed out there. 

Moreover it would invert the default logics from a UPDATE-then-INSERT to a 
INSERT-then-UPDATE.
We have the 'sql_dont_try_update' when we decide to rely just over INSERTs; 
switching to a
default INSERT-then-UPDATE logics will penalize those that need the UPDATEs. 
Instead, actually
those requiring just the INSERTs can activate the 'sql_dont_try_update' 
directive.

> 3: Ultimately performance enhancement is only of interest to me to ensure 
> proper logging
> of all Netflow data. Whether the CPU spends 5% or 50% of its time doing this 
> is irrelevant
> to me. What is not fine is that it locks the table while doing it, thereby 
> preventing
> concurrent analysis of Netflow information.
>
> This brings me to my last question: Why is the table locked? Is it only 
> because of the
> UPDATE/INSERT combination on multi-threading, where a failed UPDATE will 
> yield a failed
> INSERT if the row is inserted by another process in the mean time? Could an 
> INSERT... ON
> DUPLICATE KEY UPDATE... strategy enable us to remove the dreaded locks or are 
> there other
> issues?

Initially LOCKs were there just to ensure table consistency and allow multiple 
plugin to use
the same table (it's not a suggested configuration but ...). Some time after 
that, i've looked 
into MySQL documentation ( 
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html ) and i've
read what follows: "you can obtain better overall performance by adding locks 
around everything
that does more than about five inserts or updates in a row." . The key behind 
this should be
in the index handling. 

So, LOCKs help in gaining some speed while playing around the DB. If you follow 
the INSERT-only
strategy (which i strongly advice for you) keep an eye to the 
'sql_multi_values' directive.
I'm quite sure it will give you even more satisfactions. The multi-values 
tecnique (valid
only for MySQL) is also described in the forementioned MySQL documentation page.

Let me know how things get evolved on your side !

Cheers,
Paolo

Re: [pmacct-discussion] Questions about the MySQL query strategy

Reply via email to