Hello Inge, at first, thank you for your appreciation to the project. Read my comments inline !
On Thu, Dec 01, 2005 at 05:34:51PM +0100, Inge Bjørnvall Arnesen wrote: > 1: It seems like the current approach on cache flush is roughly (in > meta-language by creative unfolding of functions from MY_cache_purge): > > LOCK TABLE > for all cache entries > do > // Maybe the row exists, try update > UPDATE TABLE > // No it didn't, insert the row instead > if update_failed then > INSERT TABLE > endif > done > UNLOCK TABLE > > There is support for the sql_dont_try_update flag, which would be nice if it > was a way to > synchronize with my sql_history periods as it would prevent as manye UPDATE > attempts as > there are INSERT attempts. I can't see how that can be done with NetFlow, > however, as flow > packets may arrive later with more information for the same SQL history > epoch. Are there > ways to do this with Netflow? 'sql_dont_try_update' is meant to do exactly what you've described. By setting it accordingly with 'sql_history' lets you to avoid bunchs of UPDATE queries which are the most expensives. You are right about NetFlow: packets that arrive at a later time might carry informations of some previous timeperiod, thus requiring an UPDATE query; this is because in this configuration you also need the companion directive 'nfacctd_time_new' which basically makes 'nfacctd' to generate new timestamps ignoring those enclosed into the NetFlow datagram. While introducing such time skew to certain (either not-closed or long-lived) flows would be quite invasive, for example, in a forensics analysis context, it is not supposed to be vital into a traffic accounting and aggregation context. > 2: In INTERNALS Paolo writes: "Because we, at this moment, don't known if > INSERT queries > would create duplicates, an UPDATE query is launched first and only if no > rows are affected, > then an INSERT query is trapped." > > This is a bit cryptic to me. What is meant by this? A plain INSERT will yield > the normal > duplicate error if the row already exists (given default primary key). A 4.1+ > compatible > INSERT... ON DUPLICATE KEY UPDATE... seems to me to do all that's required > and doing so You are right on this. But that solution introduces just the benefit of condensing all in a single SQL query (no speedups) while breaks backward compatibility with MySQL 3.23 / 4.0 which are still widely deployed out there. Moreover it would invert the default logics from a UPDATE-then-INSERT to a INSERT-then-UPDATE. We have the 'sql_dont_try_update' when we decide to rely just over INSERTs; switching to a default INSERT-then-UPDATE logics will penalize those that need the UPDATEs. Instead, actually those requiring just the INSERTs can activate the 'sql_dont_try_update' directive. > 3: Ultimately performance enhancement is only of interest to me to ensure > proper logging > of all Netflow data. Whether the CPU spends 5% or 50% of its time doing this > is irrelevant > to me. What is not fine is that it locks the table while doing it, thereby > preventing > concurrent analysis of Netflow information. > > This brings me to my last question: Why is the table locked? Is it only > because of the > UPDATE/INSERT combination on multi-threading, where a failed UPDATE will > yield a failed > INSERT if the row is inserted by another process in the mean time? Could an > INSERT... ON > DUPLICATE KEY UPDATE... strategy enable us to remove the dreaded locks or are > there other > issues? Initially LOCKs were there just to ensure table consistency and allow multiple plugin to use the same table (it's not a suggested configuration but ...). Some time after that, i've looked into MySQL documentation ( http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html ) and i've read what follows: "you can obtain better overall performance by adding locks around everything that does more than about five inserts or updates in a row." . The key behind this should be in the index handling. So, LOCKs help in gaining some speed while playing around the DB. If you follow the INSERT-only strategy (which i strongly advice for you) keep an eye to the 'sql_multi_values' directive. I'm quite sure it will give you even more satisfactions. The multi-values tecnique (valid only for MySQL) is also described in the forementioned MySQL documentation page. Let me know how things get evolved on your side ! Cheers, Paolo
