Hi Folks,
I think I tried to start on this a couple of years ago but never got around
to finishing it.

It's occurred to me, as it may have to many of you, that those of us with
large amounts of incoming data would seriously benefit from "de-duplication"
of log messages.

To that end, here are my thoughts:

First, I would define a duplicate message as any message that matches on the
following:
* hostname
* message
That is to say, if an event comes in from the same host having the same
message, it should be marked as a duplicate.
Sounds easy, right? Maybe not :-)
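
To make the exact-match definition above concrete, here is a minimal Python
sketch. The function names and the sample events are made up for
illustration; they are not part of php-syslog-ng.

```python
from collections import Counter

def dedup_key(host, msg):
    """Exact-match duplicate key: same host and same message text."""
    return (host, msg)

def count_exact_duplicates(events):
    """Count how often each (host, msg) key occurs in an iterable of pairs."""
    return Counter(dedup_key(host, msg) for host, msg in events)

# Hypothetical sample events, for illustration only.
events = [
    ("loghost", "link down on eth0"),
    ("loghost", "link down on eth0"),
    ("otherhost", "link down on eth0"),
]
counts = count_exact_duplicates(events)
```

Here the two "loghost" events share one key, while the same text from
"otherhost" is counted separately.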

This may or may not need to be a "fuzzy" search - some messages may be
duplicates, but may not be *exactly* alike - how do you determine that?
Take, for example, the following messages:

mysql> select * from logs where msg like '%reloadcache%' limit 2 \G
*************************** 1. row ***************************
    host: loghost
facility: cron
priority: info
   level: info
     tag: 4e
datetime: 2008-02-08 00:05:01
 program: /USR/SBIN/CRON
     msg: /USR/SBIN/CRON[5418]: (root) CMD (php /www/php-syslog-ng/scripts/reloadcache.php >> /var/log/php-syslog-ng/reloadcache.log)
     seq: 1986
*************************** 2. row ***************************
    host: loghost
facility: cron
priority: info
   level: info
     tag: 4e
datetime: 2008-02-08 00:10:01
 program: /USR/SBIN/CRON
     msg: /USR/SBIN/CRON[5463]: (root) CMD (php /www/php-syslog-ng/scripts/reloadcache.php >> /var/log/php-syslog-ng/reloadcache.log)
     seq: 1995


Clearly, these are duplicate messages, but they don't match *exactly*
because the process ID in the CRON[nnnn] entry differs.
So we need a way to compute a "fuzzy" match, e.g. 90% of the words in the
message match. That requires some way to take in each message, split it
into tokens, and compare the tokens against those of another message to see
whether there is a 90% match.
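
A rough Python sketch of the token comparison described above. The
position-by-position comparison, the helper names, and the PID
normalization step are my own assumptions, not an existing php-syslog-ng
or syslog-ng feature.

```python
import re

def normalize(msg):
    """Replace a bracketed process ID such as CRON[5418] with a placeholder."""
    return re.sub(r"\[\d+\]", "[N]", msg)

def token_match_ratio(a, b):
    """Fraction of token positions at which two messages agree (0.0 to 1.0)."""
    tokens_a, tokens_b = a.split(), b.split()
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0  # two empty messages are trivially identical
    matches = sum(1 for x, y in zip(tokens_a, tokens_b) if x == y)
    return matches / longest

def is_fuzzy_duplicate(a, b, threshold=0.9):
    """True when the share of matching tokens reaches the threshold."""
    return token_match_ratio(a, b) >= threshold

# The two cron messages from the query output above.
msg_a = ("/USR/SBIN/CRON[5418]: (root) CMD (php "
         "/www/php-syslog-ng/scripts/reloadcache.php >> "
         "/var/log/php-syslog-ng/reloadcache.log)")
msg_b = msg_a.replace("[5418]", "[5463]")
```

One thing this shows: each message splits into only seven tokens, so the
single differing token gives 6/7 (about 86%) and a strict 90% threshold
would miss this pair. Normalizing the PID first, or lowering the threshold
for short messages, would catch it.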

Once we figure out that part, we then need to perform the deduplication.

I believe there are 3 possible ways to accomplish this:
1. Through syslog-ng
- There's an advantage here in that we could possibly dedup messages before
they even get into the database, saving processing cycles.
- I did some searching but found nothing worthwhile. Does anyone know if it
can do this?


2. In MySQL itself
- Does anyone know how to split the message field up and do a comparison
within MySQL?

3. As a script or some type of update after the data is inserted.
- I would save this as a last resort since the objective here is to never
allow the duplicates into the database to begin with.
That said, I do believe we could just create a script that can do what we
need.
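
Whichever option we pick, the core of a filter that sits in front of the
database could look roughly like the Python sketch below. It assumes events
arrive in time order as (epoch_seconds, host, msg) tuples; the function
name and the 600-second window are arbitrary choices for illustration.

```python
def suppress_repeats(events, window_seconds=600):
    """Yield only events whose (host, msg) key was not emitted within the window.

    events: iterable of (epoch_seconds, host, msg) in arrival order.
    """
    last_emitted = {}  # key -> timestamp of the last event let through
    for ts, host, msg in events:
        key = (host, msg)
        previous = last_emitted.get(key)
        if previous is None or ts - previous >= window_seconds:
            last_emitted[key] = ts
            yield ts, host, msg

# Hypothetical sample stream, for illustration only.
stream = [
    (0, "loghost", "reloadcache ran"),
    (300, "loghost", "reloadcache ran"),
    (300, "otherhost", "reloadcache ran"),
    (600, "loghost", "reloadcache ran"),
]
kept = list(suppress_repeats(stream))
```

Note that this simply drops the repeats, so the count is lost unless the
filter also tracks it, and the dictionary grows with the number of distinct
messages unless old keys are expired.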


Advantage of doing this:
On a large database with millions of rows, I could easily see cutting out
1/3 to 2/3 of the rows due to duplicates.
We would add a column called "count" to the table. Once we identify a
duplicate row, we increment that field (n+1) on the row we keep and delete
the duplicate row.

This means that instead of 10,000 rows with the same message, we now have a
single row with a count field of 10000.
And that means a *much* faster database and frontend.
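
Here is a small Python sketch of the count-and-delete step. It uses an
in-memory SQLite database purely as a stand-in for MySQL, a cut-down
version of the logs table (seq, host, msg plus the new count column), and
exact matching on host and msg; all of that is assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL database
conn.execute(
    "CREATE TABLE logs (seq INTEGER PRIMARY KEY, host TEXT, msg TEXT, "
    "count INTEGER DEFAULT 1)"
)
conn.executemany(
    "INSERT INTO logs (seq, host, msg) VALUES (?, ?, ?)",
    [
        (1, "loghost", "reloadcache ran"),
        (2, "loghost", "reloadcache ran"),
        (3, "loghost", "reloadcache ran"),
        (4, "otherhost", "reloadcache ran"),
    ],
)

def collapse_duplicates(conn):
    """Fold rows sharing (host, msg) into the earliest row and sum their counts."""
    groups = conn.execute(
        "SELECT host, msg, MIN(seq), SUM(count) FROM logs "
        "GROUP BY host, msg HAVING COUNT(*) > 1"
    ).fetchall()
    for host, msg, keep_seq, total in groups:
        # Store the combined count on the row we keep...
        conn.execute("UPDATE logs SET count = ? WHERE seq = ?", (total, keep_seq))
        # ...and delete every other row in the group.
        conn.execute(
            "DELETE FROM logs WHERE host = ? AND msg = ? AND seq <> ?",
            (host, msg, keep_seq),
        )
    conn.commit()

collapse_duplicates(conn)
rows = conn.execute(
    "SELECT seq, host, msg, count FROM logs ORDER BY seq"
).fetchall()
```

After the call, the three "loghost" rows are a single row with a count of
3, and the "otherhost" row is untouched. For fuzzy duplicates, the grouping
would have to run on a normalized form of the message instead of the raw
msg column.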

Your thoughts are appreciated!


-- 
______________________________________________________________

Clayton Dukes
______________________________________________________________
_______________________________________________
Php-syslog-ng-support mailing list
https://lists.sourceforge.net/lists/listinfo/php-syslog-ng-support
