On Sunday 29 August 2004 HH:58:18, Jenda Krynicky wrote:
> From: Philipp Traeder <[EMAIL PROTECTED]>
>
> > You're right - the problem I'm trying to solve is quite restricted -
> > and I'm very thankful for this ;-) Basically, I'm trying to write an
> > application that "recognizes" log file formats, so that the following
> > lines are identified as several manifestations of the same log
> > message:
> >
> > could not delete user 3248234
> > could not delete user 2348723
> >
> > or even
> >
> > failed to connect to primary server
> > failed to connect to secondary server
> >
> > What I would like to see is a count of how many "manifestations" of
> > each log message are being thrown, independently of the actual data
> > they might contain. Since I do not want to hardcode the log messages
> > into my application, I would like to generate regexes on the fly as
> > they are needed.
>
> Well and how are you going to tell the program which messages to take
> as the same?
> Do you plan to teach the app as it reads the lines? Do you want it to
> ask which group is a line that doesn't match any of the regexps so
> far and have the regexp modified on the fly to match that line as
> well?
>
> Or what do you want to do?
>
> IMHO it might be best to use handmade regexps, just don't have them
> built into the application, but read from a config file. That is for
> each type of logs you'd have a file with something like this:
>
> delete_user=^could not delete user \d+
> connect=^failed to connect to (?:primary|secondary) server
> ...
>
> read the file, compile the regexps with qr// and have the application
> try to match them and have the messages counted in the first group
> whose regexp matches.
>
> Do I make sense or am I babbling nonsense?

You're making perfect sense - the problem is not as trivial as I originally thought, but I think it's not that bad as long as you don't require a precision of 100%.

In a Perl script I wrote some time ago, I'm grouping log messages by comparing them word by word, using the String::Compare module like this:

    compare($message1, $message2, word_by_word => 5);

If I read the module's code correctly, the strings are split up by whitespace and then compared char by char. Using this approach, I get a high similarity even if the differing parts of the strings do not have the same length, like in

    failed to connect to primary server
    failed to connect to secondary server

What I did now was to extend String::Compare so that it records the differing parts of the strings in a string array for each string and returns a "wildcarded" version of the string, i.e. a version that replaces each character that is not identical in both strings with a wildcard string. (Actually, I did not extend String::Compare, but ported it to Java, because I'm writing the application in Java, but the idea should be the same.)

Currently, I'm not using the regexp that is generated in this way for matching new messages, because I ran into a chicken-and-egg problem: what should I do when I get a message for which I do not have a matching regexp yet? Since I have only one occurrence of this message so far, I cannot detect a pattern, and thus I cannot generate a regexp. Therefore, I've got to compare all messages that follow against the real messages (using the method described above), not against a wildcarded version.

Anyway - if you choose the wildcard wisely, I think you should be able to generate a regexp that is surely not as good as one written by a human, but probably good enough (e.g. you could take (.*?) as the wildcard for each differing "word").

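The wildcarding idea above could be sketched roughly as follows in Java. This is only an illustration under simplifying assumptions, not the String::Compare algorithm or the ported version: it assumes both messages have the same number of whitespace-separated words and replaces a whole differing word with (.*?), rather than comparing character by character. The names WildcardSketch, wildcard and extract are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WildcardSketch {

    /**
     * Builds a regexp from two sample messages: identical words are kept
     * literally, differing words become the capturing wildcard (.*?).
     * Returns null when the word counts differ, because this sketch does
     * not try to align messages of different length.
     */
    public static Pattern wildcard(String a, String b) {
        String[] wordsA = a.trim().split("\\s+");
        String[] wordsB = b.trim().split("\\s+");
        if (wordsA.length != wordsB.length) {
            return null;
        }
        StringBuilder regex = new StringBuilder("^");
        for (int i = 0; i < wordsA.length; i++) {
            if (i > 0) {
                regex.append("\\s+");
            }
            // Pattern.quote protects regexp metacharacters in literal words.
            regex.append(wordsA[i].equals(wordsB[i])
                    ? Pattern.quote(wordsA[i])
                    : "(.*?)");
        }
        return Pattern.compile(regex.append('$').toString());
    }

    /** Returns the values captured by the wildcards, or null if there is no match. */
    public static List<String> extract(Pattern pattern, String message) {
        Matcher m = pattern.matcher(message.trim());
        if (!m.matches()) {
            return null;
        }
        List<String> values = new ArrayList<>();
        for (int g = 1; g <= m.groupCount(); g++) {
            values.add(m.group(g));
        }
        return values;
    }

    public static void main(String[] args) {
        Pattern p = wildcard("failed to connect to primary server",
                             "failed to connect to secondary server");
        System.out.println(p.pattern());
        System.out.println(extract(p, "failed to connect to tertiary server"));
    }
}
```

Note that (.*?) can also match text containing whitespace, so a pattern generated this way is looser than a handwritten one; that is the "probably good enough" trade-off mentioned above.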
At the moment, this should be enough to solve my problem - I'm already using the word-by-word string comparison successfully, and it looks as if the ported/extended Java version of String::Compare would do what I need.

Nevertheless, I could imagine that you could build better regexps by comparing the data that you extracted from the message (since I need to extract the data anyway to use them later, this is a very likely option). Let's say I've got the following log messages taken from a web application:

    30/08/2004 23:25:01 processed request for a.html - took 35 ms
    30/08/2004 23:25:05 processed request for ab.html - took 42 ms
    30/08/2004 23:25:05 processed request for a.html - took 37 ms

My application compares the messages, detects that they are very similar, and creates the following pattern (assuming that the wildcard char is an asterisk and that a multi-character difference is replaced by one wildcard char):

    30/08/2004 23:25:* processed request for *.html - took * ms

The differing data it extracts for the three lines is this:

    01   a    35
    05   ab   42
    05   a    37

Going over the individual "columns" of data, the application could try to match some pre-declared data formats, i.e. it could check if all values match certain patterns like "\d+", "[a-zA-Z]+" etc. If it finds a matching format, it could adapt the regexp so that it matches in a more fine-grained way.

You could object (and if I understood your mail correctly, you already did) that the application created a wrong pattern by taking the date (including the minutes, but not the seconds) as fixed - a log message that arrives a minute later would not fit the regexp anymore. This is a problem if I'm trying to use the regexp to match the messages, but not if I'm comparing the messages as strings again (as described above).

Writing this, I think you're right - my problem is probably not solvable by generating regexps on the fly, but only (hopefully) by comparing strings on a more brute-force level.

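The column check described above (testing each column of extracted values against a few pre-declared formats) might look something like this in Java. It is a sketch only: the class name ColumnFormatSketch, the method refine and the particular list of formats are assumptions for illustration, and the fallback to (.*?)-style matching when no format fits is one possible choice.

```java
import java.util.List;

class ColumnFormatSketch {

    // Pre-declared formats, tried in order (a hypothetical list).
    static final List<String> FORMATS = List.of("\\d+", "[a-zA-Z]+");

    /**
     * Returns the first pre-declared format that every value in the column
     * matches completely, or the generic wildcard ".*?" if none fits.
     */
    public static String refine(List<String> column) {
        if (column.isEmpty()) {
            return ".*?";
        }
        for (String format : FORMATS) {
            // String.matches requires the whole value to match the format.
            if (column.stream().allMatch(v -> v.matches(format))) {
                return format;
            }
        }
        return ".*?";
    }

    public static void main(String[] args) {
        // The three columns extracted from the example messages.
        List<List<String>> columns = List.of(
                List.of("01", "05", "05"),
                List.of("a", "ab", "a"),
                List.of("35", "42", "37"));
        for (List<String> column : columns) {
            System.out.println(column + " -> " + refine(column));
        }
    }
}
```

For the example data this picks the digit format for the first and third columns and the letter format for the second. With only three samples the choice can easily be too narrow (a later request for "index2.html" would no longer match), which is the same limitation as the fixed date discussed above.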
It might be an option to try to use regexps in order to speed up the process, but if you do not find a matching regexp, you probably need to go back to comparing strings again...

I've not finished the application yet, so I can't say if all of this is going to work, but I'm quite optimistic at the moment. With a bit of luck, I can show you a working version in a few weeks (FWIW: the application I'm talking about will be a log4j server application - similar to chainsaw, but built for the application operators as opposed to the developers).

Thank you for your insightful questions and suggestions - I appreciate very much the opportunity to discuss those problems before running against too many walls. :-)

Philipp