I guess the problem I'm having is that I need to consolidate information. Since this is an NFS log, each line represents a file read or written, and that's too much information (hundreds of MB a day). I need to be able to distill it down to summary information; I'm just not sure how to handle that. I figure the smallest unit I'd have is what one user on one machine read or wrote on one filesystem during one hour.

Maybe a simple format:

    filesystem  user  client-machine  time-to-the-hour  bytes-read  bytes-written

Then for every line in the log, I check whether I already have an entry matching the first four fields. If so, I add in the bytes read or written; if not, I create a new entry. After that I can sort on whatever field I want and limit my searches however I need to. Only that seems inefficient. Could I normalize that somehow?
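Something like this minimal sketch is what I have in mind. Fair warning that it makes things up: the simplified line layout in parse_line() is a stand-in for real nfslog/xferlog parsing, and the 'i'/'o' direction flags are borrowed from the ftp xferlog convention.

#!/usr/bin/perl
use strict;
use warnings;

my %totals;    # composite key => { read => N, written => N }

while (my $line = <>) {
    my ($fs, $user, $client, $time, $bytes, $dir) = parse_line($line)
        or next;

    my $hour = $time - $time % 3600;            # truncate to the hour
    my $key  = join "\0", $fs, $user, $client, $hour;

    my $rec = $totals{$key} ||= { read => 0, written => 0 };
    if ($dir eq 'o') { $rec->{read}    += $bytes }  # 'o' = outgoing = a read
    else             { $rec->{written} += $bytes }  # 'i' = incoming = a write
}

# One possible report: buckets sorted by bytes written, descending.
for my $key (sort { $totals{$b}{written} <=> $totals{$a}{written} }
             keys %totals)
{
    my ($fs, $user, $client, $hour) = split /\0/, $key;
    printf "%-20s %-8s %-12s %s  read=%-10d written=%-10d\n",
        $fs, $user, $client, scalar localtime $hour,
        $totals{$key}{read}, $totals{$key}{written};
}

# Stand-in parser: pretend each line is
#   <epoch-seconds> <filesystem> <user> <client> <bytes> <i|o>
# Substitute real nfslog/xferlog field splitting here.
sub parse_line {
    my ($time, $fs, $user, $client, $bytes, $dir) = split ' ', shift;
    return unless defined $dir;
    return ($fs, $user, $client, $time, $bytes, $dir);
}

The flat composite key keeps every bucket a single hash lookup away, which is why I'm leaning toward it over the hash-of-hashes I first asked about.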
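Wade, your array-of-hashes suggestion (quoted below) seems to apply just as well after consolidation: flatten the buckets into one hash per record and reports like "users on X day sorted by data written" fall out of one grep and one sort. A sketch, with records I made up to match the shape above:

use strict;
use warnings;
use Time::Local;    # core module

# Made-up consolidated records, one hash per fs/user/client/hour bucket.
my @records = (
    { fs => '/export/home', user => 'paul',  client => 'wren',
      hour => timelocal(0, 0, 9,  27, 2, 106),   # 2006-03-27 09:00
      read => 12_345_678, written => 987_654_321 },
    { fs => '/export/data', user => 'gwade', client => 'crow',
      hour => timelocal(0, 0, 14, 27, 2, 106),   # 2006-03-27 14:00
      read => 55_555,     written => 1_000 },
);

# "Users on 2006-03-27 sorted by data written": filter to the day,
# then sort descending on the written column.
my $day    = timelocal(0, 0, 0, 27, 2, 106);     # midnight, local time
my @report = sort { $b->{written} <=> $a->{written} }
             grep { $_->{hour} >= $day && $_->{hour} < $day + 86_400 }
             @records;

printf "%-8s %-20s %12d\n", $_->{user}, $_->{fs}, $_->{written}
    for @report;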
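And since the database suggestion keeps coming up in the thread below, here's roughly what the same consolidation would look like pushed through DBI, for comparison. Note the substitutions: I've swapped your DBD::CSV for DBD::SQLite (assuming it's installed), since I don't know whether the CSV driver's SQL dialect can do GROUP BY, and the table and column names are my own invention.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=nfslog.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# One row per raw transfer; consolidation happens at query time.
$dbh->do(<<'SQL');
CREATE TABLE xfers (
    fs            TEXT,
    username      TEXT,
    client        TEXT,
    hour          INTEGER,   -- epoch seconds, truncated to the hour
    bytes_read    INTEGER,
    bytes_written INTEGER
)
SQL

# Loading is one execute per parsed log line (values here are examples).
my $ins = $dbh->prepare('INSERT INTO xfers VALUES (?, ?, ?, ?, ?, ?)');
$ins->execute('/export/home', 'paul', 'wren', 1143471600, 0, 4096);

# Consolidation becomes a GROUP BY; any report is a different ORDER BY.
my $rows = $dbh->selectall_arrayref(<<'SQL');
    SELECT   fs, username, client, hour,
             SUM(bytes_read)    AS r,
             SUM(bytes_written) AS w
    FROM     xfers
    GROUP BY fs, username, client, hour
    ORDER BY w DESC
SQL

printf "%s %s %s %d read=%d written=%d\n", @$_ for @$rows;

That would also answer my normalization worry: filesystem, user, and client could each move out into their own table with integer keys if the row count got out of hand.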
Paul

Yesterday, G. Wade Johnson wrote:

> I'm not familiar with that log format, but taking the database suggestion
> one step in an odd direction, DBD::CSV might be able to put a relational
> database front-end on a log file.
>
> In the past, my normal approach for this sort of thing was an array of
> hashes (one hash per line). The array is easily sorted using 'sort' and
> can be filtered using 'grep'.
>
> Depending on how big the data set is and how complicated the query, a
> database might be a better choice.
>
> G. Wade
>
> On Mon, 27 Mar 2006 17:41:39 -0600 [EMAIL PROTECTED] wrote:
>
>> On Mon, Mar 27, 2006 at 05:13:02PM -0600, Paul Archer wrote:
>>> I'm writing a log analyzer (a la Webalizer) to analyze Solaris' nfslog
>>> files. They're in the same format as wu-ftpd xferlog files. I'd use an
>>> existing solution, but I can't find anything that keeps track of reads
>>> vs. writes, which is critical for us.
>>> Anyway, I need to be able to sort by filesystem, client machine, user,
>>> time (with a one-hour base period), read, write, or total usage.
>>> Can anyone suggest a data structure (or pointers to same) that will
>>> allow me to pull data out in an arbitrary fashion (i.e. users on X day
>>> sorted by data written)?
>>> Once I have the structure, I can deal with doing the reports, but I
>>> want to make sure I don't shoot myself in the foot with the structure.
>>>
>>> I was thinking of a hash of hashes, where the keys are filesystems
>>> pointing to hashes where the keys are client machines, etc., etc. But
>>> it seems that approach would be inefficient for lookups based on times
>>> or users (for example).
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Paul
>>
>> Um. Have you considered a relational database? Sounds ideal for your
>> problem.
>
> --
> No, no, you're not thinking, you're just being logical.
>     -- Niels Bohr

-----------------------------------------------------
"Somebody did say Swedish porn, there-- but someone always does..."
--Clive Anderson, host of "Whose Line Is It, Anyway",
after asking the audience for movie suggestions
-----------------------------------------------------
