I find myself in an odd position. I am trying to convince the
developers of our system that we (netops) need their applications
to log things. I'm getting a lot of resistance. And I'm having
trouble coming up with a good "We need it because ____" argument.
What I've said:
1) We need to be able to trace data flow through the system
2) We need to be able to observe what the system does during
normal operation (so we know the difference when something is
wrong)
3) We need logs to troubleshoot at the individual machine level
What I want to know is: does anyone have a good reference that
programmers will accept (like written by a coder) that describes
concisely the operational requirements surrounding logging? I
mean, I know that good logfiles are absolutely critical for the
repairability of any reliable system...and so does everyone I
hang out with (most of them are sysadmins...) but how do I
convince NON-sysadmins of this fact?
Some background might help. The place I work at now is
basically a big message passing system. Messages (requests)
come in and responses (and errors and notifications) go out.
Message data is stored in queues (IBM MQSeries) and message
state information is stored in a database (Oracle). Our
code is mostly Java stuff split across nearly a dozen different
components (and twice that many machines). The different
components were written by different people...some of them log
what they're doing (to varying degrees of usefulness) and
some do not log ANYTHING unless there's an error. Literally,
the logs say "Starting..." and then there is ABSOLUTELY
NOTHING, even if we send a thousand messages through it.
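To make concrete what I mean by logging normal operation, here is a minimal sketch of one INFO line per message handled, using the standard java.util.logging package. The class name, method names, message ID and queue name are all made up for illustration; none of this is taken from our actual components.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical component; names and fields are illustrative assumptions only.
final class MessageHandler {
    private static final Logger LOG =
            Logger.getLogger(MessageHandler.class.getName());

    // Builds the one-line summary. Kept separate so the format is easy to check.
    static String describe(String messageId, String queue, long elapsedMillis) {
        return "processed message id=" + messageId
                + " queue=" + queue
                + " elapsedMs=" + elapsedMillis;
    }

    // Stand-in for whatever the component really does with a message.
    static void handle(String messageId, String queue) {
        long start = System.currentTimeMillis();
        // ... real work would go here ...
        long elapsed = System.currentTimeMillis() - start;

        // Skip building the string entirely when INFO is switched off.
        if (LOG.isLoggable(Level.INFO)) {
            LOG.info(describe(messageId, queue, elapsed));
        }
    }

    public static void main(String[] args) {
        handle("MSG-0001", "REQUEST.IN");
    }
}
```

With something like this in place, tailing the log during a quiet period shows a steady stream of "processed message" lines, and a component that has stalled is visible simply because its lines stop.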
The response from the coders has been basically "Everything
you want you can get from looking in the database."
My requirement 1) above *can* be addressed by looking in
the database, because the data flow at any point is represented
by data in the database.
2) may or may not be addressed by the database. I need to look
more into what exactly is in there, but it may be theoretically
possible to examine tables and get an idea of what the system
is doing. However (and this is difficult to quantify), I don't
think doing SQL SELECTs is as useful to an ops person as being
able to tail logfiles. The latter gives a real-time monologue
of what the system is doing, while the former is more interrupt-
driven and interactive. I think ops folks WILL sometimes just
look through logfiles to see what's going on...I don't think
they will EVER look through the database unless there's a
problem.
3) is not at all addressed by the database. The data has no
record of what instance (what thread on what machine) inserted
it. So if we have a problem that is specific to one machine
we can only catch it from the logs. However, as long as errors
ARE recorded in the logs, it addresses this requirement, so it's
not a good argument as to why normal operation should be logged.
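For what it's worth, the instance information that the database lacks is cheap to put into a log line. The following is only a sketch under my own assumptions: the class, the bracketed format and the sample text are hypothetical, and the only library calls used are the standard InetAddress and Thread ones.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper that prefixes a log line with "which machine, which thread".
final class InstanceTag {

    // Best-effort lookup of the local host name; falls back to a placeholder.
    static String hostName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown-host";
        }
    }

    // Pure formatting step, separated out so it can be checked in isolation.
    static String tag(String host, String thread, String text) {
        return "[" + host + "/" + thread + "] " + text;
    }

    // Tags the text with the current host and the calling thread's name.
    static String tagged(String text) {
        return tag(hostName(), Thread.currentThread().getName(), text);
    }

    public static void main(String[] args) {
        System.out.println(tagged("inserted state row for message MSG-0001"));
    }
}
```

A line tagged this way can be grepped by host or thread name, which is exactly the lookup the database rows cannot support.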
OK, so that got real long, sorry about that.
Any suggestions? Any of you had to deal with this sort of thing
in the past?
___________________________________________________________________
P a u l
[EMAIL PROTECTED]
_______________________________________________
Bits mailing list
[EMAIL PROTECTED]
http://www.sugoi.org/mailman/listinfo/bits