Cary, This is greatness.
Jack --- Cary Millsap <[EMAIL PROTECTED]> wrote: > I have an example for you (Anjo, I hope you won't > mind). A prospect we > visited once upon a time had been fighting a > performance problem with an > Oracle Payroll program. They "knew" what their > problem was: very > clearly, v$system_event was telling them that their > overwhelmingly > dominant system performance problem was waits for > "latch free." (Well, > no it wasn't. It was waits for the Oracle Payroll > program. ;) ) > > Actually, the v$system_event view indicated clearly > that the top wait > event was "SQL*Net message from client" (and some > other "SQL*Net %" > stuff), but of course every good DBA knows that you > have to discard all > the "SQL*Net" events because they are "idle" events. > > Three months of trying and failing to fix the > problem led finally to a > last-resort, desperation hardware upgrade. Twelve > 700MHz CPUs became > twelve new 1,024MHz CPUs. (Don't ask why "waits for > latch free" was > translated as "need to upgrade CPUs.") And > unfortunately, the Payroll > program got *slower*. Not just "seemed > slower"--*was* slower. > > After the upgrade failed to help, these guys finally > let us come see > them. We traced the Payroll process (10046, level > 8), very carefully > initiating the trace immediately after the > initiation of the Payroll > run, and very carefully terminating the trace > immediately after the > completion of the Payroll run. The key here is that > every second of data > in the trace file was a second of somebody's > response time. > > The result was a really fun story. The thing ran in > about 33 minutes. > Here's a summary of what was in the session's trace > file: > > Oracle Kernel Event Duration > Calls Avg > > ------------------------------ ------------------ > -------- ---------- > SQL*Net message from client 984.01s 49.6% > 95,161 0.010340s > SQL*Net more data from client 418.82s 21.1% > 3,345 0.125208s > db file sequential read 279.54s 14.1% > 45,085 0.006200s > CPU service 248.69s 12.5% > 222,760 0.001116s > unaccounted-for 27.73s 1.4% > > latch free 23.69s 1.2% > 34,695 0.000683s > log file sync 1.09s 0.1% > 506 0.002154s > SQL*Net more data to client 0.83s 0.0% > 15,982 0.000052s > log file switch completion 0.28s 0.0% > 3 0.093333s > enqueue 0.25s 0.0% > 106 0.002358s > SQL*Net message to client 0.24s 0.0% > 95,161 0.000003s > buffer busy waits 0.22s 0.0% > 67 0.003284s > db file scattered read 0.01s 0.0% > 2 0.005000s > SQL*Net break/reset to client 0.00s 0.0% > 2 0.000000s > ------------------------------ ------------------ > -------- ---------- > Total 1,985.40s 100.0% > > Well, waits for "latch free" accounted for only 1.2% > of the total > response time. If the DBAs had been completely > successful in eliminating > "latch free" waits from their system, it would have > made only a 1.2% > difference in the runtime of this program. Point 1: > <emphasis>This > program's bottleneck is not the same as its system's > system-wide average > bottleneck.</emphasis> This happens all the time. > You can NOT see this > type of problem when you look only at system-wide > average statistics. > > You probably didn't pay much heed to the first thing > you should have > noticed: slightly more than 70% of this program's > runtime was consumed > by so-called "server idle" time. ??! You can't > ignore this time, even if > people do call it "idle." Idle or not, this time was > part of someone's > response time, so we need to deal with it. Now if we > hadn't collected > our statistics so carefully (with proper time scope > and proper program > scope), then we would have seen probably much more > "SQL*Net message from > client" time in our data. If you make that > particular collection error, > then you *have* to disregard the so-called "idle" > wait time. > > What was the Payroll program's problem? A couple of > things. First, we > knew that the Payroll program was running on the > same host as Oracle. > But the latency per call on the "SQL*Net message > from client" event > looked suspiciously LAN-like (order of 10ms), not > IPC-like (order of 1ms > or less). So we checked the tnsnames.ora file. Turns > out that to > conserve administration effort, the DBAs had decided > to use a single > tnsnames.ora file system-wide. So the Payroll > program was using the same > TCP/IP protocol that the system's PC clients were > using. Just by > switching the protocol to BEQ instead of TCP in the > db server's > tnsnames.ora file (a change that took about 15 > minutes to test, 20 > seconds to implement, and about a week to navigate > through change > control), we eliminated about 50% of the program's > runtime. > > Next, what's with the "SQL*Net more data from > client"? The developers of > this particular Payroll program did a pretty bad > thing in their code: > they passed several really long (10KB) SQL texts > from the client to the > server in their parse calls (oh yeh, and with no > bind variables either). > Anyway, each time the Oracle kernel received a > packet containing only > part of a SQL statement, it tallied to "SQL*Net more > data from client" > while it waited for the next piece. Of course, > communicating via the > network interface card instead of IPC didn't help, > but passing huge SQL > statements from the app to the db is bad form to > begin with (see Ixora > for more detail on just how bad this is). > > Finally, realize that Oracle tallies to "SQL*Net > message from client" > pretty much every time it's "between" working on db > call fulfillment. > Therefore, there were on the order of 95,161 db > calls during this > 33-minute period. Looking at the SQL, we discovered > that of course, > since the app developers didn't use bind variables, > there was a parse > call for almost every execute. This accounted for > thousands of > unnecessary parse calls. Next, each execute call > manipulated only zero > or one row, which added up to dozens of execute > calls for each single > SQL statement with a multi-row result source. The > program could have > performed the business function with a fraction of > the db calls. > > Oh yes... Why did the Payroll program get *slower* > after the upgrade? > Little Payroll program time was spent using CPU, so > the upgrade had very > little direct positive effect upon the program. Most > of the program's > time was spent waiting for network latency. Other > programs ran at the > same time as this Payroll job. The CPU upgrade made > *them* faster, which > intensified *their* constant number of network calls > into a smaller time > window. The result was increased competition for the > network during the > === message truncated === __________________________________________________ Do You Yahoo!? LAUNCH - Your Yahoo! Music Experience http://launch.yahoo.com -- Please see the official ORACLE-L FAQ: http://www.orafaq.com -- Author: Jack Silvey INET: [EMAIL PROTECTED] Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051 San Diego, California -- Public Internet access / Mailing Lists -------------------------------------------------------------------- To REMOVE yourself from this mailing list, send an E-Mail message to: [EMAIL PROTECTED] (note EXACT spelling of 'ListGuru') and in the message BODY, include a line containing: UNSUB ORACLE-L (or the name of mailing list you want to be removed from). You may also send the HELP command for other information (like subscribing).
