yep, that's the one
--- Mohammad Rafiq <[EMAIL PROTECTED]> wrote:

> I think Rachel is referring to the following email from Cary Millsap.
> It is a really nice explanation of this issue....
> Regards
> Rafiq
>
> I have an example for you (Anjo, I hope you won't mind). A prospect we
> visited once upon a time had been fighting a performance problem with
> an Oracle Payroll program. They "knew" what their problem was: very
> clearly, v$system_event was telling them that their overwhelmingly
> dominant system performance problem was waits for "latch free."
> (Well, no it wasn't. It was waits for the Oracle Payroll program. ;) )
>
> Actually, the v$system_event view indicated clearly that the top wait
> event was "SQL*Net message from client" (and some other "SQL*Net %"
> stuff), but of course every good DBA knows that you have to discard
> all the "SQL*Net" events because they are "idle" events.
>
> Three months of trying and failing to fix the problem led finally to a
> last-resort, desperation hardware upgrade. Twelve 700MHz CPUs became
> twelve new 1,024MHz CPUs. (Don't ask why "waits for latch free" was
> translated as "need to upgrade CPUs.") And unfortunately, the Payroll
> program got *slower*. Not just "seemed slower"--*was* slower.
>
> After the upgrade failed to help, these guys finally let us come see
> them. We traced the Payroll process (10046, level 8), very carefully
> initiating the trace immediately after the initiation of the Payroll
> run, and very carefully terminating the trace immediately after the
> completion of the Payroll run. The key here is that every second of
> data in the trace file was a second of somebody's response time.
>
> The result was a really fun story. The thing ran in about 33 minutes.
> Here's a summary of what was in the session's trace file:
>
> Oracle Kernel Event                     Duration           Calls        Avg
> ------------------------------  -----------------  ---------  ---------
> SQL*Net message from client       984.01s   49.6%     95,161  0.010340s
> SQL*Net more data from client     418.82s   21.1%      3,345  0.125208s
> db file sequential read           279.54s   14.1%     45,085  0.006200s
> CPU service                       248.69s   12.5%    222,760  0.001116s
> unaccounted-for                    27.73s    1.4%
> latch free                         23.69s    1.2%     34,695  0.000683s
> log file sync                       1.09s    0.1%        506  0.002154s
> SQL*Net more data to client         0.83s    0.0%     15,982  0.000052s
> log file switch completion          0.28s    0.0%          3  0.093333s
> enqueue                             0.25s    0.0%        106  0.002358s
> SQL*Net message to client           0.24s    0.0%     95,161  0.000003s
> buffer busy waits                   0.22s    0.0%         67  0.003284s
> db file scattered read              0.01s    0.0%          2  0.005000s
> SQL*Net break/reset to client       0.00s    0.0%          2  0.000000s
> ------------------------------  -----------------  ---------  ---------
> Total                           1,985.40s  100.0%
>
> Well, waits for "latch free" accounted for only 1.2% of the total
> response time. If the DBAs had been completely successful in
> eliminating "latch free" waits from their system, it would have made
> only a 1.2% difference in the runtime of this program. Point 1:
> <emphasis>This program's bottleneck is not the same as its system's
> system-wide average bottleneck.</emphasis> This happens all the time.
> You can NOT see this type of problem when you look only at system-wide
> average statistics.
>
> You probably didn't pay much heed to the first thing you should have
> noticed: slightly more than 70% of this program's runtime was consumed
> by so-called "server idle" time. ??! You can't ignore this time, even
> if people do call it "idle." Idle or not, this time was part of
> someone's response time, so we need to deal with it.
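[Editor's note: a profile like the one above is just an aggregation of the trace file's timed lines into totals, counts, and shares of total response time. A minimal sketch of that aggregation, in Python with made-up event names and durations (not the real Payroll trace, and not a real 10046 parser):

```python
from collections import defaultdict

def profile(events):
    """Aggregate (event_name, elapsed_seconds) pairs into a response-time
    profile: total duration, percent of total response time, call count."""
    totals, calls = defaultdict(float), defaultdict(int)
    for name, elapsed in events:
        totals[name] += elapsed
        calls[name] += 1
    grand = sum(totals.values())
    ranked = sorted(totals, key=totals.get, reverse=True)
    return [(name, totals[name], 100.0 * totals[name] / grand, calls[name])
            for name in ranked]

# Toy data standing in for parsed trace lines.
sample = [("SQL*Net message from client", 0.010)] * 3 + \
         [("db file sequential read", 0.006)] * 2 + \
         [("CPU service", 0.001)]

for name, dur, pct, n in profile(sample):
    print(f"{name:30s} {dur:8.3f}s {pct:5.1f}% {n:6d}")
```

The point of ranking by total duration is exactly Cary's Point 1: the profile shows where *this* session's seconds went, regardless of what the system-wide averages say.]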
> Now if we hadn't collected our statistics so carefully (with proper
> time scope and proper program scope), then we would probably have seen
> much more "SQL*Net message from client" time in our data. If you make
> that particular collection error, then you *have* to disregard the
> so-called "idle" wait time.
>
> What was the Payroll program's problem? A couple of things. First, we
> knew that the Payroll program was running on the same host as Oracle.
> But the latency per call on the "SQL*Net message from client" event
> looked suspiciously LAN-like (order of 10ms), not IPC-like (order of
> 1ms or less). So we checked the tnsnames.ora file. Turns out that to
> conserve administration effort, the DBAs had decided to use a single
> tnsnames.ora file system-wide. So the Payroll program was using the
> same TCP/IP protocol that the system's PC clients were using. Just by
> switching the protocol to BEQ instead of TCP in the db server's
> tnsnames.ora file (a change that took about 15 minutes to test, 20
> seconds to implement, and about a week to navigate through change
> control), we eliminated about 50% of the program's runtime.
>
> Next, what's with the "SQL*Net more data from client"? The developers
> of this particular Payroll program did a pretty bad thing in their
> code: they passed several really long (10KB) SQL texts from the client
> to the server in their parse calls (oh yeh, and with no bind variables
> either). Anyway, each time the Oracle kernel received a packet
> containing only part of a SQL statement, it tallied to "SQL*Net more
> data from client" while it waited for the next piece. Of course,
> communicating via the network interface card instead of IPC didn't
> help, but passing huge SQL statements from the app to the db is bad
> form to begin with (see Ixora for more detail on just how bad this is).
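[Editor's note: the bind-variable point can be illustrated with a toy model in pure Python (not Oracle client code): if a shared SQL cache is keyed on statement text, literal values make every statement text distinct and force a parse per execute, while a bound placeholder is parsed once and reused.

```python
class ToyCursorCache:
    """Toy model of a shared SQL cache keyed on statement text: each
    distinct text costs one parse; repeated texts hit the cache."""
    def __init__(self):
        self.cache = set()
        self.parse_calls = 0
        self.execute_calls = 0

    def execute(self, sql, binds=None):
        if sql not in self.cache:   # new statement text: pay a parse
            self.parse_calls += 1
            self.cache.add(sql)
        self.execute_calls += 1

emp_ids = range(1000)

literal = ToyCursorCache()
for emp in emp_ids:  # literals: every statement text is distinct
    literal.execute(f"UPDATE pay SET net = net WHERE emp_id = {emp}")

bound = ToyCursorCache()
for emp in emp_ids:  # one shared text; the value travels as a bind
    bound.execute("UPDATE pay SET net = net WHERE emp_id = :emp",
                  binds={"emp": emp})

print(literal.parse_calls, bound.parse_calls)  # 1000 vs 1
```

The same 1000 executions cost 1000 parses in the literal version and one parse in the bound version, which is the "parse call for almost every execute" pattern described below.]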
> Finally, realize that Oracle tallies to "SQL*Net message from client"
> pretty much every time it's "between" working on db call fulfillment.
> Therefore, there were on the order of 95,161 db calls during this
> 33-minute period. Looking at the SQL, we discovered that of course,
> since the app developers didn't use bind variables, there was a parse
> call for almost every execute. This accounted for thousands of
> unnecessary parse calls. Next, each execute call manipulated only zero
> or one row, which added up to dozens of execute calls for each single
> SQL statement with a multi-row result source. The program could have
> performed the business function with a fraction of the db calls.
>
> Oh yes... Why did the Payroll program get *slower* after the upgrade?
> Little Payroll program time was spent using CPU, so the upgrade had
> very little direct positive effect upon the program. Most of the
> program's time was spent waiting for network latency. Other programs
> ran at the same time as this Payroll job. The CPU upgrade made *them*
> faster, which intensified *their* constant number of network calls
> into a smaller time window. The result was increased competition for
> the network during the Payroll run. Therefore, every network I/O call
> the Payroll program made was a little slower than before the CPU
> upgrade. The increase in network response time overwhelmed the tiny
> direct improvement of the CPU time reduction, resulting in a net
> degradation of Payroll performance. ...Not a good thing, because this
> Payroll program had a higher business priority than everything else on
> the system at the same time.
>
> The End.
>
> * * *
>
> So, some wrap-up:
>
> 1. Looking at v$session_event is better than looking at
> v$system_event for these things. When you collect system-wide data to
> solve a single program's performance problem, your irrelevant data can
> bury your relevant data. We see this all the time.
> Any time you query v$system_event and ask about the top-line events
> you see, you're vulnerable to this problem.
>
> 2. Furthermore, looking even at v$session_event tells you NOTHING
> about the user's experience. When a v$ query tells you that there's
> "SQL*Net message from client" time, you can't know without more data
> whether this stuff showed up in someone's response time, or if it
> corresponded to time when the USER was actually idle.
>
> 3. Another important reason that looking at v$session_event data can
> mislead you is that there's no record of CPU time in there. To find
> that, you have to go to something like v$sesstat. The "CPU used by
> this session" statistic is supposed to give you the CPU time you need,
> but various bugs prevent you from getting what you need in many cases.
>
> 4. Response time = CPU from sesstat + waits from session_event + an
> unaccounted-for gap in time. Lots of people believe that the "gap" is
> a deficiency of 10046 data: it's not. The irony is that the same
> deficiency is even worse in v$ table methods; you just can't tell that
> there even *is* a gap: you can't compute gap = R - (cpu + waits),
> because with v$-based methods you don't know what R is. This "gap" is
> one of those things that we've discovered how to interpret reliably
> after some work. For more information, see some of our Hotsos Profiler
> documentation at www.hotsos.com/dnloads/1/kevents and click
> "unaccounted-for" (the document is free, but you have to register).
> (Bear with us, most of the event documentation on our kevents page is
> pre-beta, but the "unaccounted-for" document is pretty much complete.)

=== message truncated ===
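[Editor's note: point 4's accounting identity can be checked against the Payroll profile earlier in the message. A minimal sketch of the arithmetic in Python (an illustration only, not Hotsos's tool), using the wait and CPU figures from that table:

```python
def unaccounted_for(response_time, cpu_time, wait_times):
    """gap = R - (CPU + sum of wait-event time): the residue that a
    v$-based method cannot even detect, because it never knows R."""
    return response_time - (cpu_time + sum(wait_times))

# Wait-event durations from the Payroll profile, in seconds.
waits = [984.01, 418.82, 279.54, 23.69, 1.09, 0.83, 0.28,
         0.25, 0.24, 0.22, 0.01, 0.00]

gap = unaccounted_for(response_time=1985.40, cpu_time=248.69,
                      wait_times=waits)
print(f"{gap:.2f}s")  # matches the 27.73s "unaccounted-for" line
```

With trace data you know R (the trace covers the whole run), so the gap is computable; with v$ snapshots you have only the cpu and waits terms, which is exactly why the gap is invisible there.]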
--
Please see the official ORACLE-L FAQ: http://www.orafaq.com
--
Author: Rachel Carmichael
  INET: [EMAIL PROTECTED]
