Hi All,
Well after 6 wekks of testing here is the basic way
to operate SUN T3's as efficiently as possible.
Be prepared for arguments with High priests from the cult of SAME.
SUN T3's are fiber attached hardware RAID 5 arrays with
a "modern cache". The hardware engineers argue that if you need more
I/Os/sec just add another array as a "concatenated volume". The theory
being the hardware is intelligent enough to use the cache to increase
throughput. It actually works as they claim. Never did explain why
it wasn't a single point of failure in the end though.
My hardware was 3 4810's, each with 4 - 8 cpus, each with 4 - 8GB,
2 - 4 bricks per machine
First insist that multiple bricks be mounted on at least 2 mount
points.
(D2 and D3). DO NOT USE the forcedirectio option. I don't know why but
I have been unable to take less than a 40% throughput hit with it
turned on. And I don't care what other people say, no matter how
much respect I have for them
Insist on at least one JBOD for oracle binaries and configs
Insist on at least one JBOD for redo logs (D1)
This a bare minimum.
One set of redo on D1
One set of redo on D2
Archive logs, Rollback and Temp on D3
All data files where needed on D2
Next Level up
Add another JBOD for redo and move redo on to it
Move Rollback and Temp to D2
At this point to get more throughput you have to take the
JBOD to raw devices. Or try forcedirectio on these devices :)
If even better performance is needed, more JBOD, for rollback and redo.
If more disk spaces is needed, get another brick.
Which leads me to the recent discussion on "proper way to tune"
Huh????? Why make it so complex?
Tuning from a blue collar DBA perspective:
Assess the machine first
No matter what your ratios or what your waiting for:
sar to see if the machine is ever pinned
vmstat to see your queues and paging
iostat to see disk activity
top at timed intervals to catch rogue jobs
read your logs and config files
Then talk to the users
Is the "system slow" or is it specific jobs?"
log on run ratio reports and query v$system_event
Any ratio that is out of range needs to be tuned:
Especially disk sorts to memory sorts
For the infamous buffer cache ratio:
< 10% throw memory at it
> 97% take memory away
For wait states here's a quick drive through for those who look at
the number and say "Yeah but what do they mean"
Time Wait Total Time Average
Event # Waits Timeout In
Hndrds Time
-------------------------- --------- ------- ----------------------
SQL*Net more data to client ######### 0
680421 .005
SQL*Net message to client ######### 0 17590
0.000
SQL*Net message from client ######### 0 3953399703
35.511
- These are all communication to the client. ignore
db file sequential read 39562523 0
12300885 .311
- Data read, 0.0003 seconds average wait, ignore. This number will climb
if
- IO is bottlenecked or inappropriate (ie using FTS for joins)
rdbms ipc message 12440441 ####### 2774129387
222.993
- Internal machine communication ignore
db file scattered read 12264223 0
6202885 .506
- Data read, 0.0005 seconds average wait, ignore. This is higher due to
type of read.
- Increase in this time indicates an IO bottleneck
log file parallel write 4724477 67
2212249 .468
log file sequential read 2097709 0
1712615 .816
- Redo logs, with 2 pure raw JBOD I have got this down to about 0.25
hundredths
buffer busy waits 1548548 0
408235 .264
- Memory latch contention 0.0002 seconds ignore
- If Timeout or average increase, need to determine why contention is
increasing
control file parallel write 669234 0
376491 .563
pmon timer 662092 662074 203382329
307.181
- Internal waits ignore
direct path read 573442 0
423920 .739
- Reads from tempfiles (sorting). Each segment is 10M in this db so
ignore.
log file sync 551716 15
459036 .832
- See redo above.
db file parallel write 201166 0 610793
3.036
- writing updates and inserts 0.003 seconds ignore
undo segment extension 100516 100507 27
0.000
- Don't know what this is, hope it's not critical ;)
SQL*Net break/reset to client 92904 0
9522 .102
- Client communication ignore
LGWR wait for redo copy 92844 7
736 .008
- Affected by the archiver not keeping up. The alert log and an infamous
ratio
- are better ways of detecting this.
file open 76910 0
1874 .024
- note the number of occurances is getting very small, average time is
low, ignore
direct path write 69706 0 1408596
20.208
- writes for sorts, even though average time is high, research indicates
that
- the client does not wait for this, so internal ignore
SQL*Net message to dblink 48680 0 7
0.000
SQL*Net message from dblink 48680 0 108414
2.227
- network or machine dependant ignore
control file sequential read 45198 0
31664 .701
latch free 44849 30305
28693 .640
- Memory latch contention, notice rate of Timeout to # of waits
- If average time increases, need to determine why contention is
increasing
- QUICKLY
SQL*Net more data from client 43851 0 229528
5.234
enqueue 19946 19380 5973595
299.488
- I used to worry but ...., nothing ever happened so ignore
- And the rest happen so infrequently ignore
file identify 6961 0
496 .071
smon timer 6639 6612 203366288
30632.066
log file single write 2770 0 2998
1.082
rdbms ipc reply 1290 0 2302
1.784
log buffer space 1104 1 5507
4.988
db file single write 924 0
162 .175
log file switch completion 743 8 17454
23.491
refresh controlfile command 722 0
413 .572
pipe get 377 213 81693
216.692
library cache pin 316 0
152 .481
control file single write 231 0
164 .710
library cache load lock 72 6 877
26.069
single-task message 68 0
49 .721
switch logfile command 45 0 815
18.111
process startup 16 0 94
5.875
SQL*Net more data from dblink 16 0 0
0.000
row cache lock 5 0 0
0.000
db file parallel read 3 0 7
2.333
instance state change 2 0 0
0.000
Null event 1 1 410
410.000
reliable message 1 0 0
0.000
sort segment request 1 1 103
103.000
And the more you study your database the more you will understand of the
above :)
After you are aware of your systems problems, fix your config files and
file positions and then chase down SQL issues.
>From your users, capture the SQL run explain plan
Run top, catch processes that use a full cpu for more than 30 seconds
Capture the sql, run explain plan
I have always ogled women. When I got married, (well started going out)
I explained to my wife that I was making sure I had the best.
But really, she's a good wife,
I'm even allowed to have opinions. If she approves of them I'm allowed
to have them :)
The digest hit 983K on Friday, if I'm kind, 100K was content.
>From the titles I see that there were
performance problems with partitioned tables and bitmap indices?
I can't help those who won't help themselves.
And I don't receive HTML email.
TTFN
Off to figure out the relationship between multiblock_read_count and
those index_optimizer thingies
Dave
--
Dave Morgan
[EMAIL PROTECTED]
403 399 2442
--
Please see the official ORACLE-L FAQ: http://www.orafaq.com
--
Author: Dave Morgan
INET: [EMAIL PROTECTED]
Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
San Diego, California -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: [EMAIL PROTECTED] (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from). You may
also send the HELP command for other information (like subscribing).