There are 3 major factors determining your performance: 1) the speed of the
CPU, 2) the size of your RAM, and 3) the data transfer speed of your disks.

Since most computers do not let you change your CPU on the fly, I will
assume that any given process requires the same number of clock cycles to
execute whether your data is in memory or on disk. Being a constant, we can
leave it out for the moment.

RAM is much faster than disk for storage and retrieval. Thus, if you have
more RAM, you will generally have better performance.

That leaves DISK I/O as the factor you are trying to change in order to
improve your performance, right?  Let's think through several scenarios and
figure out why what you did made only a small difference to your
performance in the tests you ran.

TEST 1) Random read of 100000 rows

Whenever the database engine has to find a value in a database, it has only
two choices: it can find the value in an index, or it can perform a table
scan. Since I assume you used your primary key as the search value for both
tables, both lookups were resolved by an index search. MySQL uses B-tree
indexes, which works out to roughly N = log2(X) key comparisons to find your
value in an index of size X. For an index containing 1.5 million items that
is log2(1.5e6), or about 21 comparisons, to locate your value. For an index
containing 16 million items it is log2(1.6e7), or about 24 comparisons. How
much longer do you think the database engine needs for those 3 extra
comparisons? What takes the longest is transferring the index data from disk
into RAM and moving the read heads to the correct location to retrieve your
record; the actual lookup is blazingly fast. 16 million integers (assuming
32 bits per integer plus 8 bytes for record pointers) is only 192 MB, so the
entire index fits nicely into most servers' memory.
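
If you want to see both of those numbers for yourself, MySQL can do the
arithmetic, and EXPLAIN will confirm that each lookup is a single index probe
rather than a table scan. This is only a sketch; the column name some_col and
the ID value are placeholders for whatever you actually selected:

   -- approximate key comparisons needed for each index size
   SELECT CEILING(LOG2(1500000))  AS comparisons_1_5M,
          CEILING(LOG2(16000000)) AS comparisons_16M;
   -- returns 21 and 24

   -- both of these should report key = PRIMARY and type = const,
   -- i.e. one index probe, no table scan
   EXPLAIN SELECT some_col FROM SMALL_TABLE WHERE ID = 1050000;
   EXPLAIN SELECT some_col FROM BIG_TABLE   WHERE ID = 1050000;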

TEST 2) Insert 100000 rows of new data

This performance will vary with the storage engine you are using (MyISAM or
InnoDB), as does the storage method. Suffice it to say that by the time the
disk has finished physically writing 100000 rows of data to the media, the
CPU has already updated your index and is waiting for a chance to commit its
cache to disk. Both engines can efficiently grow the end of their index and
data files, so dumping new data into a singly-indexed table is fast. Since
all of the new data was appended to the end of the dataspace, there was no
need to split data pages to insert rows mid-space. What can kill INSERT
performance is having several indexes on a table: each new record has to be
added to every index, causing a nearly linear slowdown in INSERT performance.
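
To make that last point concrete, here is a rough sketch (the table and
column names are made up) of the kind of thing that hurts: every secondary
index on the second table has to be updated for every row inserted, so a
100000-row load into it will be noticeably slower than into the first. On
MyISAM you can soften the blow during a bulk load by deferring maintenance
of the non-unique indexes:

   CREATE TABLE one_index  (id INT NOT NULL PRIMARY KEY, a INT, b INT, c INT);
   CREATE TABLE many_index (id INT NOT NULL PRIMARY KEY, a INT, b INT, c INT,
                            INDEX (a), INDEX (b), INDEX (c));

   -- MyISAM only: skip non-unique index maintenance while loading,
   -- then rebuild those indexes in one pass at the end
   ALTER TABLE many_index DISABLE KEYS;
   -- ... run the 100000 INSERTs here ...
   ALTER TABLE many_index ENABLE KEYS;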

What splitting your data into smaller tables WILL help with is:
1) You are more fault tolerant. The failure of one 10% "block" of your data
will not affect the availability of the other 90%.
2) You can move parts of your data to other disks (see the sketch that
follows this list). Using multiple I/O channels should be faster than a
single channel.
3) It will be easier to back up and archive your database. You can move
parts of your data to "near-line" storage if it is old and not frequently
needed.
4) You can apply different indexing methods to your most active data blocks
than to your older records, improving your response time. You save total
disk space by adding indexes to just the most active portions of your data.
5) If your application is smart enough, you can split your data over several
servers. This would be a SERIOUS performance increase.
6) Write-locking a table only locks up a portion of your data (MyISAM). This
improves concurrency and consistency.
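
For point 2, here is a minimal sketch of what moving one piece of the data
onto different disks can look like with MyISAM on a Unix-type server. The
table name, columns, and paths are only examples, and the DATA DIRECTORY /
INDEX DIRECTORY options depend on symbolic-link support, so treat this as an
illustration rather than a recipe:

   CREATE TABLE big_part_09 (
       id      INT NOT NULL PRIMARY KEY,
       payload VARCHAR(255)
   ) ENGINE = MyISAM
     DATA DIRECTORY  = '/disk2/mysql-data'
     INDEX DIRECTORY = '/disk3/mysql-index';

With the .MYD file on one spindle and the .MYI file on another, reads against
this piece of the data no longer compete with the other pieces for the same
read heads.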

Negative things about splitting your data:
1) You will have to code more MERGE tables, UNION queries, and table JOINs,
depending on your circumstances (a MERGE table sketch follows this list).
This is a negative due to the increased development time and maintenance
complexity.
2) Certain aggregate queries will be much less efficient or unavailable.
3) It requires an external partitioning process to decide which table should
receive new data.
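
To give you an idea of point 1, this is roughly what gluing the sixteen
pieces back into one logical table with a MERGE table looks like (MyISAM
only; the underlying tables must all be identical in structure, and the
table and column names here are just examples):

   CREATE TABLE big_all (
       id      INT NOT NULL,
       payload VARCHAR(255),
       KEY (id)
   ) ENGINE = MERGE
     UNION = (big_part_01, big_part_02, big_part_03)  -- ... through big_part_16
     INSERT_METHOD = LAST;  -- new rows land in the last table of the list

   -- queries can then go through big_all as if it were one table
   SELECT payload FROM big_all WHERE id = 1050000;

Every piece of that, plus the partitioning logic that decides which part
table a new row belongs to, is code you would not have needed with a single
table.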

Depending on your application, you may be able to gain significant
increases in performance by splitting your tables. I hope I was able to
shed some light on why I think your tests didn't show much difference
between the two storage schemes.

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine



                                                                                       
                                 
                      "Leonardo                                                        
                                 
                      Francalanci"             To:       "Mysql" <[EMAIL PROTECTED]>   
                             
                      <[EMAIL PROTECTED]        cc:                                    
                                  
                      tel.ie>                  Fax to:                                 
                                 
                                               Subject:  R: why CPU is high while 
disks are idle in a table scan???     
                      06/21/2004 01:20                                                 
                                 
                      PM                                                               
                                 
                                                                                       
                                 
                                                                                       
                                 




Sorry, now it runs as expected (but I don't know what happened; maybe by
doing other queries I emptied out the cache).

Thank you anyway for your replies.

And: could somebody answer my previous question?

I wrote:


I have to deal with pretty big data, such as a table (say BIG_TABLE) with
16.000.000 rows. The table has just one primary index (a simple int).

I thought that splitting the data among multiple tables would boost
performance, so I put my data into 16 different tables (everything is on one
disk). The first table had IDs from 1 to 1.500.000, the second the range
1500001-3000000, and so on. When I ran performance tests, I found that I
didn't get any performance gain.

So I tried (just for test purposes) to create a table of 100000 rows (say
SMALL_TABLE) with some data (IDs from 1.000.000 to 1.100.000) from my big
table BIG_TABLE. The tests were as follows (pseudocode):

for 100000 times
    read a column from BIG_TABLE where ID = <random number between 1.000.000 and 1.100.000>


for 100000 times
    read a column from SMALL_TABLE where ID = <random number between 1.000.000 and 1.100.000>


The results are the same!
There is no noticeable difference between a table with 16M rows and a table
with 100K rows!
Is that because my IDs are simple int from 1 to n (without gaps)?

I've tried to do 100000 inserts in both tables, and BIG_TABLE takes 45
seconds while SMALL_TABLE takes 43 secs (not such a big difference for a
table 160 times smaller!).

The only big performance gain I got is from a SUM over a big range, because
MySQL in this case does a full scan on SMALL_TABLE, which is faster than an
access by index on BIG_TABLE.

Am I doing something wrong? Isn't splitting supposed to improve performance?
Or is that true only if you can split the data over multiple disks?


Thank you

Leonardo


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

