Hello Heikki,
Monday, March 12, 2001, 11:51:54 PM, you wrote:
HT> Hi Peter,
HT> and sorry that this response to your Big Mail comes some 12 days late.
HT> We have put together the release 3.23.34a and that has occupied
HT> my mind.
No problem. At least progress is being made :) I'll try to test
Innobase more shortly, possibly later today.
>>The good thing is Innobase is already winning on some queries, even
>>though it is not as optimized for use within MySQL as MyISAM is.
HT> Some queries in sql-bench are a bit artificial, because they also
HT> test the bugginess of the database. If you select single rows from
HT> the database, then the CPU time is usually dominated by the communications
HT> overhead between your client and the database server. If the database is
HT> very big, it may be dominated by the disk seek needed to fetch the row.
Yes, of course. But there is still a real slowdown in some things, so I
just point it out - it may mean something, or it may be expected (I think
no table handler will be efficient for every application).
HT> My basic test has been a COUNT(*) on a big join in main memory. That
HT> measures the speed of the database server kernel, and communications overhead is
HT> negligible. The TPC-H query benchmark contains many such queries.
HT> I have to look at the MySQL query optimizer. It does not always pick
HT> the best table access strategy for Innobase, maybe because table
HT> statistics are calculated in a somewhat different way.
OK. That's a matter of optimization, I think :)
By the way, does Innobase optimize queries like SELECT COUNT(*) FROM
table? BDB needs to scan the whole table for this, which is not good for
some usages. If not, it would be nice to have something like an
approximate AVG_COUNT(*) available to the application - one of my
applications uses the count to decide whether it needs to purge the
table or not.
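For example, the kind of usage I mean (the table, columns and cutoff
below are hypothetical, and I assume TYPE = BDB is the right spelling
for the table type):

    CREATE TABLE hit_log (
      hit_time DATETIME NOT NULL,
      url      VARCHAR(250) NOT NULL
    ) TYPE = BDB;

    -- on BDB this scans the whole table, although an approximate
    -- count would be quite enough to decide whether to purge
    SELECT COUNT(*) FROM hit_log;

    -- the application purges old rows only when the count gets large
    DELETE FROM hit_log WHERE hit_time < '2001-02-01 00:00:00';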
HT> Also some insert tests are artificial. I have typically used a table
HT> with a primary key, and possibly a couple of non-unique secondary indexes.
HT> Rows usually come in the ascending order of the primary key. In performing
HT> inserts, a transactional database has the extra burden of writing log records
HT> for the inserts. The log is needed to provide rollback and recovery, of course.
Yes, of course. That's why I'm saying I'm quite happy with the insert
speed.
HT> Yes, it could handle, but currently you cannot define such tables
HT> in MySQL.
Of course :)
>>Is a clustered index page the same as a data page ?
HT> Yes, all data is stored in the pages of the clustered index.
HT> In Oracle such tables are called index organized tables, but in Sybase
HT> they use the term clustered tables. The idea is that the data is always
HT> kept in the order of the primary key (the cluster key). It makes accesses
HT> through the cluster key faster, since you do not need to latch and
HT> access a separate data page when you fetch a row.
Yes, this is good. But what about scanning the table completely? Does
that slow down a lot (have you done any tests on it)?
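Just so we are talking about the same thing, a small sketch of the case
I mean (hypothetical table; I assume TYPE = INNOBASE is the right
spelling for the table type):

    CREATE TABLE users (
      user_id INT NOT NULL,
      name    CHAR(32) NOT NULL,
      PRIMARY KEY (user_id)
    ) TYPE = INNOBASE;

    -- access through the cluster key: the row lives inside the clustered
    -- index page itself, so no separate data page has to be fetched
    SELECT name FROM users WHERE user_id = 12345;

    -- the case I am asking about: reading the whole table
    SELECT * FROM users;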
>>What I would do with MySQL is to add a hash column to a table and make
>>an index on it, and then compute a hash function for each
>>inserted/updated row and also use it on search. This gives me a
>>significant performance improvement as well as better disk utilization.
>>I think hash indexes are an extremely good feature, even if the hash
>>values are really stored in a btree.
HT> It is best to do that in the application. Adding a new access method to
HT> a database server is a major task, requiring 3 man-months of work, at
HT> the minimum :).
Well, of course I can do it in the application, but this is a common
problem: hash indexes are much more efficient for eq_ref lookups and take
much less space, so many databases have them, and MySQL 4.0, as I
remember, is also going to have this.
Doing this in the application is a bad idea, because a table can be
accessed from many different places, and for example from the console,
and it's hard to keep the hash synced with the row by hand.
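For example, the kind of trick I mean (hypothetical names; the hash
value below is made up - the application computes it itself):

    CREATE TABLE urls (
      url      VARCHAR(250) NOT NULL,
      url_hash INT UNSIGNED NOT NULL,
      hits     INT NOT NULL,
      KEY (url_hash)
    );

    -- the application computes a hash of the url and stores it with the row
    INSERT INTO urls (url, url_hash, hits)
      VALUES ('http://www.example.com/page1', 2839051, 1);

    -- on search the same hash is used, with the full value as a re-check,
    -- so the short hash index is used instead of a long index on url
    SELECT hits FROM urls
      WHERE url_hash = 2839051 AND url = 'http://www.example.com/page1';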
>>I meant: what will happen if I specify a wrong size in my.cnf ?
HT> When Innobase starts, it checks that the existing data files are of
HT> the size specified in my.cnf.
OK. So it will just not start in this case :)
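Just for reference, the kind of my.cnf entries in question (a minimal
sketch; I am assuming the innobase_* option names from the current
manual, and the paths and sizes are made up):

    [mysqld]
    innobase_data_home_dir = /usr/local/mysql/data/
    innobase_data_file_path = ibdata1:500M;ibdata2:500M

If the files already on disk do not match these sizes, the server refuses
to start, as you say.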
>>HT> ALTER TABLESPACE has to edit also the .cnf file.
>>That's impossible, because you can't find the exact place where you got
>>the value - it could have been specified just as a command line
>>parameter. Also, the problem is that my.cnf is usually only readable by
>>the mysql user and not writable by it (that would be a security hole).
>>Also, in my case several machines share one my.cnf over NFS. So if you
>>don't want to have a system tablespace and system tables, it would be
>>nice to have a special database config file in the innobase directory
>>which is updated by mysql. This is the same way mysql handles
>>replication - the current status is stored in a special file.
HT> Monty wanted me to put the Innobase config parameters to my.cnf. Maybe
HT> we should do so that Innobase would internally remember the data files
HT> and when you start up, it would remind you to edit the my.cnf file
HT> if you have used ALTER TABLESPACE ADD DATAFILE online.
Well, the things which change in the database at runtime (such as
databases, tablespaces, tables and so on) should not be held in the
config file, which may (and I think should) be read-only for the mysql
user.
I think Monty would agree with this :)
>>thing why a repair tool should exist.
HT> We have to look into this in depth. The problem with a repair tool is
HT> that it really cannot repair damaged data: if data in some rows has been
HT> written over, the tool cannot guess what the original data was. Of course,
HT> if the repair tool notices an inconsistency in a row, it can ask the
HT> user to type the data in by hand. If possible, you should of course put
HT> log files on a different disk from data files. If either disk fails,
HT> all is not lost.
Yes, of course. It's like with a file system - when damage occurs, there
is a big chance you will lose something (in some cases, of course, you
don't). For example, there may be cross-linked files or wrong metadata.
The idea is to at least recover as much data as possible - that is
better than not recovering anything.
For example, if you have a user database with 100,000 users, you will cry
much less if you lose information about a couple of them (or even about
100) than if you lose all the data.
I know many people who work with databases that do not have recovery -
and they are not happy about that. You may also ask the people on the
mysql list how highly they rank recovery - I'm sure they'll find it
important. And about Oracle and other databases - are you sure there are
people who use them without backups, or who got them damaged because of
a server memory flaw, and who are quite happy there is no such feature? :)
You may say they always use brand-name servers with ECC and so on, and
backups... but mysql users usually don't, as they do not pay $50,000 for
their database :)
HT> I just checked the Solid Server DBA manual from the Solid website, and I
HT> did not find any reference to a repair tool for Solid either. I know that
HT> Solid's predecessor VIA DRE database had a repair tool, but I think
HT> VIA was a buggy database. I conclude from the examples of Oracle,
HT> MS SQL Server and Solid Server that a repair tool is probably not
HT> the best option to obtain safety.
Well, what about BerkeleyDB and Postgres (the latter being the main mysql
competitor)? They have these tools. There are also tools to repair even
quite old DBF files :)
>>This thing is quite important for me, because I won't use a
>>database/filesystem in a production environment when I don't have tools
>>which will help to save the data in an emergency.
HT> But, if your disk breaks, then the only way to save your data is
HT> to have replication, or a backup and logs on a different disk. A repair
HT> tool cannot help in case of a disk failure. Because MySQL writes
HT> its archived log on a high level (= logical log), your data is
HT> quite safe in the log, since an error in a simple write of an ASCII
HT> string is much less likely than a bug in a much more complicated
HT> binary log.
If the disk is totally gone, yes. But there are other problems: for
example, linux 2.4.0 had a problem which led to file system corruption;
corruption can happen on IDE disks, which do not have parity on the bus;
and there are CPU (for example overheating)/chipset/memory/cache
problems... there are a lot of problems where even logs on a different
disk do not help.
About replication - you know there are many different usages of
databases. In my case I have 20 servers under mysql to store statistics
data. Most users are free, so I can't afford to buy 20 more servers for
complete replication. Also, it's statistics events - so if I lose a
couple of rows out of hundreds of millions, I won't cry much :)
I do backups - but it's often better to lose a couple of rows than to
have the 10 hours of downtime required to restore a mysql backup and run
the update logs on it. These rows can be recovered later individually if
needed. Of course Innobase recovery can be faster :)
You may think that if you have a problem with the system, it will hang
quickly without getting a chance to damage the data. It does not. Once I
had a CPU problem where my test program wrote an array of a couple of
million zeroes to memory and afterwards read some of them back with
other values, but the system was still able to work with mysql for
several hours... Of course I had a lot of problems with the files
afterwards :)
>>HT> Hmm... even from a corrupted database you probably can read several
>>HT> tables and recover them. The data to a database usually comes from an
>>HT> outside
>>HT> source, and some data can be recovered from those outside sources.
>>
>>Well, you can't say that. If the damage is in the system area, it could
>>just fail to get the list of tables. These (metadata damages) are the
>>worst thing for a filesystem, and here it should be the same.
HT> I agree, and repairing corrupted metadata can be very difficult too.
HT> A good thing about MySQL is that it keeps metadata also in separate
HT> .frm files. One could use them to repair the Innobase internal data dictionary
HT> if it breaks.
Well, yes. Also, this dictionary should be the same even in an old
backup, which could also help. The main problem, I think, is repairing
trees, allocation tables, cross links and so on.
HT> Some file systems are log-based, i.e., they use the same recovery mechanism
HT> as databases do. I am not sure if NTFS is such a file system.
Well, you can check other such systems: XFS/JFS/ReiserFS - all of them
can normally survive if the system hangs or the power goes down. But all
of them still have their fsck to help in cases like these:
- disk corruption (for example bad blocks)
- unwanted writes to the partition - for example because of an OS
problem or a virus
- internal file system driver bugs (every file system has these :) ) or
OS bugs
- various hardware problems.
Here you have pretty much the same list of problems, which you cannot
solve without a special offline check and repair tool.
>>About outside sources - look at real usage on the web (at least) -
>>for example, take using a database as a forum backend - if the data is
>>lost, will you be able to call the people to write their messages again ?
HT> It may be best to write a log also on the application level to get
HT> the maximum safety. Writing simple text log to a separate disk is
HT> very fast and should not be a bottleneck.
Well, this is also a problem, in at least two ways:
1) The data can be updated by many applications on different disks and
even from the console. Try logging in every application and then
additionally try to serialize the logs :)
2) In my case I get 50 million requests/day, each about 1KB, but once
normalized into the databases it takes only about 150 bytes... :)
HT> Sorry, I must say that I disagree on the repair tool question. But
HT> possibly practical experience will show that such a tool should be written.
HT> What I am afraid of is that writing a general repair tool is a big
HT> task (and getting it to work correctly so that it does not itself corrupt data).
HT> Maybe we should look at what corruption problems will arise in practical
HT> applications and write a specific repair tool for them. Currently,
HT> I am personally the best repair tool :), because I have to make sure
HT> beta testers can use Innobase without problems.
Well, yes. But I'm afraid some users just will not use Innobase because
of the absent recovery tools, without saying anything to you :)
I also have to think about the worst case: what should I do if I hit
this problem? Will I be able to contact you if I need to repair the data?
HT> Also I have to add to the online manual a chapter on backups and
HT> recovery.
HT> The comments and critique you give is the best feedback for Innobase
HT> and it will find its way to future releases.
Thank you.
I really hope Innobase will solve many of my current problems, and I also
like some points of its architecture :)
P.S. I read that the GEMINI guys are using unbuffered I/O for GEMINI
under linux - it would be nice to have such an option for INNOBASE. Why
only an option? Because in some applications which use innobase and other
table types, it may be interesting to shrink the innobase buffer pool and
make it use the OS one (at least as a read cache) - so different table
handlers will use memory efficiently depending on the load.
--
Best regards,
Peter mailto:[EMAIL PROTECTED]