>>>>> "MM" == Mike Matrigali <[EMAIL PROTECTED]> writes:
MM> Øystein Grøvlen wrote:
>> Some test runs we have done show very long transaction response times
>> during checkpointing. This has been seen on several platforms. The
>> load is TPC-B like transactions, and the write cache is turned off, so
>> the system is I/O bound. There seem to be two major issues:
>>
>> 1. Derby does checkpointing by writing all dirty pages with
>>    RandomAccessFile.write() and then doing a file sync when the entire
>>    cache has been scanned. When the page cache is large, the file
>>    system buffer will overflow during checkpointing, and occasionally
>>    the writes will take very long. I have observed single write
>>    operations that took almost 12 seconds. What is even worse is that
>>    during this period, read performance on other files can also be
>>    very bad. For example, reading an index page from disk can take
>>    close to 10 seconds while the base table is being checkpointed.
>>    Hence, transactions are severely slowed down.
>>
>>    I have managed to improve response times by flushing every file
>>    after every 100th write. Is this something we should consider
>>    including in the code? Do you have better suggestions?
MM> Probably the first thing to do is make sure we are doing a reasonable
MM> number of checkpoints; most people who run these benchmarks configure
MM> the system so that it does either 0 or 1 checkpoints during the run.
We are not doing this as a benchmark exercise. We have just chosen a
TPC-B like load because it represents a typical update-intensive load
where a single table holds most of the data volume.
MM> This goes to the ongoing discussion on how best to automatically
MM> configure checkpoint interval - the current defaults don't make much
MM> sense for an OLTP system.
I agree.
MM> I had hoped that with the current checkpoint design, by the time the
MM> file sync happened, all the pages would usually have already made it
MM> to disk. The hope was that while holding the write semaphore we
MM> would not do any I/O and thus not cause much interruption to the
MM> rest of the system.
Since the checkpoint will do buffered writes of all dirty pages, its
write rate will be much higher than the write rate of the disk. There
is no way all the pages can make it to disk before sync is called.
Since the write is buffered, the write semaphore will not be held very
long for each write. (I am not quite sure what you mean by the write
semaphore. Something in the OS, or the synchronization on the file
container?)
Anyhow, I do not feel the problem is that a write or sync takes very
long. The problem is that this impacts read performance in two ways:

 - The OS gives long response times on reads when the checkpoint
   stresses the file system.
 - Reads from a file have to wait for an ongoing write request to
   complete (only one I/O per file at a time).

The solution seems to be to reduce the I/O utilization of the
checkpoint (i.e., reduce the write rate).
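To make the syncing for every Nth write concrete, this is roughly what
my experiment does (a simplified sketch only; the class, method and
parameter names are made up and do not match the real RAFContainer or
cache manager code):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch: write dirty pages as before, but force the data to disk
    // every SYNC_INTERVAL writes so the OS never builds up a huge
    // backlog that has to be flushed in one go at the end.
    class CheckpointSyncSketch {
        private static final int SYNC_INTERVAL = 100;

        static void writeDirtyPages(RandomAccessFile file, byte[][] pages,
                                    long[] offsets, int pageSize)
                throws IOException {
            int sinceSync = 0;
            for (int i = 0; i < pages.length; i++) {
                file.seek(offsets[i]);
                file.write(pages[i], 0, pageSize);  // buffered write
                if (++sinceSync >= SYNC_INTERVAL) {
                    file.getFD().sync();            // flush the backlog early
                    sinceSync = 0;
                }
            }
            file.getFD().sync();                    // final sync for the rest
        }
    }

The point is only that the sync happens incrementally; the interval of
100 is just the value I happened to test with.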
MM> What OS/filesystem are you seeing these results on? Any idea why a
MM> write would take 10 seconds? Do you think the write blocks when the
MM> sync is called? If so, do you think it blocks at a Derby sync point
MM> or an OS-internal sync point?
I have seen this both on Linux and Solaris. My hypothesis is that a
write may take 10 seconds when the file system buffer is full. I am
not sure why it is this way, but it seems like it helps to sync
regularly. My guess is that this is because we avoid filling the
buffer. We will try to investigate further.
I do not think the long write blocks on a Derby sync point. What I
measured was just the time to call two RandomAccessFile methods (seek
and write).
MM> We moved away from the write-then-sync approach for log files
MM> because we found that on some OS/filesystems the performance of the
MM> sync was linearly related to the size of the file rather than to the
MM> number of modified pages. I left it in place for the checkpoint as
MM> it seemed an easy way to do asynchronous writes, which I thought
MM> would provide the OS with basically the equivalent of many
MM> concurrent writes to do.
MM> Another approach may be to change the checkpoint to use the direct
MM> sync write, but have it get its own open on the file, similar to
MM> what you describe below - that would mean other readers/writers
MM> would never block on checkpoint reads/writes, at least at the Derby
MM> level. Whether this would increase or decrease overall checkpoint
MM> elapsed time is probably system dependent - I am pretty sure it
MM> would increase the time on Windows, but I continue to believe the
MM> elapsed time of the checkpoint is not important - as you point out,
MM> it is more important to make sure it interferes with "real" work as
MM> little as possible.
I agree that the elapsed time of the checkpoint is not that important,
but scheduling only one page at a time for writing will reduce the
bandwidth of the I/O system, since it will not be possible for the I/O
system to reorder operations for optimal performance. I would rather
suggest doing a bunch of unbuffered writes and then waiting for a
while before writing more pages. Alternatively, one could use a pool
of threads that do direct I/O in parallel, as sketched below.
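For the pool-of-threads alternative, I am thinking of something along
these lines (just a sketch; the names are made up, and real code would
have to fit into the existing cache manager and error handling):

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch: each writer thread gets its own descriptor on the
    // container, opened in "rwd" mode so every write is forced to the
    // device, and a slice of the dirty pages. Several writes are then
    // outstanding at once, so the I/O scheduler can reorder them.
    class ParallelWriterSketch {
        static void writeDirtyPages(final File container,
                                    final byte[][] pages,
                                    final long[] offsets,
                                    final int pageSize,
                                    int nThreads)
                throws InterruptedException {
            final int stride = nThreads;
            Thread[] writers = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                final int first = t;
                writers[t] = new Thread(new Runnable() {
                    public void run() {
                        try {
                            RandomAccessFile raf =
                                new RandomAccessFile(container, "rwd");
                            try {
                                for (int i = first; i < pages.length; i += stride) {
                                    raf.seek(offsets[i]);
                                    raf.write(pages[i], 0, pageSize);
                                }
                            } finally {
                                raf.close();
                            }
                        } catch (IOException ioe) {
                            // a real implementation would report this
                            // back to the checkpoint
                        }
                    }
                });
                writers[t].start();
            }
            for (int t = 0; t < nThreads; t++) {
                writers[t].join();
            }
        }
    }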
>> 2. What makes things even worse is that only a single thread can read
>>    a page from a file at a time. (Note that Derby has one file per
>>    table.) This is because the implementation of RAFContainer.readPage
>>    is as follows:
>>
>>        synchronized (this) {  // 'this' is a FileContainer, i.e. a file object
>>            fileData.seek(pageOffset);  // fileData is a RandomAccessFile
>>            fileData.readFully(pageData, 0, pageSize);
>>        }
>>
>>    During a checkpoint, when I/O is slow, this creates long queues of
>>    readers. In my run with 20 clients, I observed read requests that
>>    took more than 20 seconds.
>>
>>    This behavior will also limit throughput and can partly explain
>>    why I get low CPU utilization with 20 clients. All my TPC-B
>>    clients are serialized since most of them will need 1-2 disk
>>    accesses (the index leaf page and one page of the account table).
>>    Generally, in order to make the OS able to optimize I/O, one
>>    should have many outstanding I/O calls at a time. (See Frederiksen,
>>    Bonnet: "Getting Priorities Straight: Improving Linux Support for
>>    Database I/O", VLDB 2005.)
>>
>>    I have attached a patch where I have introduced several file
>>    descriptors (RandomAccessFile objects) per RAFContainer. These are
>>    used for reading. The principle is that when all readers are busy,
>>    a readPage request will create a new reader. (There is a maximum
>>    number of readers.) With this patch, throughput was improved by
>>    50% on Linux. The combination of this patch and the syncing for
>>    every 100th write reduced maximum transaction response times by
>>    90%.
>>
>>    The patch is not ready for inclusion into Derby, but I would like
>>    to hear whether you think this is a viable approach.
>>
MM> I now see what you were talking about; I was thinking at too high a
MM> level. In your test, is the data spread across more than a single
MM> disk?
No, data is on a single disk. Log is on a separate disk.
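To make the idea in the patch a bit more concrete, the reading side
works roughly like this (a rough sketch only, not the actual patch
code; all names are made up):

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;

    // Sketch: a small pool of RandomAccessFile objects per container,
    // used only for reads. A reader takes a free descriptor if one
    // exists, opens a new one if all are busy and the maximum has not
    // been reached, and otherwise waits.
    class ReaderPoolSketch {
        private final File containerFile;
        private final int maxReaders;
        private final ArrayList freeReaders = new ArrayList();
        private int openReaders = 0;

        ReaderPoolSketch(File containerFile, int maxReaders) {
            this.containerFile = containerFile;
            this.maxReaders = maxReaders;
        }

        void readPage(long pageOffset, byte[] pageData, int pageSize)
                throws IOException, InterruptedException {
            RandomAccessFile reader = checkOutReader();
            try {
                // No container-wide lock is held here, so reads on
                // different descriptors proceed in parallel.
                reader.seek(pageOffset);
                reader.readFully(pageData, 0, pageSize);
            } finally {
                checkInReader(reader);
            }
        }

        private synchronized RandomAccessFile checkOutReader()
                throws IOException, InterruptedException {
            while (true) {
                if (!freeReaders.isEmpty()) {
                    return (RandomAccessFile)
                        freeReaders.remove(freeReaders.size() - 1);
                }
                if (openReaders < maxReaders) {
                    RandomAccessFile r =
                        new RandomAccessFile(containerFile, "r");
                    openReaders++;
                    return r;
                }
                wait();   // all readers busy and at the limit
            }
        }

        private synchronized void checkInReader(RandomAccessFile reader) {
            freeReaders.add(reader);
            notify();     // wake one waiting reader, if any
        }
    }

Writes still go through the original descriptor, so this only changes
the read path.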
MM> Especially with data spread across multiple disks, it would make
MM> sense to allow multiple concurrent reads. That config was just not
MM> the target of the original Derby code - so especially as we target
MM> more processors and more disks, changes will need to be made.
MM> I wonder if Java's new async interfaces may be more appropriate;
MM> maybe we just need to change every read into an async read followed
MM> by a wait, and the same for writes. I have not used the interfaces -
MM> does anyone have experience with them, and is there any downside to
MM> using them vs. the current RandomAccessFile interfaces?
I have looked at Java NIO and could not find anything about
asynchronous I/O for random access files. There is a FileChannel
class, but that seems only to support sequential I/O. Have I missed
something? On the other hand, there is JSR 203 for this, but it will
unfortunately not make it into Mustang (Java 6).
MM> Your approach may be fine; one consideration is the number of file
MM> descriptors necessary to run the system. On some very small
MM> platforms, the only way to run the original Cloudscape was to change
MM> the size of the container cache to limit the number of file
MM> descriptors.
Maybe we could have a property to limit the number of file
descriptors.
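For illustration only (this property name does not exist in Derby; it
is just an example of how such a limit could be wired up):

    // Hypothetical property name, not an existing Derby property.
    int maxReaders = Integer.getInteger(
            "derby.storage.maxReadersPerContainer", 4).intValue();

The reader pool in the sketch above would then simply be created with
this value as its maximum.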
--
Øystein