You are right, I'll have to think about this some more.  Until Java
gets asynchronous, guaranteed-synced-to-disk writes, I think we should
continue to use the current method for user-initiated writes.

Suresh Thalamati wrote:
This might be obvious, but I thought I would mention it anyway. My understanding is that one cannot just enable "rwd" (direct I/O) for the checkpoint; it has to be enabled for all writes from the page cache. Otherwise a file sync is required before doing "rwd" writes, because I am not sure that writes made through an earlier "rw" open also get synced to disk when the same file is then opened in "rws" mode, and when the file is opened in "rwd" mode I doubt that they do.
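For illustration, here is a minimal, hypothetical sketch (made-up file name and page size, not Derby code) of the two RandomAccessFile modes in question: a write through a plain "rw" handle is only guaranteed to be on disk after an explicit sync of its file descriptor, while a handle opened in "rwd" forces the file content to the device on every write.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SyncModes {
        public static void main(String[] args) throws IOException {
            File file = new File("page.dat");   // made-up file name
            byte[] page = new byte[4096];       // stand-in for a dirty page

            // Plain "rw": the write may sit in the OS cache; an explicit
            // sync of the file descriptor is needed before it is durable.
            RandomAccessFile rw = new RandomAccessFile(file, "rw");
            rw.write(page);
            rw.getFD().sync();                  // force the data to the device
            rw.close();

            // "rwd": each write call returns only after the file content
            // has been written to the storage device.
            RandomAccessFile rwd = new RandomAccessFile(file, "rwd");
            rwd.write(page);                    // durable when this returns
            rwd.close();
        }
    }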

If files are always opened in direct I/O mode, then page cache cleaning can possibly get slow, and a user query's request for a new page in the buffer pool can also become slow if the cache is full and a page has to be thrown out to get a free one. Another thing to note is that buffer cleaning is done on the Rawstore daemon thread, which is also overloaded with some post-commit work, so the page cache may not get cleaned often in some cases.


Thanks
-suresht


Mike Matrigali wrote:

Excellent, I look forward to your work on concurrent I/O.  I am likely
not to be on the list much for the next 2 weeks, so I won't be able to
help much.  In thinking about this issue I was hoping that somehow
the current container cache could be enhanced to support more than
one open container per container.  Then one would automatically get
control over the open file resource across all containers, by setting
the currently supported "max" on the container pool.

The challenge is that this would be a new concept for the basic services
cache implementation.  What we want is a cache that supports multiple
objects with the same key, and that returns an available one if another
one is "busy".  Also returns a newly opened one, if all are busy.  I
am going to start a thread on this, to see if any other help is
available. If possible I like this approach better than having a queue of open files per container where it hard to control the growth of one queue vs. the growth in another.
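As a rough, hypothetical sketch of that kind of cache (the class and method names here are invented, not the basic services cache API): multiple open handles can live under one key, an idle one is handed out when available, and a new one is opened when all existing ones are busy, subject to a global max.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: a pool that allows several open "containers"
    // (file handles) per key.  A real implementation would also close
    // idle handles of other keys to make room when at the global max.
    class MultiHandlePool<K, V> {
        interface Opener<K, V> { V open(K key) throws Exception; }

        private final Map<K, Deque<V>> idle = new HashMap<>();
        private final Opener<K, V> opener;
        private final int maxTotal;
        private int total;

        MultiHandlePool(Opener<K, V> opener, int maxTotal) {
            this.opener = opener;
            this.maxTotal = maxTotal;
        }

        synchronized V checkOut(K key) throws Exception {
            while (true) {
                Deque<V> free = idle.get(key);
                if (free != null && !free.isEmpty()) {
                    return free.pop();          // reuse an idle handle for this key
                }
                if (total < maxTotal) {
                    V handle = opener.open(key); // all handles for this key are busy
                    total++;                     // count only after a successful open
                    return handle;
                }
                wait();                          // at the global max: wait for a check-in
            }
        }

        synchronized void checkIn(K key, V handle) {
            idle.computeIfAbsent(key, k -> new ArrayDeque<>()).push(handle);
            notifyAll();
        }
    }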

On the checkpoint issue, I would not have a problem with changes to the
current mechanism to do "rwd" type sync I/O rather than a sync at the end (but we will have to support both until we no longer have to support older JVM versions).  I believe this is as close to "direct I/O" as we can get from Java - if you mean something different here, let me know.  The benefit is that I believe it will fix
the problem of the checkpoint flooding the I/O system.  The downside is that
it will cause the total number of I/O's to increase in cases where the
Derby block size is smaller than the filesystem/disk block size -- assuming the OS currently converts our flood of multiple async writes to the same file into a smaller number of bigger I/O's.  I think this trade-off is fine for checkpoints.  If checkpoint efficiency is an issue, there are a number of other ways to address it in the future.
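For concreteness, here is a hedged sketch of the two checkpoint write styles being compared (placeholder file name and page list, not Derby's actual container format):

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.List;

    public class CheckpointWriteStrategies {

        // Current style: buffered writes through "rw", then one sync at the
        // end.  The OS is free to batch and reorder the writes, but issuing
        // them all at once can flood the I/O system.
        static void writeThenSync(File file, List<byte[]> dirtyPages) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                long pos = 0;
                for (byte[] page : dirtyPages) {
                    raf.seek(pos);
                    raf.write(page);
                    pos += page.length;
                }
                raf.getFD().sync();     // one big flush at the end of the checkpoint
            }
        }

        // Proposed style: "rwd" makes each write synchronous, so the
        // checkpoint proceeds one durable page at a time instead of
        // flooding the OS cache, at the cost of more individual I/O's
        // when the Derby page is smaller than the file system block.
        static void writeSynchronously(File file, List<byte[]> dirtyPages) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(file, "rwd")) {
                long pos = 0;
                for (byte[] page : dirtyPages) {
                    raf.seek(pos);
                    raf.write(page);    // durable when this call returns
                    pos += page.length;
                }
            }
        }
    }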

Øystein Grøvlen wrote:

"MM" == Mike Matrigali <[EMAIL PROTECTED]> writes:




    MM> user thread initiated read
    MM>      o should be high priority and should be "fair" with other user
    MM>        initiated reads.
    MM>      o These happen anytime a read of a row causes a cache miss.
    MM>      o Currently only one I/O operation to a file can happen at a time,
    MM>        could be big problem for some types of multi-threaded,
    MM>        highly concurrent low number of table apps.  I think
    MM>        the path here should be to increase the number of
    MM>        concurrent I/O's allowed to be outstanding by allowing
    MM>        each thread to have 1 (assuming sufficient open file
    MM>        resources).  100 outstanding I/O's to a single file may
    MM>        be overkill, but in java we can't know that the file is
    MM>        not actually 100 disks underneath.  The number of I/O's
    MM>        should grow as the actual application load increases,
    MM>        note I still think max I/O's should be tied to number
    MM>        of user threads, plus maybe a small number for
    MM>        background processing.

There was an interesting paper at the last VLDB conference that
discussed the virtue of having many outstanding I/O requests:
    http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf (paper)
    http://www.vldb2005.org/program/slides/wed/s1116-hall.pdf (slides)

The basic message is that many outstanding requests are good.  The
SCSI controller they used in their study was able to handle 32
concurrent requests.  One reason database systems have been
conservative with respect to outstanding requests is that they want to
control the priority of the I/O requests.  We would like user thread
initiated requests to have priority over checkpoint initiated writes.
(The authors suggest building priorities into the file system to solve
this.)

I plan to start working on a patch for allowing more concurrency
between readers within a few weeks.  The main challenge is to find the
best way to organize the open file descriptors (reuse, limiting the maximum
number, etc.).  I will file a JIRA for this.

I also think we should consider mechanisms for read-ahead.
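One possible ingredient for such a patch, shown only as a sketch with an assumed file name and page size: FileChannel's positioned read takes an explicit offset and does not touch the channel's file position, so several threads can read different pages from one open file without serializing on a seek.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ConcurrentPageReads {
        static final int PAGE_SIZE = 4096;      // assumed page size for the sketch

        public static void main(String[] args) throws Exception {
            try (FileChannel ch = FileChannel.open(Paths.get("container.dat"),
                                                   StandardOpenOption.READ)) {
                Thread t1 = new Thread(() -> readPage(ch, 0));
                Thread t2 = new Thread(() -> readPage(ch, 7));
                t1.start(); t2.start();
                t1.join(); t2.join();
            }
        }

        // Positioned read: the explicit offset means no shared seek state,
        // so concurrent readers do not have to take turns on one handle.
        static void readPage(FileChannel ch, long pageNumber) {
            ByteBuffer buf = ByteBuffer.allocate(PAGE_SIZE);
            try {
                ch.read(buf, pageNumber * PAGE_SIZE);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }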

    MM> user thread initiated write
    MM>      o same issues as user initiated read.
    MM>      o happens way less than read, as it should only happen on a cache
    MM>        miss that can't find a non-dirty page in the cache.  Background
    MM>        cache cleaner should be keeping this from happening, though
    MM>        apps that only do updates and cause cache hits are worst case.


    MM> checkpoint initiated write:
    MM>      o sometimes too many checkpoints happen in too short a time.
    MM>      o needs an improved scheduling algorithm, currently just defaults
    MM>        to N number of bytes to the log file no matter what the speed of
    MM>        log writes are.
    MM>      o currently may flood the I/O system causing user reads/writes to
    MM>        stall - on some OS/JVM's this stall is amazing, like tens of
    MM>        seconds.
    MM>      o It is not important that checkpoints run fast; it is more
    MM>        important that they proceed methodically to conclusion while
    MM>        causing little interruption to "real" work by user threads.
    MM>        Various approaches to this were discussed, but no patches yet.

For the scheduling of checkpoints, I was hoping Raymond would come up
with something.  Raymond, are you still with us?
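As one possible flavor of the "proceed methodically to conclusion" idea quoted above, a paced checkpoint loop could look like the sketch below; the batch size and pause are invented tuning knobs, not existing Derby properties.

    import java.io.IOException;
    import java.util.List;

    // Hypothetical sketch of a paced checkpoint: write a small batch of
    // dirty pages, then pause, so user I/O gets a chance to run in between.
    class PacedCheckpoint {
        interface PageWriter { void write(byte[] page) throws IOException; }

        private final int pagesPerBatch;     // invented knob
        private final long pauseMillis;      // invented knob

        PacedCheckpoint(int pagesPerBatch, long pauseMillis) {
            this.pagesPerBatch = pagesPerBatch;
            this.pauseMillis = pauseMillis;
        }

        void run(List<byte[]> dirtyPages, PageWriter writer)
                throws IOException, InterruptedException {
            int written = 0;
            for (byte[] page : dirtyPages) {
                writer.write(page);
                if (++written % pagesPerBatch == 0) {
                    Thread.sleep(pauseMillis);   // yield the I/O system to user threads
                }
            }
        }
    }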

I have discussed our I/O architecture with Solaris engineers, and I was
told that our approach of doing buffered writes followed by an fsync is
the worst approach on Solaris.  They recommended using direct I/O.  I
guess there will be situations where single-threaded direct I/O for
checkpointing will give too little throughput.  In that case, we could
consider a pool of writers.  The challenge would then be how to give
priority to user-initiated requests over multi-threaded checkpoint
writes, as discussed above.
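If a pool of writers were tried, one crude way to lean toward user threads is to run the pool at minimum thread priority; this is only a sketch of the idea (invented names), and it biases CPU scheduling rather than prioritizing the I/O requests themselves.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadFactory;

    // Hypothetical sketch: a small pool of checkpoint writer threads that
    // run at minimum priority so user-initiated work tends to win the CPU.
    class CheckpointWriterPool {
        static ExecutorService create(int writers) {
            ThreadFactory lowPriority = runnable -> {
                Thread t = new Thread(runnable, "checkpoint-writer");
                t.setDaemon(true);
                t.setPriority(Thread.MIN_PRIORITY);
                return t;
            };
            return Executors.newFixedThreadPool(writers, lowPriority);
        }
    }

Checkpoint page writes would then be submitted to such a pool while user-initiated I/O stays on its own threads; real request-level priority would still have to come from the OS or file system, as the paper's authors suggest.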






