You are right, I'll have to think about this some more.  Until Java
gets asynchronous, guaranteed-synced-to-disk writes, I think we should
continue to use the current method for user-initiated writes.

Suresh Thalamati wrote:
This might be obvious, but I thought I would mention it anyway. My understanding is that one cannot just enable "rwd" (direct I/O) for the checkpoint; it has to be enabled for all writes from the page cache. Otherwise a file sync is required before doing "rwd" writes, because I am not sure that writes made through an earlier "rw" open also get synced to disk when the same file is then opened in "rws" mode, and when the file is opened in "rwd" mode I doubt that they do.
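For illustration, here is a minimal, hypothetical sketch (made-up file name and page size, not Derby code) of the two RandomAccessFile modes in question: a write through a plain "rw" handle is only guaranteed to be on disk after an explicit sync of its file descriptor, while a handle opened in "rwd" forces the file content to the device on every write.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SyncModes {
        public static void main(String[] args) throws IOException {
            File file = new File("page.dat");   // made-up file name
            byte[] page = new byte[4096];       // stand-in for a dirty page

            // Plain "rw": the write may sit in the OS cache; an explicit
            // sync of the file descriptor is needed before it is durable.
            RandomAccessFile rw = new RandomAccessFile(file, "rw");
            rw.write(page);
            rw.getFD().sync();                  // force the data to the device
            rw.close();

            // "rwd": each write call returns only after the file content
            // has been written to the storage device.
            RandomAccessFile rwd = new RandomAccessFile(file, "rwd");
            rwd.write(page);                    // durable when this returns
            rwd.close();
        }
    }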

If files are always opened in direct I/O mode, then page cache cleaning can possibly get slow, and a user query's request for a new page in the buffer pool can also become slow if the cache is full and a page has to be thrown out to get a free one. Another thing to note is that buffer cleaning is done on the Rawstore daemon thread, which is also overloaded with some post-commit work, so the page cache may not get cleaned often in some cases.


Thanks
-suresht


Mike Matrigali wrote:

Excellent, I look forward to your work on concurrent I/O.  I am likely
not to be on the list much for the next 2 weeks, so I won't be able to
help much.  In thinking about this issue I was hoping that somehow
the current container cache could be enhanced to support more than
one open container per container.  Then one would automatically get
control over the open file resource across all containers, by setting
the currently supported "max" on the container pool.

The challenge is that this would be a new concept for the basic services
cache implementation.  What we want is a cache that supports multiple
objects with the same key, and that returns an available one if another
one is "busy".  Also returns a newly opened one, if all are busy.  I
am going to start a thread on this, to see if any other help is
available. If possible I like this approach better than having a queue of open files per container where it hard to control the growth of one queue vs. the growth in another.
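As a rough, hypothetical sketch of that kind of cache (the class and method names here are invented, not the basic services cache API): multiple open handles can live under one key, an idle one is handed out when available, and a new one is opened when all existing ones are busy, subject to a global max.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: a pool that allows several open "containers"
    // (file handles) per key.  A real implementation would also close
    // idle handles of other keys to make room when at the global max.
    class MultiHandlePool<K, V> {
        interface Opener<K, V> { V open(K key) throws Exception; }

        private final Map<K, Deque<V>> idle = new HashMap<>();
        private final Opener<K, V> opener;
        private final int maxTotal;
        private int total;

        MultiHandlePool(Opener<K, V> opener, int maxTotal) {
            this.opener = opener;
            this.maxTotal = maxTotal;
        }

        synchronized V checkOut(K key) throws Exception {
            while (true) {
                Deque<V> free = idle.get(key);
                if (free != null && !free.isEmpty()) {
                    return free.pop();          // reuse an idle handle for this key
                }
                if (total < maxTotal) {
                    V handle = opener.open(key); // all handles for this key are busy
                    total++;                     // count only after a successful open
                    return handle;
                }
                wait();                          // at the global max: wait for a check-in
            }
        }

        synchronized void checkIn(K key, V handle) {
            idle.computeIfAbsent(key, k -> new ArrayDeque<>()).push(handle);
            notifyAll();
        }
    }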

On the checkpoint issue, I would not have a problem with changes to the
current mechanism to do "rwd" type sync I/O rather than a sync at the end (but we will have to support both until we no longer have to support older JVM versions).  I believe this is as close to "direct I/O" as we can get from Java - if you mean something different here, let me know.  The benefit is that I believe it will fix
the problem of the checkpoint flooding the I/O system.  The downside is that
it will cause the total number of I/O's to increase in cases where the
Derby block size is smaller than the filesystem/disk block size -- assuming the OS currently converts our flood of multiple async writes to the same file into a smaller number of bigger I/O's.  I think this trade-off is fine for checkpoints.  If checkpoint efficiency is an issue, there are a number of other ways to address it in the future.
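For concreteness, here is a hedged sketch of the two checkpoint write styles being compared (placeholder file name and page list, not Derby's actual container format):

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.List;

    public class CheckpointWriteStrategies {

        // Current style: buffered writes through "rw", then one sync at the
        // end.  The OS is free to batch and reorder the writes, but issuing
        // them all at once can flood the I/O system.
        static void writeThenSync(File file, List<byte[]> dirtyPages) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                long pos = 0;
                for (byte[] page : dirtyPages) {
                    raf.seek(pos);
                    raf.write(page);
                    pos += page.length;
                }
                raf.getFD().sync();     // one big flush at the end of the checkpoint
            }
        }

        // Proposed style: "rwd" makes each write synchronous, so the
        // checkpoint proceeds one durable page at a time instead of
        // flooding the OS cache, at the cost of more individual I/O's
        // when the Derby page is smaller than the file system block.
        static void writeSynchronously(File file, List<byte[]> dirtyPages) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(file, "rwd")) {
                long pos = 0;
                for (byte[] page : dirtyPages) {
                    raf.seek(pos);
                    raf.write(page);    // durable when this call returns
                    pos += page.length;
                }
            }
        }
    }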

Øystein Grøvlen wrote:

"MM" == Mike Matrigali <[EMAIL PROTECTED]> writes:




    MM> user thread initiated read
    MM>      o should be high priority and should be "fair" with other user
    MM>        initiated reads.
    MM>      o These happen anytime a read of a row causes a cache miss.
    MM>      o Currently only one I/O operation to a file can happen at a time,
    MM>        could be big problem for some types of multi-threaded,
    MM>        highly concurrent low number of table apps.  I think
    MM>        the path here should be to increase the number of
    MM>        concurrent I/O's allowed to be outstanding by allowing
    MM>        each thread to have 1 (assuming sufficient open file
    MM>        resources).  100 outstanding I/O's to a single file may
    MM>        be overkill, but in java we can't know that the file is
    MM>        not actually 100 disks underneath.  The number of I/O's
    MM>        should grow as the actual application load increases,
    MM>        note I still think max I/O's should be tied to number
    MM>        of user threads, plus maybe a small number for
    MM>        background processing.

There was an interesting paper at the last VLDB conference that
discussed the virtue of having many outstanding I/O requests:
    http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf (paper)
    http://www.vldb2005.org/program/slides/wed/s1116-hall.pdf (slides)

The basic message is that many outstanding requests are good.  The
SCSI controller they used in their study was able to handle 32
concurrent requests.  One reason database systems have been
conservative with respect to outstanding requests is that they want to
control the priority of the I/O requests.  We would like user thread
initiated requests to have priority over checkpoint initiated writes.
(The authors suggest building priorities into the file system to solve
this.)

I plan to start working on a patch for allowing more concurrency
between readers within a few weeks.  The main challenge is to find the
best way to organize the open file descriptors (reuse, limiting the maximum
number, etc.).  I will file a JIRA for this.

I also think we should consider mechanisms for read-ahead.
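One possible ingredient for such a patch, shown only as a sketch with an assumed file name and page size: FileChannel's positioned read takes an explicit offset and does not touch the channel's file position, so several threads can read different pages from one open file without serializing on a seek.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ConcurrentPageReads {
        static final int PAGE_SIZE = 4096;      // assumed page size for the sketch

        public static void main(String[] args) throws Exception {
            try (FileChannel ch = FileChannel.open(Paths.get("container.dat"),
                                                   StandardOpenOption.READ)) {
                Thread t1 = new Thread(() -> readPage(ch, 0));
                Thread t2 = new Thread(() -> readPage(ch, 7));
                t1.start(); t2.start();
                t1.join(); t2.join();
            }
        }

        // Positioned read: the explicit offset means no shared seek state,
        // so concurrent readers do not have to take turns on one handle.
        static void readPage(FileChannel ch, long pageNumber) {
            ByteBuffer buf = ByteBuffer.allocate(PAGE_SIZE);
            try {
                ch.read(buf, pageNumber * PAGE_SIZE);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }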

    MM> user thread initiated write
    MM>      o same issues as user initiated read.
    MM>      o happens way less than read, as it should only happen on a cache
    MM>        miss that can't find a non-dirty page in the cache.  Background
    MM>        cache cleaner should be keeping this from happening, though
    MM>        apps that only do updates and cause cache hits are worst case.


    MM> checkpoint initiated write:
    MM>      o sometimes too many checkpoints happen in too short a time.
    MM>      o needs an improved scheduling algorithm, currently just defaults
    MM>        to N number of bytes to the log file no matter what the speed of
    MM>        log writes are.
    MM>      o currently may flood the I/O system causing user reads/writes to
    MM>        stall - on some OS/JVM's this stall is amazing, like tens of
    MM>        seconds.
    MM>      o It is not important that checkpoints run fast; it is more
    MM>        important that they proceed methodically to conclusion while
    MM>        causing little interruption to "real" work by user threads.
    MM>        Various approaches to this were discussed, but no patches yet.

For the scheduling of checkpoints, I was hoping Raymond would come up
with something.  Raymond, are you still with us?
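As one possible flavor of the "proceed methodically to conclusion" idea quoted above, a paced checkpoint loop could look like the sketch below; the batch size and pause are invented tuning knobs, not existing Derby properties.

    import java.io.IOException;
    import java.util.List;

    // Hypothetical sketch of a paced checkpoint: write a small batch of
    // dirty pages, then pause, so user I/O gets a chance to run in between.
    class PacedCheckpoint {
        interface PageWriter { void write(byte[] page) throws IOException; }

        private final int pagesPerBatch;     // invented knob
        private final long pauseMillis;      // invented knob

        PacedCheckpoint(int pagesPerBatch, long pauseMillis) {
            this.pagesPerBatch = pagesPerBatch;
            this.pauseMillis = pauseMillis;
        }

        void run(List<byte[]> dirtyPages, PageWriter writer)
                throws IOException, InterruptedException {
            int written = 0;
            for (byte[] page : dirtyPages) {
                writer.write(page);
                if (++written % pagesPerBatch == 0) {
                    Thread.sleep(pauseMillis);   // yield the I/O system to user threads
                }
            }
        }
    }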

I have discussed our I/O architecture with Solaris engineers, and I was
told that our approach of doing buffered writes followed by an fsync is
the worst approach on Solaris.  They recommended using direct I/O.  I
guess there will be situations where single-threaded direct I/O for
checkpointing will give too little throughput.  In that case, we could
consider a pool of writers.  The challenge would then be how to give
priority to user-initiated requests over multi-threaded checkpoint
writes, as discussed above.
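If a pool of writers were tried, one crude way to lean toward user threads is to run the pool at minimum thread priority; this is only a sketch of the idea (invented names), and it biases CPU scheduling rather than prioritizing the I/O requests themselves.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadFactory;

    // Hypothetical sketch: a small pool of checkpoint writer threads that
    // run at minimum priority so user-initiated work tends to win the CPU.
    class CheckpointWriterPool {
        static ExecutorService create(int writers) {
            ThreadFactory lowPriority = runnable -> {
                Thread t = new Thread(runnable, "checkpoint-writer");
                t.setDaemon(true);
                t.setPriority(Thread.MIN_PRIORITY);
                return t;
            };
            return Executors.newFixedThreadPool(writers, lowPriority);
        }
    }

Checkpoint page writes would then be submitted to such a pool while user-initiated I/O stays on its own threads; real request-level priority would still have to come from the OS or file system, as the paper's authors suggest.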






