Øystein Grøvlen wrote:
> Some test runs we have done show very long transaction response times
> during checkpointing. This has been seen on several platforms. The
> load is TPC-B like transactions and the write cache is turned off so
> the system is I/O bound. There seem to be two major issues:
Nice investigation, I think I have seen similar problems on Windows.
> 1. Derby does checkpointing by writing all dirty pages by
> RandomAccessFile.write() and then do file sync when the entire
> cache has been scanned. When the page cache is large, the file
> system buffer will overflow during checkpointing, and occasionally
> the writes will take very long. I have observed single write
> operations that took almost 12 seconds. What is even worse is that
> during this period also read performance on other files can be very
> bad. For example, reading an index page from disk can take close
> to 10 seconds when the base table is checkpointed. Hence,
> transactions are severely slowed down.
>
> I have managed to improve response times by flushing every file for
> every 100th write. Is this something we should consider including
> in the code? Do you have better suggestions?
Sounds reasonable.
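Roughly what I understand the idea to be, as an untested sketch (the class and names here are mine for illustration, not what is in the patch):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch of the "sync every Nth write" idea: force dirty
// pages to disk periodically during the checkpoint scan so the OS
// buffer cache never has to absorb one huge burst at the final sync.
class CheckpointWriter {
    private static final int SYNC_INTERVAL = 100; // assumed tunable
    private int writeCount = 0;

    void writePage(RandomAccessFile fileData, long pageOffset,
                   byte[] pageData) throws IOException {
        fileData.seek(pageOffset);
        fileData.write(pageData);
        // Every SYNC_INTERVAL writes, push buffered data to disk.
        if (++writeCount % SYNC_INTERVAL == 0) {
            fileData.getFD().sync();
        }
    }
}
```

The interval would presumably need tuning against page size and device speed; 100 is just the number from your experiment.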
>
> 2. What makes things even worse is that only a single thread can read a
> page from a file at a time. (Note that Derby has one file per
> table). This is because the implementation of RAFContainer.readPage
> is as follow:
>
> synchronized (this) { // 'this' is a FileContainer, i.e. a file object
>     fileData.seek(pageOffset); // fileData is a RandomAccessFile
>     fileData.readFully(pageData, 0, pageSize);
> }
>
> During checkpoint, when I/O is slow, this creates long queues of
> readers. In my run with 20 clients, I observed read requests that
> took more than 20 seconds.
Hmmm, I think that code was written assuming the call would not take
that long!
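One alternative that comes to mind (not something the patch does, just a sketch): NIO's FileChannel has a positional read that takes an explicit offset, so readers don't have to serialize on a shared file pointer at all. Class and method names below are mine:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: FileChannel.read(ByteBuffer, long) does not touch the
// channel's own position, so multiple threads can read different
// pages of the same file concurrently without the synchronized block.
class PositionalReader {
    private final FileChannel channel;

    PositionalReader(RandomAccessFile fileData) {
        this.channel = fileData.getChannel();
    }

    void readPage(long pageOffset, byte[] pageData) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(pageData);
        // Loop because a single read() may return fewer bytes than asked.
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pageOffset + buf.position());
            if (n < 0) {
                throw new IOException("EOF while reading page");
            }
        }
    }
}
```

That would keep the descriptor count at one per container, though I don't know offhand how the NIO path performs on the platforms you tested.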
>
> This behavior will also limit throughput and can partly explain
> why I get low CPU utilization with 20 clients. All my TPC-B
> clients are serialized since most will need 1-2 disk accesses
> (index leaf page and one page of the account table).
>
> Generally, in order to make the OS able to optimize I/O, one should
> have many outstanding I/O calls at a time. (See Frederiksen,
> Bonnet: "Getting Priorities Straight: Improving Linux Support for
> Database I/O", VLDB 2005).
>
> I have attached a patch where I have introduced several file
> descriptors (RandomAccessFile objects) per RAFContainer. These are
> used for reading. The principle is that when all readers are busy,
> a readPage request will create a new reader. (There is a maximum
> number of readers.) With this patch, throughput was improved by
> 50% on linux. The combination of this patch and the synching for
> every 100th write, reduced maximum transaction response times by
> 90%.
Only concern would be the number of open file descriptors, as others
have pointed out. Might want to scavenge open descriptors from
containers that are no longer heavily used.
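Something along these lines, as an untested sketch (all names are mine, and the real patch may be structured quite differently): a small bounded pool per container, with a scavenge hook for cold containers.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical bounded reader pool per container. Readers are created
// lazily up to a cap; beyond that, requests queue. scavenge() lets a
// background task reclaim descriptors from cold containers.
class ReaderPool {
    private static final int MAX_READERS = 4; // assumed cap
    private final File file;
    private final Deque<RandomAccessFile> idle = new ArrayDeque<RandomAccessFile>();
    private int open = 0; // total descriptors created and not yet closed

    ReaderPool(File file) {
        this.file = file;
    }

    synchronized RandomAccessFile checkout()
            throws IOException, InterruptedException {
        while (idle.isEmpty() && open >= MAX_READERS) {
            wait(); // all readers busy and at the cap: queue up
        }
        if (!idle.isEmpty()) {
            return idle.pop();
        }
        open++; // lazily create a new reader, as in the patch
        return new RandomAccessFile(file, "r");
    }

    synchronized void checkin(RandomAccessFile reader) {
        idle.push(reader);
        notify();
    }

    // Close idle descriptors so a cold container holds none open.
    synchronized void scavenge() throws IOException {
        while (!idle.isEmpty()) {
            idle.pop().close();
            open--;
        }
    }
}
```

A periodic scavenger, or an LRU over containers, would keep the total descriptor count bounded across the whole database rather than per file.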
> The patch is not ready for inclusion into Derby, but I would like
> to hear whether you think this is a viable approach.
It seems like these changes are low risk and enable worthwhile
performance increases without completely changing the i/o system.
Such changes would also set a performance bar that a full async
rewrite would have to beat (or at least match).
Dan.