Hello Thomas,

Disclaimer: I am not familiar at all with the internals of H2 (neither 
PageStore nor MVStore). I just looked a bit at the code, but nothing too 
fancy. Anything I write here might therefore be entirely wrong :-).

2015-06-27 Thomas Mueller:

> I would not move to the MVStore yet, as there are known problem in
> case of power failure (in case of re-ordered writes). I'm working on
> that right now. There is also a known problem with corruption after
> out-of-memory, which is fixed in the trunk but not released yet.

I wonder how such reordered writes are coped with in PageStore. In a 
traditional "write-ahead log + main storage area" setting, I would expect 
that – for each "operation", e.g., an insertion into a table plus the 
corresponding changes to indexes etc – first the log is written, before any 
of the corresponding changes to the main storage area are written. Then, a 
"write barrier" is inserted to make sure that the log writes get to the 
platter before any of the corresponding main storage area writes get to the 
platter (hopefully disk drives don't violate this order too much, as 
consumer-grade drives supposedly do), and only then the writes to the main 
storage area are performed. If such a write barrier is not used, the OS is 
very likely to reorder the writes, *especially* if the blocks of the 
write-ahead log and the main storage area are intermingled in the same file 
(are they in H2?).

I would not know how to insert a write barrier in Java, except by fsync'ing 
the log after writing to it, and only then starting to issue the writes to 
the main storage area. But as H2 seems to put the log in the same file as 
the main storage area, either a "partial file fsync" would be needed (of 
the form "only fsync at least bytes X to Y of this file", where bytes X to 
Y is the location of the part of the log that was just written; I think 
Java doesn't support such an API?), or the whole file needs to be fsync'ed. 
The latter would be very bad for performance, as all previously issued 
writes to the main storage area (which are typically spread out rather 
randomly) would need to be performed by the OS (and in principle the drive) 
before the call could return.
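
To make the ordering I have in mind concrete, here is a minimal sketch in 
plain java.nio. It assumes the log lives in its own file, which is not how H2 
does it today as far as I can tell. The class and method names 
(WriteBarrierSketch, applyOperation) are made up for illustration and are not 
H2 code. The only "barrier" available is FileChannel.force, which as far as I 
know maps to fsync/fdatasync on POSIX systems; whether the drive itself honours 
it is a separate question.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WriteBarrierSketch {

    /** Writes the whole buffer starting at pos; a single write() may be partial. */
    static void writeFully(FileChannel ch, ByteBuffer buf, long pos) throws IOException {
        while (buf.hasRemaining()) {
            pos += ch.write(buf, pos);
        }
    }

    /**
     * Hypothetical helper: logs one operation, forces the log to the device,
     * and only then writes the corresponding page to the main storage file.
     */
    public static void applyOperation(Path logFile, Path mainFile, byte[] logRecord,
                                      byte[] page, long pageOffset) throws IOException {
        try (FileChannel log = FileChannel.open(logFile,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileChannel main = FileChannel.open(mainFile,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // 1. Append the log record (this only hands it to the OS).
            writeFully(log, ByteBuffer.wrap(logRecord), log.size());
            // 2. "Write barrier": block until the OS reports the log as written.
            //    true also flushes metadata, since appending changes the file length.
            log.force(true);
            // 3. Only now issue the write to the main storage area.
            writeFully(main, ByteBuffer.wrap(page), pageOffset);
        }
    }
}
```

Note that only the log is forced here; the main storage write is left to the 
OS to schedule, which is the whole point of keeping the two apart.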

Even fsync'ing only the log after each "operation" would be bad for 
performance (severely limiting the number of operations per second), but 
that could be coped with by delaying fsync'ing the log until a COMMIT 
happens or main storage area writes are needed because of memory pressure. 
Any main storage area writes (that correspond to the operations whose log 
parts are not fsync'ed yet) must then of course be delayed until the 
corresponding log writes are fsync'ed. It seems that H2 is already 
implemented like this (except for the write barrier/fsync part); I assume 
that the writing happens on a separate thread, so that having to wait for 
fsync'ing would not prevent the SQL-performing code from just continuing 
while the fsync is in flight (unless reading uses the same thread)?
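
The delayed-fsync idea could look roughly like the sketch below. It is not 
based on the H2 code; DeferredSyncLog and its methods are hypothetical names. 
The assumption is that every dirty page remembers the log offset it depends 
on, and that ensureDurable is called with that offset before the page is 
written to the main storage area (and on COMMIT).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class DeferredSyncLog {
    private final FileChannel log;
    private long appendedUpTo; // log bytes handed to the OS so far
    private long durableUpTo;  // log bytes covered by the last force()
    private int syncCount;     // number of force() calls, for illustration only

    public DeferredSyncLog(FileChannel log) throws IOException {
        this.log = log;
        // Assumption: whatever is already in the file was synced earlier.
        this.appendedUpTo = log.size();
        this.durableUpTo = appendedUpTo;
    }

    /** Appends a record without syncing; returns the offset just past it. */
    public synchronized long append(byte[] record) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(record);
        while (buf.hasRemaining()) {
            appendedUpTo += log.write(buf, appendedUpTo);
        }
        return appendedUpTo;
    }

    /**
     * Call on COMMIT, or before writing a main-storage page that depends on
     * log bytes up to endOffset. Syncs only if that part is not durable yet.
     */
    public synchronized void ensureDurable(long endOffset) throws IOException {
        if (endOffset > durableUpTo) {
            long target = appendedUpTo; // one force() covers everything appended so far
            log.force(true);
            durableUpTo = target;
            syncCount++;
        }
    }

    public synchronized int syncCount() {
        return syncCount;
    }
}
```

With this, three appended records followed by ensureDurable for the second and 
then the third cost a single fsync, because the first call already covers 
everything that had been appended at that point.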

As I wrote before, if this "write barrier" stuff (or the equivalent "fsync" 
stuff) is not performed, then the OS can reorder writes in any order it 
wants before sending them to the drive (I think?). When a power failure 
occurs, some writes to the main storage area will probably already have 
occurred before all corresponding writes to the log have occurred, which 
would then likely result in a corrupt database.

Could you please shed some light onto how H2 tries to cope with this 
potential problem? By following the code, I could see that H2 makes sure 
that the log is written ("sent to the OS" in FileUtils.writeFully) before 
the corresponding changes to the main storage area are, but I could not 
find anything that would prevent the OS from reordering the writes. If that 
problem does indeed exist and is not solved by H2, I might take a stab at 
getting into the code and trying to figure out what would be needed to 
implement such a solution. No guarantees of course, as I have lots of other 
stuff to do as well; I just think/hope it would be easier to fix H2 than to 
switch to another DBMS :-).

Assuming that the problem exists, the following would seem to be needed:

* Either find a way to fsync only a part of a file in Java, or (more 
likely),
* Put the log in a separate file (probably right next to the main storage 
area file).

  → The second solution introduces a problem where a user could 
accidentally mix a log file and a main file that don't belong together, but 
the resulting problems (corrupting the whole DB when replaying the log 
file) could be avoided by storing some kind of random ID in both files and 
complaining if they are not the same. Also, a user could mix an old log 
file and a new main file, which would also result in corrupting the DB when 
replaying it. A solution for this would be a bit more complex. But then, I 
don't think that these are scenarios that would need to be protected 
against in a first implementation.
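
The random ID check could be as simple as the following sketch. The layout is 
purely an assumption for illustration (a 16-byte ID at the very start of both 
files); H2's real file headers look different, and StoreIdCheck is a made-up 
name.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.UUID;

public class StoreIdCheck {
    static final int ID_BYTES = 16;

    /** Generates a random ID, to be stored in both files when the database is created. */
    public static byte[] newStoreId() {
        UUID id = UUID.randomUUID();
        return ByteBuffer.allocate(ID_BYTES)
                .putLong(id.getMostSignificantBits())
                .putLong(id.getLeastSignificantBits())
                .array();
    }

    /** Reads the first 16 bytes of the file; throws EOFException if the file is shorter. */
    static byte[] readStoreId(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            byte[] id = new byte[ID_BYTES];
            in.readFully(id);
            return id;
        }
    }

    /** Refuses to continue (e.g. before replaying the log) if the IDs differ. */
    public static void checkSameStore(Path mainFile, Path logFile) throws IOException {
        if (!Arrays.equals(readStoreId(mainFile), readStoreId(logFile))) {
            throw new IOException("Log file " + logFile
                    + " does not belong to main file " + mainFile);
        }
    }
}
```

This only catches files from different databases. It does nothing about an 
old log file next to a new main file of the same database, which is the more 
complex case mentioned above.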

I think that a similar problem would need solving for MVStore too, but I 
know too little about MVStore and I haven't given it enough thought (yet) 
to be able to give you any useful advice.

Greetings,

Nicolas [ works for the same company as Rob ]
