Hi,

Quite an interesting stuff, Thomas! Thanks for sharing!

Few points:
You decided to make updates synchronous in favor of single threaded 
performance but concurrent updates is quite an important use case too in 
our century of multicore CPUs. May be it will not be too hard to make this 
configurable?

Also about Map interface. Why doesn't it at least NavigableMap which is 
better reflects sorted nature of the data structure and directly supports 
range operations? ConcurrentNavigableMap is preferable of course. For 
example I created scalable in-memory table engine with snapshot isolation 
based on SnapTree (https://github.com/nbronson/snaptree). It would be very 
interesting to just replace one ConcurrentNavigableMap implementation with 
another to compare performance.

You say that page size will be variable, will not it be harder to recover 
corrupted databases because of this?

Sergi


On Wednesday, September 5, 2012 10:34:25 PM UTC+4, Thomas Mueller wrote:
>
> Hi,
>
> I'm working on a new storage backend for H2, currently called MVStore (for 
> Multi-Version Store). It is not ready yet to be used as the storage backend 
> for the database, but you can try it out in 'standalone' mode if you want 
> (for example to run benchmarks). Please be aware the API is unstable and 
> will change.
>
> == Map API ==
>
> In addition to be the storage backend for H2, there will be a Java Map API 
> similar to IndexedDB and JDBM3 [1]. The goals and internal architecture of 
> JDBM3 / JDBM4 and MVStore are quite different so it doesn't make much sense 
> to share code (I discussed this with Jan Kotek).
>
> This is what you can do now. MVMap stands for Multi-Version Map, and 
> implements the java.util.Map interface:
>
>     // open the store (in-memory if fileName is null)
>     MVStore s = MVStore.open(fileName);
>
>     // create/get the map "data"
>     // the String.class, String.class will be optional later
>     MVMap<String, String> map = s.openMap("data",
>             String.class, String.class);
>
>     // add some data
>     map.put("1", "Hello");
>     map.put("2", "World");
>
>     // get the current version, for later use
>     long oldVersion = s.getCurrentVersion();
>
>     // from now on, the old version is read-only
>     s.incrementVersion();
>
>     // more changes, in the new version
>     // changes can be rolled back if required
>     // changes always go into 'head' (the newest version)
>     map.put("1", "Hi");
>     map.remove("2");
>
>     // access the old data (before incrementVersion)
>     MVMap<String, String> oldMap =
>             map.openVersion(oldVersion);
>
>     // store the newest version to disk
>     s.store();
>
>     // print the old version (can be done
>     // concurrently with further modifications)
>     // this will print Hello World
>     System.out.println(oldMap.get("1"));
>     System.out.println(oldMap.get("2"));
>
>     // print the newest version ("Hi")
>     System.out.println(map.get("1"));
>
>     // close the store - this doesn't write to disk
>     s.close();
>
> == Core / Modularity ==
>
> The core only supports very few data types (actually, only String keys and 
> values at the moment), but data types are pluggable, and I will write a 
> module to support all serializable Java objects.
>
> The idea is to keep the core very small, and make optional features 
> pluggable. This should simplify testing, allow to create a very small 
> Android version, and allow to write a C or C++ version of (just) the core 
> if needed. Pluggable features are for example caching, data types / 
> serialization, file system implementations (with things like encryption), 
> the compression algorithm, alternative tree implementations such as an 
> R-tree to index spatial data. Other components will be implemented on top 
> of the storage, for example SQL and JDBC support, clustering (replication / 
> sharding), a network server.
>
> == Storage and In-Memory Architecture ==
>
> The basic storage architecture is a log structured storage [2] combined 
> with a partially persistent in-memory data structure [3]. The 'partially 
> persistent' doesn't mean only partially stored to disk, but partially 
> immutable (this is a bit confusing I know, I suggest to read the Wikipedia 
> article). For speed, in-memory structures are only copied (COW, 
> copy-on-write) when the version is incremented.
>
> Each MVStore supports multiple maps that are stored as B-trees [4]. For 
> spatial queries, there is also a (currently incomplete) R-tree [5] 
> implementation. An MVStore can be in-memory and on disk.
>
> Data is stored in one file by default, in the form of chunks. A chunk 
> contains a chunk header and a list of pages, stored next to each other. 
> Plus there is the file header that points to the latest chunk. The file 
> format is quite a bit simpler than the current 'page store'. The MVStore 
> doesn't use in-place updates, and there should be no need for fsync [9], 
> which is the current bottleneck for writes. Pages are variable size, and 
> they are stored next to each other (no empty space), this alone should 
> improve write performance. Also, each page can optionally be compressed 
> (LZF by default), which should further increase data density. But multiple 
> versions of the same data are stored sometimes, so disk space usage might 
> be a bit higher than now, specially if there are many updates. There is a 
> feature to compact the file (garbage collection and defragmentation), which 
> may run in the background. A detail: truncating and dropping tables (and in 
> the future possibly also range deletes) are faster than they are now as the 
> leaf pages don't need to be read (around 90% are leaf pages, so it should 
> be about 10 times faster). Other operations such as range counting and skip 
> should also get faster, because it is a counted B-tree [10] (that's 
> possible because pages are copy-on-write [11]).
>
> The MVStore is a bit similar to leveldb. As far as I know, leveldb is log 
> structured merge tree (LSM) and sometimes uses a lot of files, while the 
> MVStore only uses one file (by default) and doesn't copy data unless really 
> required.
>
> The record size limit is the available memory. For large binaries I plan 
> to write a mechanism that splits them into blocks (not started yet), a bit 
> similar to how H2 works now, but maybe it will be a content addressable 
> storage [7] internally.
>
> For each store, there is a metadata map that contains chunk info and 
> information about what maps are available and where they are stored (where 
> the root page is). Other than that, the metadata map is really just like 
> any other map.
>
> To cache pages, a LIRS cache [6] is used currently (but I want to make it 
> pluggable).
>
> For low level file access, the same file system abstraction will be used 
> as of now. There is no .lock.db file, instead regular file locking will be 
> used by default.
>
> The main advantages compared to the current page store are: simpler file 
> format (no low-level transaction log, no in-place updates), simpler page 
> format (no overflow pages),  simpler rollback, no sync required, native 
> MVCC using COW (copy on write). I'm not quite sure yet what are the 
> disadvantages are, I guess there are a few. One is: the new storage engine 
> will temporarily need more disk space if you update a lot of data, but I 
> think that's acceptable. Reading isn't aligned to fixed page sizes, not 
> sure if that's a problem (I don't think it is). A transaction log will most 
> likely be needed, at least in some cases (two-phase commit for example), my 
> current plan is to implement it in the form of a map.
>
> == Concurrency ==
>
> The combination of log structured storage and persistent data structure 
> allows to read old versions of the data without blocking other readers or 
> writers. This includes reading very old data, unless it was explicitly 
> overwritten. Rollback is fast. This simplifies recovery a lot and should 
> simplify MVCC [8], which will probably be the default isolation level in 
> the future (as in PostgreSQL, most likely there will be row level locking 
> for writes).
>
> Concurrent writes to the head of the same map are not supported, because 
> that would make things slower. If you need it, you need to either split the 
> map or write in a synchronized way. I'm quite sure some people will want a 
> true ConcurrentMap, but I will not compromise performance for others. The 
> more likely case in my view is concurrent read (iterate) and write, and 
> this is supported: read from an older version. That way readers and writers 
> are isolated.
>
> == SQL / JDBC ==
>
> When used within the database engine, most likely there will be one map 
> per table, and one map per index. For a table, the key is the unique row id 
> (always a long, like now - if the primary key is an int or long, then this 
> is the key of the data map). For an indexes, the key of the map is index 
> columns (combined) for unique indexes, plus the row id for non-unique 
> indexes, and the value is the row id (no value at all for non-unique 
> indexes).
>
> Changing the SQL engine to fully use the MVStore will need some time, 
> maybe a year or so (that's hard to say).
>
> == Performance ==
>
> Performance: according to the current benchmarks, the MVStore is much 
> faster than the current page store, for multiple reasons. I believe on 
> Android, it will be faster than SQLite (but I didn't test recently), at 
> least when using the Java API.
>
> == Code ==
>
> The code is currently kept in the following directories (that will 
> change): src/tools/org/h2/dev/store/btree (implementation) and 
> src/test/org/h2/test/store (test and extensions). 
>
> [1] http://github.com/jankotek/JDBM3
> [2] 
> http://blog.notdot.net/2009/12/Damn-Cool-Algorithms-Log-structured-storage
> [3] http://en.wikipedia.org/wiki/Persistent_data_structure
> [4] http://en.wikipedia.org/wiki/B-tree
> [5] http://en.wikipedia.org/wiki/R-tree
> [6] 
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-02-6.pdf
> [7] http://en.wikipedia.org/wiki/Content-addressable_storage
> [8] http://en.wikipedia.org/wiki/Multiversion_concurrency_control
> [9] http://en.wikipedia.org/wiki/Fsync
> [10] http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html
> [11] http://en.wikipedia.org/wiki/Copy-on-write
>
> Regards,
> Thomas
>

-- 
You received this message because you are subscribed to the Google Groups "H2 
Database" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/h2-database/-/z-VeA2HSlgMJ.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/h2-database?hl=en.

Reply via email to