On 22/03/11 23:03, Stephen Allen wrote:
Hi Andy,

This is rather long email I'm afraid, but I've split it into two parts: 
Parliament and TxTDB comments:

Very helpful indeed. I'll respond to the TxTDB part here -- the pArliament description is interesting and helpful as to what you plan to do and how you came to your design decisions.

TxTDB -----

It sounds like you may be describing shadow paging [1].  In shadow
paging, the basic idea is that any reader or writer gets a view onto
the data based on reachability from pointers in a particular root
block.  Pages that are reachable from any live root block are never
modified.  A vacuum process is required to collect the space from
blocks that are no longer reachable (the last reader out could do it
too).  Updates to indexes must be treated in roughly the same way as
data pages, because they contain pointers to different data.  If you
wrote each block to disk at commit time the system would belong to
the class of logless transaction systems (although a small log might
be needed to prevent partial updates to the root page pointers).
With a shadow paging system you also would not have to actually copy
the new pages back over top of the old ones.


-Stephen

[1] http://en.wikipedia.org/wiki/Shadow_paging

Sort of shadow paging - the Wikipedia page is describing shadow paging where a new page is allocated with new page id, with the need to chase back up any datastructre that refers to the page and change that (the latter could be in-place or shadow'ed recursed). It requires data structure involvement to chase the references.

I'm thinking of keeping the same page id, and the transactions have a context that enables them to go to the right place place to find a given page. The data structures referring to the block do not change; the "get block 123" step carries out the looking for the right version.

The B+Tree code is already structured around "get block", "put block" operations - one reason this new implementation was done, rather than pick up an existing one.

http://www.sqlite.org/wal.html
Section "How WAL Works"

which isn't quite the same WAL as Wikipedia either. This WAL is a redo log without change in-place as the transaction progresses.

I hope that one form of compression for new blocks is to use a block-type compressor so adding a triple to a index is not a copy of the whole block, but just "shift-insert bytes ABC at offset 123, moving up old bytes". Most changes in a B+Tree are on one block, not causing block split or rebalancing. That level of compress takes the block change down to about "insert triple" size, except there are multiple indexes. TDB does not store a distinguished triple as such although one index is considered "primary" but they are all cluster indexes at the moment. Primary is just the one the code assumes exists and falls back on although currently only total indexing is done so that never happens.

It would also be possible to reverse the whole thing - copy old data out of the way and change in-place. That puts the I/O to the database on the writer critical path, which is random access, whereas redo puts append+sync on the log as the I/O on the writer critical path.

        Andy

Reply via email to