Re: [sqlite] Concurrency, MVCC

2004-04-16 Thread Christian Smith
On Thu, 15 Apr 2004, Doug Currie wrote:

>
>Thursday, April 15, 2004, 9:16:01 AM, Christian Smith wrote:
>
>> On Wed, 14 Apr 2004, Doug Currie wrote:
>
>>>One way to get table level locking without a great deal of pain is to
>>>integrate the shadow paging ideas with BTree management. Rather than
>>>using page tables for the shadow pages, use the BTrees themselves.
>>>This means that any change to a BTree requires changes along the
>>>entire path back to the root so that only free pages are used to store
>>>new data, including the BTree itself. Writing the root page(s) of the
>>>BTree(s) commits the changes to that table (these tables).
>
>> Actually, this gets my vote. Keeps the pager layer the same,
>
>The pager gets *much* simpler because it doesn't need to make a log
>file. The log file is not necessary because writes only go to free
>pages.
>
>Well, there would be one write-ahead log. It's needed to prevent
>partial updates to the page number pointers to the root page(s) of
>the BTree(s) at commit. This log is created at commit time, and is
>much simpler and much smaller than the present log file.

I'd have thought it'd be better to preserve the pager layer as is. If it
ain't broke...


> [...]
>
>This design works well. It has the advantage (compared with shadow
>pager) that reads are not burdened with page table indirection. It has
>the potential disadvantage (compared with SQLite 2.8) that small
>writes can modify several pages (based on the depth of the BTree).

So for reads there is basically no extra burden (other than caching the
initial tree roots), and writing will be slightly slower, but with a
penalty that decreases as updates get bigger and is probably
insignificant against the dumping of the page cache when transactions
finish. All of this happens in parallel with reads, so overall
performance should improve in many scenarios.

It would of course be limited, like shadow paging, to a single address
space (writes would block reads in other address spaces.)

>
>I used this design in a proprietary database in the late 1980s. The
>only reason I didn't consider modifying SQLite this way up until now
>is that I was anticipating BTree changes for 3.0, so I confined my
>efforts to the pager layer.

Given this design, if it is adopted, it would also be trivial (and free in
terms of IO) to maintain a running total of records in a given btree as
well, as was requested some weeks back, as any new/deleted records would
update the btree to the root anyway.
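The "free" running total falls out naturally: every insert or delete
already rewrites the path back to the root, so each copied node can
refresh a subtree record count at no extra I/O cost. A minimal Python
sketch with a toy copy-on-write binary tree (not SQLite's actual BTree
code; all names here are illustrative):

```python
# Toy copy-on-write tree where each node caches the size of its
# subtree. Since an insert copies every node on the root path anyway,
# the counts are updated "for free" as part of the same copies.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        # children are immutable, so the count can be fixed at build time
        self.count = 1 + count(left) + count(right)

def count(node):
    return node.count if node is not None else 0

def insert(root, key):
    """Return a NEW root; untouched subtrees are shared, path nodes copied."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    if key > root.key:
        return Node(root.key, root.left, insert(root.right, key))
    return root  # duplicate key: tree unchanged

v1 = insert(insert(insert(None, 2), 1), 3)
v2 = insert(v1, 4)   # new snapshot; v1's count is untouched
```

Reading the record total is then just `count(root)`, one field on the
root page, with the old snapshot's total still intact.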

Is this design feasible given the time constraints on 3.0? I've not
studied the btree layer in much detail, so don't know how much existing
code would need to change.

Christian

-- 
/"\
\ /ASCII RIBBON CAMPAIGN - AGAINST HTML MAIL
 X   - AGAINST MS ATTACHMENTS
/ \

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [sqlite] Concurrency, MVCC

2004-04-15 Thread Mark D. Anderson

On Thu, 15 Apr 2004 20:16:32 -0400, "Doug Currie" <[EMAIL PROTECTED]> said:

> I used this design in a proprietary database in the late 1980s. The
> only reason I didn't consider modifying SQLite this way up until now
> is that I was anticipating BTree changes for 3.0, so I confined my
> efforts to the pager layer.

btw, another example of this class of approach, with a bsd-style
license, is GigaBASE from the prolific Konstantin Knizhnik:
   http://www.garret.ru/~knizhnik/gigabase/GigaBASE.htm

It does not offer anything approaching full SQL.
It does however have several features not available in sqlite:
- online backup [1]
- master-slave replication
- group commit [2]
- parallel query (multiple threads for full table scans)

[1] There is a kind of support in sqlite for online backup, via
   echo '.dump' | sqlite ex1 > backup.sql
though this would result in a largish file and would block
everything else.

[2] Grouping commits is a mechanism that allows for pending transactions 
to get fsync'd together.
This allows for greater performance with a risk only of losing
some transactions (at most the size of the group), but not
greater risk of a corrupted database.
This is more flexibility than sqlite's big knob of OFF/NORMAL/FULL.
It is also offered by DB2, Oracle, and MySQL.
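For illustration, group commit can be sketched as below; the class and
method names are invented for this example and do not correspond to any
real GigaBASE or sqlite API. The point is that many pending commits
share one fsync, so durability cost is amortized across the group:

```python
# Sketch of group commit: transactions append records to a shared
# buffer, and a single flush/fsync makes the whole batch durable.
# Losing power before the flush loses at most one group of
# transactions, but never corrupts what was already synced.

import os

class GroupCommitLog:
    def __init__(self, path):
        self.f = open(path, "ab")
        self.pending = []            # commit records awaiting a flush

    def commit(self, record: bytes):
        self.pending.append(record)  # durable only after the next flush

    def flush(self):
        for rec in self.pending:
            self.f.write(rec + b"\n")
        self.f.flush()
        os.fsync(self.f.fileno())    # ONE fsync covers the whole group
        n, self.pending = len(self.pending), []
        return n
```

A background thread (or the last committer in the group) would call
flush() on a short timer or once the batch reaches a size threshold.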



In idle moments I've toyed with what it would take to splice
GigaBASE with the query parser and planner from
lambda-db or sqlite.
But then I wake up...

-mda




Re: [sqlite] Concurrency, MVCC

2004-04-15 Thread Doug Currie

Thursday, April 15, 2004, 9:16:01 AM, Christian Smith wrote:

> On Wed, 14 Apr 2004, Doug Currie wrote:

>>One way to get table level locking without a great deal of pain is to
>>integrate the shadow paging ideas with BTree management. Rather than
>>using page tables for the shadow pages, use the BTrees themselves.
>>This means that any change to a BTree requires changes along the
>>entire path back to the root so that only free pages are used to store
>>new data, including the BTree itself. Writing the root page(s) of the
>>BTree(s) commits the changes to that table (these tables).

> Actually, this gets my vote. Keeps the pager layer the same,

The pager gets *much* simpler because it doesn't need to make a log
file. The log file is not necessary because writes only go to free
pages.

Well, there would be one write-ahead log. It's needed to prevent
partial updates to the page number pointers to the root page(s) of
the BTree(s) at commit. This log is created at commit time, and is
much simpler and much smaller than the present log file.
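That commit-time log can be sketched as follows; this is a hedged
illustration with an assumed JSON catalog layout and invented function
names, not SQLite's actual file format. The only thing that must change
atomically is the table of root page numbers, so the log records the
new roots, is fsync'd, and only then is the catalog overwritten;
recovery replays the log if a crash interrupted the catalog write:

```python
# Tiny write-ahead log for the root-pointer table. new_roots maps
# BTree name -> root page number (assumed representation).

import json, os

def commit_roots(catalog_path, log_path, new_roots):
    with open(log_path, "w") as log:       # 1. log the new roots first
        json.dump(new_roots, log)
        log.flush()
        os.fsync(log.fileno())
    with open(catalog_path, "w") as cat:   # 2. apply to the catalog
        json.dump(new_roots, cat)
        cat.flush()
        os.fsync(cat.fileno())
    os.remove(log_path)                    # 3. commit is complete

def recover(catalog_path, log_path):
    if os.path.exists(log_path):           # crashed between 1 and 3: replay
        with open(log_path) as log:
            roots = json.load(log)
        with open(catalog_path, "w") as cat:
            json.dump(roots, cat)
        os.remove(log_path)
    with open(catalog_path) as cat:
        return json.load(cat)
```

The log lives only for the instant of commit and holds one small
record, which is why it is so much cheaper than a full rollback journal
of every modified page.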

> and only requires a cache of the root btree for each object
> (table/index) in the database to be maintained on a per-transaction
> basis

Yes, you need to cache the page number of each BTree root at
transaction start.

You'd also need a forest of free pages organized by transaction so
they can be freed at the right time (when the oldest read-transaction
that can reference them has completed).
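A sketch of that per-transaction free-page bookkeeping, with names
invented for illustration: pages superseded by a writer cannot be
reused until every reader old enough to reference them has finished.

```python
# Pages retired by each writing transaction are parked under that
# transaction's id, and reclaimed only once the oldest live reader
# postdates the writer that retired them.

class FreePageForest:
    def __init__(self):
        self.by_txn = {}   # writer txn id -> pages it superseded

    def retire(self, txn_id, pages):
        self.by_txn.setdefault(txn_id, []).extend(pages)

    def reclaim(self, oldest_live_reader):
        """Free pages whose retiring writer precedes every live reader."""
        freed = []
        for txn in sorted(t for t in self.by_txn if t < oldest_live_reader):
            freed.extend(self.by_txn.pop(txn))
        return freed
```

So a long-running read transaction simply pins old pages in the forest;
nothing is overwritten under it.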

> , reducing the complications of what to do under memory pressure
> when pages are spilled from the cache as we should be able to keep
> them in memory all the time.

Yes.

> Committing of a transaction would then be an atomic update of the
> root btree page number in the catalog table.

Yes, atomically for all the BTrees modified. This is probably a single
page of data (4 to 8 bytes of root page number per BTree, i.e., per
table and per index). Well, I usually assume fairly large pages
compared with SQLite's default of 1K. Using larger pages also
decreases the depth of the BTree which reduces the number of pages
written.

This design works well. It has the advantage (compared with shadow
pager) that reads are not burdened with page table indirection. It has
the potential disadvantage (compared with SQLite 2.8) that small
writes can modify several pages (based on the depth of the BTree).

I used this design in a proprietary database in the late 1980s. The
only reason I didn't consider modifying SQLite this way up until now
is that I was anticipating BTree changes for 3.0, so I confined my
efforts to the pager layer.

e





Re: [sqlite] Concurrency, MVCC

2004-04-15 Thread Andrew Piskorski
On Thu, Apr 15, 2004 at 02:16:01PM +0100, Christian Smith wrote:
> Right tool for the job. Multiple writers has client/server database
> written all over it. KISS.

No, not true, at least not when the multiple writers are all threads
within one single process, which appears to be the common case for
people who'd like greater concurrency in SQLite.

Also, if multiple writers worked well for the one-process many-threads
case, then if you wished you could write a small multi-threaded
client/server database using SQLite as the underlying storage engine.
As things stand now, the concurrency limitations mean there isn't much
point to doing that.

Simplicity, however, is of course an important concern.

-- 
Andrew Piskorski <[EMAIL PROTECTED]>
http://www.piskorski.com/




Re: [sqlite] Concurrency, MVCC

2004-04-15 Thread Christian Smith
On Wed, 14 Apr 2004, Doug Currie wrote:

>Wednesday, April 14, 2004, 1:16:54 AM, Andrew Piskorski wrote:
>
>> How could this be extended to support table locking and PostgreSQL's
>> default "read committed" isolation level? Would the smallest locking
>> granularity possible in Currie's design be one page of rows, however
>> many rows that happens to be?
>
>Things get *much* more complicated once you have multiple simultaneous
>write transactions. I didn't want to go there.

Right tool for the job. Multiple writers has client/server database
written all over it. KISS.

>
>One way to get table level locking without a great deal of pain is to
>integrate the shadow paging ideas with BTree management. Rather than
>using page tables for the shadow pages, use the BTrees themselves.
>This means that any change to a BTree requires changes along the
>entire path back to the root so that only free pages are used to store
>new data, including the BTree itself. Writing the root page(s) of the
>BTree(s) commits the changes to that table (these tables).

Actually, this gets my vote. Keeps the pager layer the same, and only
requires a cache of the root btree for each object (table/index) in the
database to be maintained on a per-transaction basis, reducing the
complications of what to do under memory pressure when pages are spilled
from the cache as we should be able to keep them in memory all the time.

Committing of a transaction would then be an atomic update of the root
btree page number in the catalog table.

Christian





Re: [sqlite] Concurrency, MVCC

2004-04-15 Thread Mark D. Anderson

On Wed, 14 Apr 2004 08:13:39 -0400, "D. Richard Hipp" <[EMAIL PROTECTED]>
said:

>* Support for atomic commits of multi-database transactions,
>  which gives you a limited kind of table-level locking,
>  assuming you are willing to put each table in a separate
>  database.

and also a limited form of concurrent writers, as a consequence,
right? Assuming that table locks are acquired in a consistent order
to avoid deadlock, there could be concurrent writers that do
not touch the same tables (in this database-per-table model).

btw, what about offering better behavior around throwing away
cache pages? One approach would be something like a
commit_begin() function, which is offered by some rdbms native
APIs. It says "commit what I've done, but at the same time
attempt to acquire the write lock."
Failure to "win" and actually retain the write lock might not
be reported -- the idea is that the application can at least
indicate its desire.

This could also be done as some sort of connection option.

So in the case that a single writer is keeping up with all
requests, it can do so efficiently without throwing away
its pages.
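The commit_begin() idea might look like the following; no such call
exists in SQLite's API, and everything here (the class, the lock
handling, the cache policy) is a hypothetical sketch of the behavior
described above:

```python
# A writer that keeps up with its queue commits and immediately races
# to re-acquire the write lock. If it wins, its page cache is still
# valid; if it loses, the pages may be stale and are dropped.

import threading

class Connection:
    def __init__(self, write_lock):
        self.write_lock = write_lock
        self.holds_write_lock = False
        self.page_cache = {}          # page number -> page bytes

    def commit_begin(self):
        # ... commit the current transaction here (elided) ...
        if self.holds_write_lock:
            self.write_lock.release()
        # race to retain the lock; never block, and failure is not an error
        self.holds_write_lock = self.write_lock.acquire(blocking=False)
        if not self.holds_write_lock:
            self.page_cache.clear()   # another writer may change pages
        return self.holds_write_lock
```

The contended case degrades to exactly today's behavior (cache thrown
away), so the option only ever helps.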

-mda




Re: [sqlite] Concurrency, MVCC

2004-04-14 Thread Mark D. Anderson
Wednesday, April 14, 2004, 1:16:54 AM, Andrew Piskorski wrote:
> as far as I can tell, it seems to be describing a system with
> the usual Oracle/PostgreSQL MVCC semantics, EXCEPT of course that
> Currie proposes that each Write transaction must take a lock on the
> database as a whole.

Well, I suppose from a sufficient distance they look alike,
but in practice MVCC and shadow paging are rather different.

In MVCC, each row typically has two hidden fields identifying the first
and last transaction ids for which the row is relevant.
The last transaction id is to skip rows that are deleted.
There are many variants of MVCC, but you get the idea.

Any reader (or writer) knows its own transaction id, and just
ignores rows that are not applicable.

A "vacuum" process is necessary to periodically reclaim space
taken by rows whose last transaction id is lower than any live
transaction.
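A minimal sketch of that visibility rule in Python, assuming simple
monotonically increasing transaction ids and ignoring in-progress or
aborted transactions (which real MVCC variants must also handle):

```python
# Each row version carries (first_txn, last_txn, data): created by
# first_txn, deleted by last_txn (INFINITY while still live). A version
# is visible to a transaction created at or after first_txn and before
# last_txn.

INFINITY = float("inf")

def visible(row, txn_id):
    first, last, _ = row
    return first <= txn_id < last

def vacuum(rows, oldest_live_txn):
    """Reclaim versions deleted before every live transaction began."""
    return [r for r in rows if r[1] > oldest_live_txn]

rows = [
    (1, 5, "old version"),        # superseded at transaction 5
    (5, INFINITY, "new version"),
]
```

Transaction 3 still sees the old version; transaction 5 and later see
only the new one, and vacuum can discard the old version once no
transaction older than 5 remains.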

In shadow paging, the basic idea is that any reader or writer
gets a view onto the data based on reachability from "pointers"
in a particular root block. Pages that are reachable from any
live root block are never modified. A vacuum process is required 
to collect the space from blocks that are no longer reachable.
Updates to indexes must be shadowed in roughly the same way as
data pages, because they contain pointers to the relocated data.
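The copy-on-write discipline at the heart of shadow paging can be
sketched with a toy binary search tree standing in for the real page
structure (an illustration only; real shadow paging works on disk
pages, not in-memory nodes):

```python
# Any update copies every node on the path from the changed leaf back
# to the root, reusing unmodified subtrees. An old root therefore
# still reaches a complete, untouched snapshot: readers holding it are
# never affected by later writes.

class Node:
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(root, key, value):
    """Return a NEW root; shared subtrees are reused, path nodes copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value,
                    insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value,
                    root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)  # overwrite in the copy

def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

v1 = insert(insert(None, 2, "b"), 1, "a")
v2 = insert(v1, 2, "B")   # commit = publish v2 as the new root
```

"Committing" is then a single pointer swap to the new root, which is
exactly why only the root-pointer update needs to be atomic.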

Shadow paging can be used for a table-based database, or
a persistent object store.
It certainly is much older than the HUT work; see for example Lorie 77,
"Physical Integrity in a Large Segmented Database."
It falls into the general class of logless transaction systems,
as opposed to the log-based approach that predominates in
current day non-academic database implementations.

-mda




Re: [sqlite] Concurrency, MVCC

2004-04-14 Thread Doug Currie
> D. Richard Hipp wrote:
>> 
>> My thoughts on BlueSky have been added to the wiki page:
>>http://www.sqlite.org/cvstrac/wiki?p=BlueSky

I added some responses; I do not agree with Richard's concerns about
Shadow Paging, and I corrected some mistaken conclusions. I apologize
if my paper was not clear enough in these areas.

Thank you, Richard, for taking the time to review the Shadow Paging
option.

Regards,

e





Re: [sqlite] Concurrency, MVCC

2004-04-14 Thread D. Richard Hipp
D. Richard Hipp wrote:
> My thoughts on BlueSky have been added to the wiki page:
>
>   http://www.sqlite.org/wiki?p=BlueSky

That URL should have been:

   http://www.sqlite.org/cvstrac/wiki?p=BlueSky

Left out the "cvstrac".  Sorry for the confusion.

--
D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565


Re: [sqlite] Concurrency, MVCC

2004-04-14 Thread D. Richard Hipp
Andrew Piskorski wrote:
> How feasible would it be to add support for higher concurrency to
> SQLite, especially via MVCC?

My thoughts on BlueSky have been added to the wiki page:

  http://www.sqlite.org/wiki?p=BlueSky

The current plan for version 3.0 is as follows:

  * Support for a modification of the Carlyle method for
allowing writes to begin while reads are still pending.
All reads must finish before the write commits, however.
  * Support for atomic commits of multi-database transactions,
which gives you a limited kind of table-level locking,
assuming you are willing to put each table in a separate
database.
Business constraints require that version 3.0 be working
no later than May 31.  So if you have any alternative
suggestions, you should get them in quickly.
--
D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565