On 6/9/12 11:46 PM, Selcuk AYA wrote:
Let's say we sacrifice cross-partition txns. I think that is OK.
It's not a sacrifice. I see it as a decision to postpone it for the moment.
On Sat, Jun 9, 2012 at 10:45 PM, Howard Chu <[email protected]> wrote:
Emmanuel Lécharny wrote:
Hi guys,
independently from the ongoing work on the txn layer, I'd like to start
a thread of discussion about the path we selected, and the other
possible options.
Feel free to express your opinion here; I'll list a few items I'd like
to see debated.
1) Introduction
We badly need to have a consistent system. The fact is that the current
trunk - and I guess this is true for all the releases we have done so
far - suffers from a serious issue when multiple modifications are
done during searches. The reason is that we depend on a BTree
implementation that exposes a data structure directly reading the pages
containing the data, expecting those pages to remain unchanged in the
long run. Obviously, when we browse more than one entry, we are likely to
see a modification changing the data...
2) txn layer
There are a few ways to get this problem solved:
- we can have an MVCC backend, and a protection against concurrent
modifications. Any read will always succeed, as each read will use one
and only one revision.
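The pinned-revision idea behind this first option can be sketched in a few lines. This is a minimal illustration, not ApacheDS or JDBM code; all class and method names here are hypothetical, and writes are assumed to be serialized as discussed later in the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a per-key multi-version store: each write creates
// a new revision, and a reader pins one revision for its whole search.
class MvccStore {
    // key -> (revision -> value at that revision)
    private final Map<String, TreeMap<Long, String>> versions = new HashMap<>();
    private final AtomicLong currentRevision = new AtomicLong(0);

    // Writes are assumed to be serialized (one writer at a time).
    synchronized long put(String key, String value) {
        long rev = currentRevision.incrementAndGet();
        versions.computeIfAbsent(key, k -> new TreeMap<>()).put(rev, value);
        return rev;
    }

    // A reader uses the revision it pinned when its search started: it sees
    // the newest value with revision <= pinnedRevision, so later concurrent
    // writes cannot change what it observes.
    synchronized String get(String key, long pinnedRevision) {
        TreeMap<Long, String> history = versions.get(key);
        if (history == null) return null;
        Map.Entry<Long, String> e = history.floorEntry(pinnedRevision);
        return e == null ? null : e.getValue();
    }

    long pin() { return currentRevision.get(); }
}

public class MvccDemo {
    public static void main(String[] args) {
        MvccStore store = new MvccStore();
        store.put("cn=alice", "v1");
        long pinned = store.pin();         // a search pins this revision
        store.put("cn=alice", "v2");       // concurrent modification
        System.out.println(store.get("cn=alice", pinned));      // v1
        System.out.println(store.get("cn=alice", store.pin())); // v2
    }
}
```

The point of the sketch is only that a pinned reader and a new reader can both succeed while seeing internally consistent, different revisions.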
Let's say we want to implement a txn system within JDBM. We have to
implement this not within a single B+ tree but across B+ trees.
Yes. But that does not really matter, as long as two modifications can't
occur concurrently.
How
will this be different from what we are trying to implement now? We
still need a WAL log keeping track of txns on top of B+ trees; changes
could be tracked in terms of pages, or of entries and indices. The old
version of the data has to be copied over to some other location before
the newer version can overwrite it, or the newer version has to be kept
at location X as long as readers need the old data. Any MVCC system has
to do something like this.
No, we don't need all this mechanism if we block all the modifications
while a modification is being processed. I agree that modifications will
be slower, but this is a price I'm willing to pay if, at the same time, I
can guarantee consistent *and* concurrent reads.
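The "serialize all modifications" part of this proposal amounts to one exclusive lock covering every tree of a partition, so that the master table and its indices are never observed half-updated. A rough sketch, with hypothetical names (this is not the actual partition code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: one exclusive lock serializes all modifications across the
// partition's trees. Readers are assumed to use MVCC revisions and thus
// never need this lock.
class Partition {
    private final ReentrantLock writeLock = new ReentrantLock();
    final Map<String, String> master = new HashMap<>();       // id -> entry (simplified)
    final Map<String, Set<String>> cnIndex = new HashMap<>(); // cn value -> ids

    void add(String id, String cn) {
        writeLock.lock();
        try {
            // Both trees are updated under the same lock, so the next
            // writer always sees them in a mutually consistent state.
            master.put(id, cn);
            cnIndex.computeIfAbsent(cn, k -> new HashSet<>()).add(id);
        } finally {
            writeLock.unlock();
        }
    }
}

public class LockDemo {
    public static void main(String[] args) {
        Partition p = new Partition();
        p.add("entry-1", "alice");
        System.out.println(p.cnIndex.get("alice")); // [entry-1]
    }
}
```

Writes are indeed slower under a single lock, but cross-tree consistency within the partition comes for free.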
For us, the newer version of the data is kept in the WAL as long as a
reader needs the old version. For simplicity, we keep a copy of the WAL
in memory, in a format that makes merging data for readers easier and
faster. More on this below.
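The merge described here can be pictured as a read view that shadows the base partition with the txn's logged changes. A minimal sketch under that assumption (names hypothetical, not the actual txn-branch classes):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "merge from the in-memory WAL copy" idea:
// a reader first consults the changes logged by the txn, and falls back
// to the underlying partition for anything unchanged.
class TxnReadView {
    // id -> new value logged in the WAL (a null value would mean "deleted")
    private final Map<String, String> walChanges = new HashMap<>();
    private final Map<String, String> base; // underlying partition data

    TxnReadView(Map<String, String> base) { this.base = base; }

    void logChange(String id, String newValue) { walChanges.put(id, newValue); }

    String read(String id) {
        // WAL changes shadow the base data during the merge.
        if (walChanges.containsKey(id)) return walChanges.get(id);
        return base.get(id);
    }
}

public class MergeDemo {
    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("1", "cn=alice (old)");
        TxnReadView view = new TxnReadView(base);
        view.logChange("1", "cn=alice (new)");
        System.out.println(view.read("1")); // cn=alice (new)
        System.out.println(view.read("2")); // null
    }
}
```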
I think what we implement right now is not very different from what we
would implement inside a single partition.
With a single partition, I don't need to keep anything in memory,
assuming I serialize the modifications.
- we can also read the results quickly and store them somewhere, blocking
the modification until the read is finished.
- or we can keep a copy of the modified elements alongside the original
elements, until the searches that use those elements are finished.
(there are probably some other solutions, but I don't know them)
AFAICT, the transaction branch is implementing the third solution,
keeping the copy of modified elements in memory so that they can be
sent back to the user.
It is true that the current txn system makes use of in-memory copies
for fast merging of data. However, what it really does is just keep
a copy of the txn WAL log in memory. This could be extended to discard
the in-memory copy and read directly from the WAL when memory exceeds
some threshold, for example. Implementing reads from memory was just
easier.
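That threshold idea could look something like the sketch below; it is only an illustration of the fallback path, with an assumed memory budget and hypothetical names (the real WAL is a file, not a list):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: keep the txn's WAL copy in memory for fast merges, but discard
// it and re-read from the on-disk log once it grows past a budget.
class WalCache {
    private static final int MAX_IN_MEMORY_RECORDS = 1000; // assumed budget
    private final List<String> onDiskLog = new ArrayList<>(); // stands in for the WAL file
    private Map<Integer, String> inMemory = new HashMap<>();  // record offset -> payload

    void append(int offset, String record) {
        onDiskLog.add(record);                 // the WAL itself is always written
        if (inMemory != null) {
            inMemory.put(offset, record);
            if (inMemory.size() > MAX_IN_MEMORY_RECORDS) {
                inMemory = null;               // past the budget: drop the fast copy
            }
        }
    }

    String read(int offset) {
        if (inMemory != null) return inMemory.get(offset);
        return onDiskLog.get(offset);          // slower path: read from the log
    }

    boolean usingMemory() { return inMemory != null; }
}

public class WalDemo {
    public static void main(String[] args) {
        WalCache cache = new WalCache();
        for (int i = 0; i <= 1000; i++) cache.append(i, "record-" + i);
        System.out.println(cache.usingMemory()); // false: copy was discarded
        System.out.println(cache.read(42));      // record-42, served from the log
    }
}
```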
Also think of adding another partition tomorrow. Say an HBASE partition
is added which exposes atomic writes and atomic reads or consistent
scans. If we plug that partition into what we are implementing right
now, txns over HBASE partitions would just work without much effort.
Yes. What you have written is also a way to keep partitions dumb. What
I'm suggesting forces you to have MVCC-capable partitions, which is a
real hassle. Now, let's face it: do we need anything else, atm? Plus
HBase already implements a similar system to protect reads against
concurrent modifications, so we don't necessarily need to have it.
Also keep in mind that if we want to implement the solution I proposed,
we still need to modify the code to protect the partitions against
concurrent modifications, and to leverage the MVCC parts in JDBM (and
probably write the versions to disk too).
--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com