On Sun, Jun 10, 2012 at 3:00 AM, Emmanuel Lécharny <[email protected]> wrote:
> On 6/9/12 11:46 PM, Selcuk AYA wrote:
>
>> Let's say we sacrifice cross-partition txns. I think that is OK.
>
> It's not a sacrifice. I see it as a decision to postpone it atm.
>
>> On Sat, Jun 9, 2012 at 10:45 PM, Howard Chu <[email protected]> wrote:
>>>
>>> Emmanuel Lécharny wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> independently from the ongoing work on the txn layer, I'd like to start
>>>> a thread of discussion about the path we selected, and the other
>>>> possible options.
>>>>
>>>> Feel free to express your opinion here, I'll create a few items I'd like
>>>> to see debated.
>>>>
>>>> 1) Introduction
>>>>
>>>> We badly need to have a consistent system. The fact is that the current
>>>> trunk (and I guess this is true for all the releases we have done so
>>>> far) suffers from some serious issues when multiple modifications are
>>>> done during searches. The reason is that we depend on a BTree
>>>> implementation that exposes a data structure directly reading the pages
>>>> containing the data, expecting those pages to remain unchanged in the
>>>> long run. Obviously, when we browse more than one entry, we are likely to
>>>> see a modification changing the data...
>>>>
>>>> 2) txn layer
>>>>
>>>> There are a few ways to get this problem solved:
>>>> - we can have an MVCC backend, and a protection against concurrent
>>>> modifications. Any read will always succeed, as each read will use a
>>>> revision and only one.
>>
>> Let's say we want to implement a txn system within JDBM. We have to
>> implement this not within a single B+ tree but across B+ trees.
>
> Yes. But that does not really matter, as long as two modifications can't
> occur concurrently.
>
>> How will this be different from what we are trying to implement now? We
>> still need a WAL log keeping track of txns on top of B+ trees; changes
>> could be tracked in terms of pages or entries and indices.
>> Old version of data has to be copied over to some other location before
>> newer version can overwrite it, or newer version has to be kept at
>> location X as long as readers need the old data. Any MVCC system has
>> to do something like this.
>
> No, we don't need all this mechanism if we block all other modifications while
> a modification is being processed. I agree that modifications will be
> slower, but this is a price I am willing to pay if, at the same time, I can
> guarantee consistent *and* concurrent reads.
>
If you have a single modification that touches a couple of entries and indices, how will reads proceed concurrently if the ongoing modification does not take care not to overwrite the versions the reads are using?

>> For us, newer version of data is kept in the WAL as long as a reader needs
>> the old version of data. As explained below, for simplicity we keep a
>> copy of the WAL in memory in a format that makes merging data for readers
>> easier and faster. More on this below.
>>
>> I think what we implement right now is not very different from what we
>> would implement inside a single partition.
>
> With a single partition, I don't need to keep anything in memory, assuming I
> serialize the modifications.
>
>>>> - we can also read the results fast and store them somewhere, blocking
>>>> the modification until the read is finished.
>>>> - or we can keep a copy of the modified elements within the original
>>>> elements, until the searches that use those elements are finished.
>>>>
>>>> (there are probably some other solutions, but I don't know them)
>>>>
>>>> AFAICT, the transaction branch is implementing the third solution,
>>>> keeping the copy of modified elements in memory, so that they can be
>>>> sent back to the user.
>>
>> It is true that the current txn system makes use of in-memory copies
>> for fast merge of data. However, what it really does is just keep
>> a copy of the txn WAL log in memory. This can be extended to discard the
>> in-memory copy and read directly from the WAL when memory exceeds some
>> threshold, for example. Implementing read from memory was just easier.
>>
>> Also think of adding another partition tomorrow. Say an HBase partition
>> is added which exposes atomic writes and atomic reads or scan-consistent
>> scans. If we plug that partition into what we are
>> implementing right now, txns over HBase partitions would just work
>> without much effort.
>
> Yes. What you have written is also a way to keep partitions dumb.
> What I'm suggesting forces you to have MVCC-capable partitions, which is a real
> hassle. Now, let's face it: do we need anything else, atm? Plus, HBase
> already implements a similar system to protect reads against concurrent
> modifications, so we don't necessarily need to have it.
> Also keep in mind that if we want to implement the solution I proposed, we
> still need to modify the code to protect the partitions against concurrent
> modifications, and to leverage the MVCC parts in JDBM (and probably write
> the versions on disk too).
>
No. HBase is not transactional. You still need transactions to make queries consistent.

> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
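
To make the WAL-based approach described in this thread more concrete (newer versions stay in the WAL, and readers merge them with the stored data, for as long as some reader still needs the old version), here is a minimal sketch. It is only an illustration, not code from the transaction branch or from JDBM: the class `WalOverlayStore` and all of its method names are hypothetical, a plain dict stands in for the on-disk B+ trees, writers are assumed to be serialized, and locking is left out entirely.

```python
from typing import Dict, List, Optional, Tuple


class WalOverlayStore:
    """Hypothetical store: base data plus an in-memory copy of the WAL."""

    def __init__(self) -> None:
        # Stand-in for the on-disk B+ trees (entries and indices alike).
        self.base: Dict[str, str] = {}
        # In-memory WAL in commit order: (version, key, value);
        # a value of None records a deletion.
        self.wal: List[Tuple[int, str, Optional[str]]] = []
        # Version of the last committed txn.
        self.committed = 0
        # Snapshot versions of the readers that are still running.
        self.active: List[int] = []

    def commit(self, changes: Dict[str, Optional[str]]) -> int:
        """Log one txn; all of its changes share a single version."""
        version = self.committed + 1
        for key, value in changes.items():
            self.wal.append((version, key, value))
        self.committed = version  # publish only after logging everything
        return version

    def begin_read(self) -> int:
        """Register a reader and return the snapshot version it will use."""
        self.active.append(self.committed)
        return self.committed

    def end_read(self, snapshot: int) -> None:
        """Unregister a reader that used the given snapshot version."""
        self.active.remove(snapshot)

    def read(self, snapshot: int, key: str) -> Optional[str]:
        """Merge the WAL and the base data as seen at `snapshot`."""
        # Newest WAL entry visible to this reader wins.
        for version, logged_key, value in reversed(self.wal):
            if logged_key == key and version <= snapshot:
                return value
        # Otherwise fall back to the base data.
        return self.base.get(key)

    def flush(self) -> None:
        """Apply WAL entries to the base data once no reader needs the old data."""
        # Entries newer than the oldest active snapshot must stay in the WAL,
        # otherwise that reader would see them through the base data.
        limit = min(self.active, default=self.committed)
        keep: List[Tuple[int, str, Optional[str]]] = []
        for version, key, value in self.wal:
            if version > limit:
                keep.append((version, key, value))
            elif value is None:
                self.base.pop(key, None)
            else:
                self.base[key] = value
        self.wal = keep
```

In this sketch a reader that called `begin_read()` before a later `commit()` keeps seeing the old values even after `flush()`, because the newer entries stay in the WAL until `end_read()` releases that snapshot. A real implementation would also need durability, locking, and cursor support, which the thread discusses but this sketch does not attempt.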
