The idea also doesn't account for the waste in obsolete b+tree nodes. Basically, it's more complicated than that.
Compaction is unavoidable with an append-only strategy. One idea I've
pitched (and frankly stolen from Berkeley JE) is for the database file to
be a series of files instead of a single one. If we track the used space
in each file, we can compact any file that drops below a threshold (by
copying the extant data to the new tail and deleting the old file). This
is still compaction, but it's no longer a wholesale rewrite of the
database. (A rough sketch of the bookkeeping is at the bottom of this
mail.)

All that said, with enough databases and some scheduling, the current
scheme is still pretty good.

B.

On Thu, Sep 23, 2010 at 5:11 PM, Paul Davis <[email protected]> wrote:
> On Thu, Sep 23, 2010 at 12:00 PM, chongqing xiao <[email protected]> wrote:
>> Hi, Paul:
>>
>> Thanks for the clarification.
>>
>> I am not sure why this is designed this way, but here is one approach
>> I think might work better.
>>
>> Instead of appending the header to the data file, why not just move
>> the header to a different file? The header file can be implemented as
>> before - 2 duplicate header blocks to keep it corruption free. For
>> performance reasons, the header file can be cached (say, using a
>> memory-mapped file).
>>
>> The reason I like this approach better is that for the application I
>> am interested in - archiving data from a relational database - the
>> saved data never changes. So if there is no wasted space for the old
>> header, there is no need to compact the database file.
>>
>> Chong
>>
>
> Writing the header to the data file means that the header is where the
> data is. I.e., if the header is there and intact, we can be reasonably
> sure that the data the header refers to is also there (barring weirdo
> filesystems like xfs). Using a second file descriptor per database is
> an increase of 100% in the number of file descriptors. This would very
> much affect people that have lots of active databases on a single
> node. I'm sure there are other reasons but I've not had anything to
> eat yet.
>
> Paul
>
>> On Thu, Sep 23, 2010 at 8:44 AM, Paul Davis <[email protected]> wrote:
>>> It's not necessarily appended each time data is written. There are
>>> optimizations to batch as many writes to the database together as
>>> possible, as well as delayed commits which write the header out
>>> every N seconds.
>>>
>>> Remember that *any* write to the database is going to look like wasted
>>> space. Even document deletes make the database file grow larger.
>>>
>>> When a header is written, it contains checksums of its contents, and
>>> when reading we check that nothing has changed. There's an fsync
>>> before and after writing the header, which also helps to ensure that
>>> writes succeed.
>>>
>>> As to the header2-or-header1 problem: if header2 appears to be
>>> corrupted or is otherwise discarded, the header search just continues
>>> through the file looking for the next valid header. In this case that
>>> would mean that newData2 would not be considered valid data and would
>>> be ignored.
>>>
>>> HTH,
>>> Paul Davis
>>>
>>> On Wed, Sep 22, 2010 at 11:51 PM, chongqing xiao <[email protected]> wrote:
>>>> Hi, Adam:
>>>>
>>>> Thanks for the answer.
>>>>
>>>> If that is how it works, that seems to create a lot of wasted space,
>>>> assuming a new header has to be appended each time new data is saved.
>>>>
>>>> Also, assuming here is the data layout:
>>>>
>>>> newData1  -> start
>>>> header1
>>>> newData2
>>>> header2   -> end
>>>>
>>>> If header2 is partially written, I am assuming newData2 will also be
>>>> discarded.
>>>> If that is the case, I am assuming there is some special flag so
>>>> the code can skip newData2 and find header1?
>>>>
>>>> I am very interested in couchdb and I think it might be a very good
>>>> choice for archiving relational data with some minor changes.
>>>>
>>>> Thanks
>>>> Chong
>>>>
>>>> On Wed, Sep 22, 2010 at 10:36 PM, Adam Kocoloski <[email protected]> wrote:
>>>>> Hi Chong, that's exactly right. Regards,
>>>>>
>>>>> Adam
>>>>>
>>>>> On Sep 22, 2010, at 10:18 PM, chongqing xiao wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Could anyone explain how write_header (or the header) works in couchdb?
>>>>>>
>>>>>> When appending a new header, I am assuming the new header will be
>>>>>> appended to the end of the DB file and the old header will be kept
>>>>>> around?
>>>>>>
>>>>>> If that is the case, what will happen if the header is partially
>>>>>> written? I am assuming the code will loop back and find the previous
>>>>>> old header and recover from there?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Chong
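
P.S. For the curious, here is roughly what the per-file bookkeeping for
the multi-file idea could look like. This is an invented Python sketch,
not anything in the tree (the real thing would be Erlang); the Segment
fields, the 40% threshold, and the caller-supplied live-record iterator
are all made up:

import os
from dataclasses import dataclass

COMPACT_THRESHOLD = 0.4  # invented: reclaim a file once <40% of it is live

@dataclass
class Segment:
    path: str
    size: int  # bytes on disk
    live: int  # bytes still reachable from the current b+tree root

def segments_to_compact(segments):
    """Closed segments whose live fraction fell below the threshold.
    The last entry is the active tail; we never compact it in place."""
    return [s for s in segments[:-1] if s.live / s.size < COMPACT_THRESHOLD]

def compact(seg, live_records, tail):
    """Copy the extant data to the new tail, then delete the old file."""
    for rec in live_records:  # caller walks the tree for reachable records
        tail.write(rec)
    tail.flush()
    os.fsync(tail.fileno())   # make the copies durable before unlinking
    os.remove(seg.path)

The point is just that reclamation becomes per-file: a mostly-dead
segment gets copied forward and unlinked without touching its healthy
neighbors.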
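And for Chong's original question, the write/recover dance Paul and Adam
describe in the quoted thread reduces to something like the following.
Again a hedged sketch, not couch_file's real on-disk format: the 4 KiB
block size, the one-byte header tag, and the md5-over-payload layout are
simplifications I've invented for illustration.

import hashlib
import os
import struct

BLOCK = 4096          # headers land on block boundaries; simplified
HEADER_TAG = b"\x01"  # sketch-only flag marking "this block holds a header"

def write_header(f, payload):
    # fsync #1: the data this header points at must hit disk first
    f.flush()
    os.fsync(f.fileno())
    end = f.seek(0, os.SEEK_END)
    f.write(b"\x00" * ((-end) % BLOCK))  # pad out to a block boundary
    f.write(HEADER_TAG
            + struct.pack(">I", len(payload))
            + hashlib.md5(payload).digest()
            + payload)
    # fsync #2: now the header itself is durable
    f.flush()
    os.fsync(f.fileno())

def find_header(f):
    """Scan backwards block by block for the newest header whose checksum
    verifies. A torn final header fails the md5 test and is skipped, which
    is why newData2 in the example layout gets ignored along with it."""
    end = f.seek(0, os.SEEK_END)
    pos = (end // BLOCK) * BLOCK
    while pos >= 0:
        f.seek(pos)
        head = f.read(1 + 4 + 16)
        if len(head) == 21 and head[:1] == HEADER_TAG:
            (size,) = struct.unpack(">I", head[1:5])
            payload = f.read(size)
            if len(payload) == size and hashlib.md5(payload).digest() == head[5:]:
                return payload
        pos -= BLOCK
    return None  # no valid header anywhere: treat the file as empty/corrupt

No special flag in header1 is needed: the scan simply keeps stepping back
until a checksum verifies, so everything after the last good header
(newData2 included) is abandoned.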
