Hi James,

I had some first thoughts on possible optimizations for the increasing file size problem, and I may have found a fairly easy solution that covers some of the current problems. It's not implemented yet, but I could at least fix the initial 4096 byte problem [1].
I'll keep you updated,
Christian

[1] https://github.com/BaseXdb/basex/issues/970

On Sat, Jul 19, 2014 at 12:06 PM, Christian Grün <christian.gr...@gmail.com> wrote:
> Hi James,
>
>> However the behaviour is different when using db:replace. I think it's doing
>> a db:delete() and then a db:add(). So first the index file has the ID list
>> for that attribute value rewritten in place (so the count will go from 2048
>> to 2047 for example) with a new value for count and just the remaining IDs
>> once the document being replaced is removed. The now unused bytes at the end
>> are left with their previous values. Then a completely new ID list is
>> written to the end of the file (now with the count back up to 2048 for
>> example) as the replacement attribute is added.
>
> That's a good hint, and (as you already guessed) it's due to the
> current semantics of our replace operation [1]. As a replaced document
> may contain a completely different structure and contents, it would
> probably be tricky to replace ID lists on a lower level (instead of
> deleting and adding them). One plan to solve the issues could be a
> data structure that remembers free slots in the heap file, which can
> later be filled up with new entries.
>
>
>> [As a note: there seems to be a small bug when UPDINDEX is true in that a
>> index file is always at least 4096 bytes. When an empty database is created
>> the index file will be 4096 zero bytes with updates appended to the end.
>> Even if you optimize the file will be padded to 4096 bytes with zeros.]
>
> Thanks, I will remember that. Maybe the minimum of 4096 bytes will
> stay, but it should definitely be overwritten from the very beginning
> when new data is inserted.
>
>
>> I'd love to be able to do everything with UPDINDEX set to true and just
>> forget about it.
>
> Me too ;) Let's see when it can be done.
>
>
>> How fixed is the index file format? I ask because I've spent some time
>> understanding how it works so I can read the files and see exactly what's in
>> them. If it would be useful then I'm happy to put the information into the
>> wiki somewhere to make it quicker for anyone else who's interested. However
>> if you want to keep the structure obscure for any reason then I won't
>> publish anything. Let me know.
>
> Thanks, contributions like that are always appreciated! The storage
> structure is supposed to be open to everyone. I guess you have already
> stumbled upon [3] and [4]; all edits are welcome, and may motivate
> others to think about better solutions.
>
> Christian
>
> [1]
> https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/query/func/FNDb.java#L577-L608
> [2] https://github.com/BaseXdb/basex/issues/970
> [3] http://docs.basex.org/wiki/Storage_Layout
> [4] http://docs.basex.org/wiki/Node_Storage
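
To make the "data structure that remembers free slots in the heap file" idea from the quoted reply a bit more concrete, here is a minimal sketch in Python. It is not BaseX code and does not reflect the actual index file format: the class name `FreeSlots`, its methods, and the byte sizes are all hypothetical. It only assumes that an entry is a run of bytes at some offset, that shrinking or deleting an entry frees a byte range, and that a new entry should reuse such a range before the file is extended.

```python
import bisect


class FreeSlots:
    """Tracks unused byte ranges in a heap file (hypothetical, not BaseX code)."""

    def __init__(self):
        # Sorted list of (size, offset) pairs, smallest slot first.
        self._slots = []

    def release(self, offset, size):
        """Remember that `size` bytes starting at `offset` are unused."""
        if size > 0:
            bisect.insort(self._slots, (size, offset))

    def allocate(self, size, end_of_file):
        """Return (offset, new_end_of_file) for an entry of `size` bytes.

        Reuses the smallest free slot that is large enough (best fit).
        If no slot fits, the entry is appended at the end of the file.
        """
        # Offsets are never negative, so (size, -1) sorts before any
        # real slot of the same size.
        i = bisect.bisect_left(self._slots, (size, -1))
        if i < len(self._slots):
            slot_size, offset = self._slots.pop(i)
            # Hand the unused tail of the slot back to the free list.
            self.release(offset + size, slot_size - size)
            return offset, end_of_file
        return end_of_file, end_of_file + size


# Assumed scenario: a 100-byte entry at offset 200 was deleted
# from a file that is currently 1000 bytes long.
slots = FreeSlots()
slots.release(200, 100)

first = slots.allocate(60, 1000)       # fits into the freed range
second = slots.allocate(50, first[1])  # remaining 40 bytes are too small
third = slots.allocate(40, second[1])  # exactly fills the remaining gap
```

In this scenario only the second entry makes the file grow; the other two land in the freed range. The sketch leaves out things a real implementation would need, such as merging adjacent free ranges and persisting the free list across sessions.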