Hi Chris, Christopher Baines <[email protected]> skribis:
> So, as I understand it, the WAL is made up of pages, and checking for > this db, I think they're the current default size of 4096 bytes. > > sqlite> PRAGMA page_size; > 4096 > > From looking at the code, the wal_autocheckpoint value is set to 40000: > > /* Increase the auto-checkpoint interval to 40000 pages. This > seems enough to ensure that instantiating the NixOS system > derivation is done in a single fsync(). */ > if (mode == "wal" && sqlite3_exec(db, "pragma wal_autocheckpoint = > 40000;", 0, 0, 0) != SQLITE_OK) > throwSQLiteError(db, "setting autocheckpoint interval"); > > https://git.savannah.gnu.org/cgit/guix.git/tree/nix/libstore/local-store.cc#n253 > > This means you'd expect the WAL to be in the region of 40000*4096 bytes, > or ~160MB. Assuming the autocheckpointing is keeping up... it doesn't > look to be, since the file is now much larger than this. > > As described here [1], the automatic checkpoints are PASSIVE ones, which > has the advantage of not interrupting any readers or writers, but can > also do nothing if it's being blocked by readers or writers. > > 1: https://www.sqlite.org/wal.html#application_initiated_checkpoints > > What I've found while writing the Guix Build Coordinator, is that when > the service is busy (usually new builds being submitted, plus lots of > builds happening), the PASSIVE checkpoints aren't sufficient. To > supplement them, there's a regular check that looks at the size of the > WAL file, and runs a TRUNCATE checkpoint, which is a FULL checkpoint > (which blocks new writers), plus truncating the WAL file once it's > finished checkpointing the entire WAL. The truncating is mostly so that > it's easier to monitor the size of the WAL, by checking the size of the > file. OK. That may well be what happens on berlin these days: the database is kept busy all day long, so presumably checkpoints don’t happen and the WAL file grows. > I feel like I need to defend SQLite at this point. Tuning the > configuration of a database to get acceptable performance is the norm, I > had to tune the PostgreSQL configuration for data.guix.gnu.org to > improve the performance. It's easier to get in to trouble with SQLite > because it's a lower level too, and requires you to actually initiate > things like checkpoints or periodic optimisation if you want them to > happen. Understood. It’s really not about defending software X against Y, but rather about finding ways to address the issues we experience. >> • ‘db.sqlite’ weighs in at 19 GiB (!) so perhaps there’s something to >> do, like the “VACUUM” thing maybe. Chris? > > Doing a VACCUM might address some fragmentation and improve performance, > it's probably worth trying. Alright, let’s give it a try. >> • Stracing the session’s guix-daemon process during GC suggests that >> most of the time goes into I/O from ‘db.sqlite’. It’s not >> surprising because that GC phase is basically about browsing the >> database, but it does seem to take a little too long for each store >> item. > > At least the way I've approached finding and fixing the poor performance > issues in the Guix Build Coordinator is through adding instrumentation, > so just recording the time that calling particular procedures takes, and > then logging if it's longer than some threshold. Yeah, that makes sense. I think we always took these bits of the daemon for granted because they’d been used on large stores even before Guix existed. > Since this issue is about Cuirass, there's also the possibility of > avoiding the problems of a large store, by avoiding having a large > store. That's what bordeaux.guix.gnu.org does, and I thought it was part > of the plan for ci.guix.gnu.org (at least around a year ago)? That’s indeed the case: the store is smaller than it used to be (but still 27 TiB), it’s GC’d more aggressively than before, and instead we rely on /var/cache/guix/publish for long-term storage. Perhaps we should go further and keep the store smaller though. Thanks for your feedback! Ludo’.
