On Thu, May 29, 2025 at 01:03:56PM +0200, Julien Plissonneau Duquène wrote:
Right now we have an implementation that is dated but mostly works so I think that there is no need to rush a move. Working on it for a while and experimenting with the real data in there will certainly help with figuring out what could be done about the storage.

That data is somewhat transactional, moderately relational, also with relations like bugs that are merged with bugs, block bugs, affect packages which are dependencies of packages, are found in versions which have versions as successors ... and SQL is not that great at recursivity or working with graphs. Or handling large binary objects. Or doing finely tunable full-text indexing and search.

People underestimate PostgreSQL: I've done all these things in it or their very near equivalents, and it works very well in practice. Most of them aren't even hard. You have to bend your brain a bit to handle recursion and graphs, but it's totally doable and usually far faster than doing the equivalent on the application side. And even if you don't bother tuning its dictionaries or similar, its full-text indexing is a perfectly fine starting point - I doubt that debbugs would need more than very minor tuning.

Admittedly for large binary objects I sometimes use some other kind of storage. That's likely unnecessary for bug messages, but I could imagine it being worthwhile for bug attachments. But honestly, the BTS doesn't have enough data or a high enough rate of data change to really be a problem; in my last job I ran a PostgreSQL database that was something like six times the size of bugs.debian.org's entire data set, even leaving aside its much bigger data store for large objects. At our scale, we could just stuff bug attachments into bytea columns in a separate table, maybe mark them "SET STORAGE EXTERNAL" for better streaming support, give bug messages a one-to-many relationship to them, and it'd be fine. Or we could even just start by storing the raw form of each bug message in the DB and leaving it up to the application to parse it for display purposes; that would be fine for what debbugs does today, although a bit less ideal for future full-text search since searches probably wouldn't want to match on most email headers.

(PostgreSQL limits bytea columns using TOAST to 1 GiB. The largest .log file for any bug in bugs.debian.org is for #599476, which is about 204 MiB - and that's for the entire bug, with several video files as attachments. So this isn't even close to being a limit of any concern.)

The BTS today has a bunch of difficult-to-follow workarounds that would completely evaporate if backed by a proper database, allowing much higher performance and leaving more effort available for useful things. While I don't want to put words in Don's mouth, from the fact that he was working on a PostgreSQL port a while ago I infer that he probably agrees with me on this.

--
Colin Watson (he/him)                              [cjwat...@debian.org]

Reply via email to