On Thu, May 29, 2025 at 01:03:56PM +0200, Julien Plissonneau Duquène wrote:
Right now we have an implementation that is dated but mostly works so
I think that there is no need to rush a move. Working on it for a
while and experimenting with the real data in there will certainly
help with figuring out what could be done about the storage.
That data is somewhat transactional, moderately relational, also with
relations like bugs that are merged with bugs, block bugs, affect
packages which are dependencies of packages, are found in versions
which have versions as successors ... and SQL is not that great at
recursivity or working with graphs. Or handling large binary objects.
Or doing finely tunable full-text indexing and search.
People underestimate PostgreSQL: I've done all these things in it or
their very near equivalents, and it works very well in practice. Most
of them aren't even hard. You have to bend your brain a bit to handle
recursion and graphs, but it's totally doable and usually far faster
than doing the equivalent on the application side. And even if you
don't bother tuning its dictionaries or similar, its full-text indexing
is a perfectly fine starting point - I doubt that debbugs would need
more than very minor tuning.
Admittedly for large binary objects I sometimes use some other kind of
storage. That's likely unnecessary for bug messages, but I could
imagine it being worthwhile for bug attachments. But honestly, the BTS
doesn't have enough data or a high enough rate of data change to really
be a problem; in my last job I ran a PostgreSQL database that was
something like six times the size of bugs.debian.org's entire data set,
even leaving aside its much bigger data store for large objects. At our
scale, we could just stuff bug attachments into bytea columns in a
separate table, maybe mark them "SET STORAGE EXTERNAL" for better
streaming support, give bug messages a one-to-many relationship to them,
and it'd be fine. Or we could even just start by storing the raw form
of each bug message in the DB and leaving it up to the application to
parse it for display purposes; that would be fine for what debbugs does
today, although a bit less ideal for future full-text search since
searches probably wouldn't want to match on most email headers.
(PostgreSQL limits bytea columns using TOAST to 1 GiB. The largest .log
file for any bug in bugs.debian.org is for #599476, which is about 204
MiB - and that's for the entire bug, with several video files as
attachments. So this isn't even close to being a limit of any concern.)
The BTS today has a bunch of difficult-to-follow workarounds that would
completely evaporate if backed by a proper database, allowing much
higher performance and leaving more effort available for useful things.
While I don't want to put words in Don's mouth, from the fact that he
was working on a PostgreSQL port a while ago I infer that he probably
agrees with me on this.
--
Colin Watson (he/him) [cjwat...@debian.org]