The bug summary table I we started work on ~ 6 weeks back is now in devel - it will be in the deploy happening this week.
This is a pretty significant bit of work (though not complete), and I'd like to pick its design apart a bit - you may find other areas of the systems your are building that need similar solutions. One thing to note is that if we *had* a reliable-queue w/2pc we could have implemented this as an external service accepting deltas on changes to bugs, but we don't [yet]. We can still decide to move it out of the main database using a few different alternatives, but for now it is a trigger based design. I'm also going to avoid duplicating the docs - http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/view/head:/lib/lp/bugs/doc/bugsummary.txt Bug summary is a fact table - http://en.wikipedia.org/wiki/Fact_table - which records various discrete /sets/ of data about bug tasks/bugs. It does not identify individual bugs (but we could build a join table if we wanted to). Every bug in Launchpad is aggregated into one-or-more rows in the fact table. The aggregation is maintained by a number of triggers which fire either before, after or both before-and-after rows are changed in Postgresql. For each dimension value (http://en.wikipedia.org/wiki/Dimension_table) that we want to report on, we add one to the 'count' field in the fact table. This means that when a bug/bugtask is multivalued in some dimension, we record it many times. bug *tasks* are one form of multi-value, so we have an immediate multiple-recording case there. Tags are multi-valued, so we also need to record against each tag (including the NULL tag, so that we get very cheap 'count everything' cases). Similarly sourcepackagename targeted bugtasks are counted twice: once against the distro[series] + sourcepackagename, and once against the distro[series] with sourcepackagename forced to NULL. Finally we do the same thing for subscribers of *private* bugs so that we can give (reasonably) accurate figures for private bugs too. The rows we add 1 to the count field are gained by taking the cross product of all the dimension values that apply to a bug. So a bug with: 2 tasks ('bzr' and 'ubuntu/bzr') 1 tag ('foo') will write to 6 rows: ('bzr', tag=NULL), ('bzr', tag='foo'), ('ubuntu/bzr', tag=NULL), ('ubuntu/bzr', tag='foo'), ('ubuntu', tag=NULL), ('ubuntu', tag='foo') Now, the schema is subject to evolution - the basics are in place but we may/will find things we cannot query yet (for instance, site-wide tag clouds) : feel free to improve on the schema, or chat to me/Stuart about the design if you're not sure. The magic though, is all in the querying. Revision 13171 of devel uses bugsummary to generate the tag portlets (like https://bugs.launchpad.net/ubuntu/+bugtarget-portlet-tags-content). These are currently a high volume timeout: hard/soft 20 / 146 Distribution:+bugtarget-portlet-tags-content The PPR tells me that this page has a 99% percentile render time of 13.2 seconds, and a mean of 6.6. The bugsummary page can deliver the main query for that page in 350ms *cold*. 170ms *hot*. So, I'm pretty sure that we'll have this timeout fixed next week :) Why is it so fast? This is an example of the new query: select * from (with teams as (select team from teamparticipation where person=319329) (SELECT tag, sum(count) FROM Bugsummary WHERE status IN (10, 15, 20, 21, 22, 25) AND distribution = 1 AND sourcepackagename IS NULL AND tag IS NOT NULL AND (viewed_by IS NULL OR viewed_by in (select team from teams)) GROUP BY tag ORDER BY sum(count) DESC, tag LIMIT 10) UNION (SELECT tag, sum(count) FROM Bugsummary WHERE status IN (10, 15, 20, 21, 22, 25) AND distribution = 1 AND sourcepackagename IS NULL AND tag IS NOT NULL AND (viewed_by IS NULL OR viewed_by in (select team from teams)) AND tag IN ('a11y', 'armel', 'bitesize', 'cherry-pick', 'compat-wireless', 'gnome3', 'hibernate', 'hw-specific', 'kernel-bug', 'lo33', 'manpage', 'metabug', 'multiarch', 'needs-design', 'needs-packaging', 'needs-reassignment', 'packaging', 'patch', 'patch-accepted-debian', 'patch-accepted-upstream', 'patch-forwarded-debian', 'patch-forwarded-upstream', 'patch-needswork', 'patch-rejected', 'patch-rejected-debian', 'patch-rejected-upstream', 'python27', 'regression-proposed','regression-release', 'regression-update', 'resume', 'string-fix', 'suspend', 'touch', 'ubuntu-unr', 'upgrade-software-version', 'verification-done', 'verification-failed', 'verification-needed') GROUP BY tag ORDER BY sum(count) DESC, tag)) as _tmp order by sum; The union handles both high-use tags and official tags. The official tag list is obtained in a separate query (but could easily be folded into this one query in future). If we look at what this does, it selects rows from bugsummary for the target we want to summarise (the distribution - to do a source package we would change the sourcepackagename clause to match a sourcepackagename id). In the first half we pin the tag field to non NULL because tag=NULL includes every bug - we would double count and besides, the NULL tag is pretty boring for this portlet :) We select the statuses that we want (note that we cannot discriminate on incomplete-with-response yet, because thats not a discrete status [yet]). The first half of this query considers 8500 rows in the bugsummary table. The second half considers 270 rows. So all in all its examing about 1% of the data that our currently-live portlet code examines - and this is why its substantially faster. Cheers, Rob _______________________________________________ Mailing list: https://launchpad.net/~launchpad-dev Post to : launchpad-dev@lists.launchpad.net Unsubscribe : https://launchpad.net/~launchpad-dev More help : https://help.launchpad.net/ListHelp