[Launchpad-dev] status of search performance - why do we use | not ?
I've been looking into our search performance, particularly on our duplicate finding page. The basic problem appears to be selectivity (now that the table-scanning front-end code is eliminated). We're doing an | operation to combine search terms. That means, than when a user searches for 'installing eclipse I get an unmet dependencies error', we translate it into: SELECT some-stuff FROM Bug, BugTask WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ ftq('depend|eclips|error|get|instal|unmet') AND (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2 AND BugSubscription.person = TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1) ORDER BY -rank(bug.fti, ftq('depend|eclips|error|get|instal|unmet')), bugtask.id LIMIT 40 OFFSET 0; This evaluates 20 rows to sort and then discards them: count 216995 (1 row) Time: 4897.453 ms This 20 result set is arguably fallout from removing the table-wide scanning front end code, but it is still less work overall: and its exposed the deeper problem of using |. Consider this instead: SELECT count(*) FROM Bug, BugTask , ftq('(dependeclipserrorgetinstalunmet)|(errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)') as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2 AND BugSubscription.person = TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1); count --- 260 (1 row) Time: 607.873 ms This hand crafted query looks for 'any four of the terms' When ranked and limited its: (40 rows) Time: 564.161 ms Which is tolerable. Alternatively, we can just search for all the terms supplied: SELECT count(*) FROM Bug, BugTask , ftq('(dependeclipserrorgetinstalunmet)') as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2 AND BugSubscription.person = TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1); count --- 0 (1 row) Time: 62.975 ms This is perhaps going to show 'nothing visible' when really its a near miss. So, I'm thinking that perhaps a better way - and this is a band aid, until we fully review search - is to progressively return less and less refined searches until we have our N(defaults to 10 )or so similar bugs. That is (psuedocode): for bug in find(dependeclipserrorgetinstalunmet): yield bug for bug in find((errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)): yield bug for bug in find We should be able to represent this in sql with some effort - a first approximate would be SELECT 1, bugtask.id FROM Bug, BugTask , ftq('(dependeclipserrorgetinstalunmet)') as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2 AND BugSubscription.person = TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1) union SELECT 2, bugtask.id FROM Bug, BugTask , ftq('(dependeclipserrorgetinstalunmet)|(errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)') as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2 AND BugSubscription.person = TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1) order by 1 LIMIT 40 OFFSET 0; Where each successive expansion gets a higher ordinal - some brief experiments suggest this doesn't short circuit though - even though its satisfied early on it keeps processing just incase one of the other statements in the union has a different first column. The theory being: - highly refined searches are very fast - we'll get reasonably relevant (most terms found - sadly not tf-idf, but I have ideas for that) first - we can stop far short of provoking 14-30 second searches on prod There are, fortunately, only a few users of nl_search to modify in this way. Stuart, if you could comment on my reasoning here, and on whether there is a good way to represent this directly in SQL (so we can pass around expression objects rather than something more complex), that would be wonderful. Of course, the API wanting to count(*) will shoot us in the foot still, but we can fix API search once the web UI is fixed. -Rob
Re: [Launchpad-dev] status of search performance - why do we use | not ?
Sort of along these lines, when doing advanced search if you specify a set of tags to search on, by default it selects combinator 'Any', where as if it defaulted to 'All' it would return a smaller set. In addition, as a user that comes closer to my expectations of what the default behavior would be. On Sat, Jul 24, 2010 at 09:39:24AM +0200, Robert Collins wrote: I've been looking into our search performance, particularly on our duplicate finding page. The basic problem appears to be selectivity (now that the table-scanning front-end code is eliminated). We're doing an | operation to combine search terms. That means, than when a user searches for 'installing eclipse I get an unmet dependencies error', we translate it into: SELECT some-stuff FROM Bug, BugTask WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ ftq('depend|eclips|error|get|instal|unmet') AND (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2 AND BugSubscription.person = TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1) ORDER BY -rank(bug.fti, ftq('depend|eclips|error|get|instal|unmet')), bugtask.id LIMIT 40 OFFSET 0; This evaluates 20 rows to sort and then discards them: count 216995 (1 row) Time: 4897.453 ms This 20 result set is arguably fallout from removing the table-wide scanning front end code, but it is still less work overall: and its exposed the deeper problem of using |. Consider this instead: SELECT count(*) FROM Bug, BugTask , ftq('(dependeclipserrorgetinstalunmet)|(errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)') as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2 AND BugSubscription.person = TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1); count --- 260 (1 row) Time: 607.873 ms This hand crafted query looks for 'any four of the terms' When ranked and limited its: (40 rows) Time: 564.161 ms Which is tolerable. Alternatively, we can just search for all the terms supplied: SELECT count(*) FROM Bug, BugTask , ftq('(dependeclipserrorgetinstalunmet)') as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2 AND BugSubscription.person = TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1); count --- 0 (1 row) Time: 62.975 ms This is perhaps going to show 'nothing visible' when really its a near miss. So, I'm thinking that perhaps a better way - and this is a band aid, until we fully review search - is to progressively return less and less refined searches until we have our N(defaults to 10 )or so similar bugs. That is (psuedocode): for bug in find(dependeclipserrorgetinstalunmet): yield bug for bug in find((errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)): yield bug for bug in find We should be able to represent this in sql with some effort - a first approximate would be SELECT 1, bugtask.id FROM Bug, BugTask , ftq('(dependeclipserrorgetinstalunmet)') as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2 AND BugSubscription.person = TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1) union SELECT 2, bugtask.id FROM Bug, BugTask , ftq('(dependeclipserrorgetinstalunmet)|(errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)') as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2 AND BugSubscription.person = TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1) order by 1 LIMIT 40 OFFSET 0; Where each successive expansion gets a higher ordinal - some brief experiments suggest this doesn't short circuit though - even though its satisfied early on it keeps processing just incase one of the other statements in the union has a different first column. The theory being: - highly refined searches are very fast - we'll get reasonably relevant (most terms found - sadly not tf-idf, but I have ideas for that) first - we can stop far short of provoking 14-30 second
[Launchpad-dev] APIs and len() of collections
I know that there is some work underway at the moment to defer the point where we call len() on collections. I'd like feedback on an even more ambitious proposal: - not calling len() ever I need input here - where do people use len(), why do they use len(), what would the impact of nuking it be? We need this input to build better interfaces - ones that scale and perform well. Some inputs that lead me to proposing this goal: - len() is a precise interface - highly precise counting is extremely expensive. - the results of such counting are also stale almost immediately: API's query in separate transactions each time - its not useful for users [20 open bugs vs 21 is a near-valueless distinction] I think that in an ideal world we'd just remove the facility in devel; I'm sure there are places where something-like-it will be /needed/, but I propose that we should make that be a rare exception, not the common case. A related issue is pagination in API's, which really doesn't make sense, I'll pull on that separately if possible though. So far, I've thought of two replacement interfaces: - estimate_size(collection) = {0..99, hundreds, thousands, millions...} This would be used for providing UI feedback on collections - closed_since|changed_since parameters on various searches, so that the use of len() to generate trend lines is able to be done - we can precisely identify recent work without precisely identifying total unfiltered collection size. What do you think? -Rob ___ Mailing list: https://launchpad.net/~launchpad-dev Post to : launchpad-dev@lists.launchpad.net Unsubscribe : https://launchpad.net/~launchpad-dev More help : https://help.launchpad.net/ListHelp
Re: [Launchpad-dev] New feature in Launchpad TestCase base class
Hi, I've just landed a change to use the great testtools feature of result attachments (addDetails()) to attach any generated oops to the result. I did this after I broke XMLRPC and had several tests that just failed with an OOPS id on the xmlrpc client side. On Tue, 4 May 2010 10:38:11 +0100, Jonathan Lange j...@canonical.com wrote: Perhaps we should go a step further and fail tests if they generate any oopses? It would be easy enough to provide a method that flushes out the stored oopses list. I think it would be good to do this as well. I'm unsure about a method to clear self.oopses though, as it could mask issues. How about def clearOopses(self, count=1): self.assertEqual( count, len(self.oopses), There were an unexpected number of oops generated by this test: expected %d, got %d. % (count, len(self.oopses))) self.oopses = [] This will ensure that if you manage to trigger two oops when you expect only one you will get a failure, and the code I just landed will make sure that you can see them. Thanks, James ___ Mailing list: https://launchpad.net/~launchpad-dev Post to : launchpad-dev@lists.launchpad.net Unsubscribe : https://launchpad.net/~launchpad-dev More help : https://help.launchpad.net/ListHelp
Re: [Launchpad-dev] status of search performance - why do we use | not ?
On 24 July 2010 09:47, Bryce Harrington br...@canonical.com wrote: Sort of along these lines, when doing advanced search if you specify a set of tags to search on, by default it selects combinator 'Any', where as if it defaulted to 'All' it would return a smaller set. In addition, as a user that comes closer to my expectations of what the default behavior would be. +10 Out of hundreds of searches I don't think I've ever wanted 'or', and a fairly popular bug on lp suggests others feel the same. To me the obvious next step is to just match only the bugs containing words that were actually requested. If there's a typo or a slightly incorrect search, just ignoring that term is of questionable benefit. -- Martin ___ Mailing list: https://launchpad.net/~launchpad-dev Post to : launchpad-dev@lists.launchpad.net Unsubscribe : https://launchpad.net/~launchpad-dev More help : https://help.launchpad.net/ListHelp
Re: [Launchpad-dev] status of search performance - why do we use | not ?
On 24 July 2010 12:25, Martin Pool m...@canonical.com wrote: On 24 July 2010 09:47, Bryce Harrington br...@canonical.com wrote: Sort of along these lines, when doing advanced search if you specify a set of tags to search on, by default it selects combinator 'Any', where as if it defaulted to 'All' it would return a smaller set. In addition, as a user that comes closer to my expectations of what the default behavior would be. +10 Out of hundreds of searches I don't think I've ever wanted 'or', and a fairly popular bug on lp suggests others feel the same. To me the obvious next step is to just match only the bugs containing words that were actually requested. If there's a typo or a slightly incorrect search, just ignoring that term is of questionable benefit. ps, this would be a great place to try out feature flags - give it to some users and see what they think. -- Martin ___ Mailing list: https://launchpad.net/~launchpad-dev Post to : launchpad-dev@lists.launchpad.net Unsubscribe : https://launchpad.net/~launchpad-dev More help : https://help.launchpad.net/ListHelp