[Launchpad-dev] status of search performance - why do we use | not ?

2010-07-24 Thread Robert Collins
I've been looking into our search performance, particularly on our
duplicate finding page.

The basic problem appears to be selectivity (now that the
table-scanning front-end code is eliminated).

We're doing an | operation to combine search terms.

That means, than when a user searches for 'installing eclipse I get an
unmet dependencies error', we translate it into:
SELECT some-stuff FROM Bug, BugTask WHERE Bug.id = BugTask.bug AND
BugTask.distribution = 1 AND Bug.fti @@
ftq('depend|eclips|error|get|instal|unmet') AND (Bug.private = FALSE
OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription,
TeamParticipation WHERE TeamParticipation.person = 2 AND
BugSubscription.person = TeamParticipation.team AND
BugSubscription.bug = Bug.id)) AND (1=1) ORDER BY -rank(bug.fti,
ftq('depend|eclips|error|get|instal|unmet')), bugtask.id LIMIT 40
OFFSET 0;

This evaluates 20 rows to sort and then discards them:
 count

 216995
(1 row)
Time: 4897.453 ms

This 20 result set is arguably fallout from removing the
table-wide scanning front end code, but it is still less work overall:
and its exposed the deeper problem of using |.

Consider this instead:
SELECT count(*) FROM Bug, BugTask ,
ftq('(dependeclipserrorgetinstalunmet)|(errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)')
as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND
Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT
BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE
TeamParticipation.person = 2 AND BugSubscription.person =
TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1);
 count
---
   260
(1 row)
Time: 607.873 ms

This hand crafted query looks for 'any four of the terms'  When ranked
and limited its:
(40 rows)
Time: 564.161 ms

Which is tolerable.

Alternatively, we can just search for all the terms supplied:

SELECT count(*) FROM Bug, BugTask ,
ftq('(dependeclipserrorgetinstalunmet)') as query WHERE Bug.id =
BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND
(Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM
BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2
AND BugSubscription.person = TeamParticipation.team AND
BugSubscription.bug = Bug.id)) AND (1=1);
 count
---
 0
(1 row)
Time: 62.975 ms

This is perhaps going to show 'nothing visible' when really its a near miss.

So, I'm thinking that perhaps a better way - and this is a band aid,
until we fully review search - is to progressively return less and
less refined searches until we have our N(defaults to 10 )or so
similar bugs.

That is (psuedocode):
for bug in find(dependeclipserrorgetinstalunmet): yield bug
for bug in 
find((errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)):
yield bug
for bug in find 

We should be able to represent this in sql with some effort - a first
approximate would be
SELECT 1, bugtask.id FROM Bug, BugTask ,
ftq('(dependeclipserrorgetinstalunmet)') as query WHERE Bug.id =
BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND
(Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM
BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2
AND BugSubscription.person = TeamParticipation.team AND
BugSubscription.bug = Bug.id)) AND (1=1) union SELECT 2, bugtask.id
FROM Bug, BugTask ,
ftq('(dependeclipserrorgetinstalunmet)|(errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)')
as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND
Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT
BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE
TeamParticipation.person = 2 AND BugSubscription.person =
TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1)
order by 1 LIMIT 40 OFFSET 0;

Where each successive expansion gets a higher ordinal - some brief
experiments suggest this doesn't short circuit though - even though
its satisfied early on it keeps processing just incase one of the
other statements in the union has a different first column.

The theory being:
 - highly refined searches are very fast
 - we'll get reasonably relevant (most terms found - sadly not tf-idf,
but I have ideas for that) first
 - we can stop far short of provoking 14-30 second searches on prod

There are, fortunately, only a few users of nl_search to modify in this way.

Stuart, if you could comment on my reasoning here, and on whether
there is a good way to represent this directly in SQL (so we can pass
around expression objects rather than something more complex), that
would be wonderful.

Of course, the API wanting to count(*) will shoot us in the foot
still, but we can fix API search once the web UI is fixed.

-Rob


Re: [Launchpad-dev] status of search performance - why do we use | not ?

2010-07-24 Thread Bryce Harrington
Sort of along these lines, when doing advanced search if you specify a
set of tags to search on, by default it selects combinator 'Any', where
as if it defaulted to 'All' it would return a smaller set.

In addition, as a user that comes closer to my expectations of what the
default behavior would be.

On Sat, Jul 24, 2010 at 09:39:24AM +0200, Robert Collins wrote:
 I've been looking into our search performance, particularly on our
 duplicate finding page.
 
 The basic problem appears to be selectivity (now that the
 table-scanning front-end code is eliminated).
 
 We're doing an | operation to combine search terms.
 
 That means, than when a user searches for 'installing eclipse I get an
 unmet dependencies error', we translate it into:
 SELECT some-stuff FROM Bug, BugTask WHERE Bug.id = BugTask.bug AND
 BugTask.distribution = 1 AND Bug.fti @@
 ftq('depend|eclips|error|get|instal|unmet') AND (Bug.private = FALSE
 OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription,
 TeamParticipation WHERE TeamParticipation.person = 2 AND
 BugSubscription.person = TeamParticipation.team AND
 BugSubscription.bug = Bug.id)) AND (1=1) ORDER BY -rank(bug.fti,
 ftq('depend|eclips|error|get|instal|unmet')), bugtask.id LIMIT 40
 OFFSET 0;
 
 This evaluates 20 rows to sort and then discards them:
  count
 
  216995
 (1 row)
 Time: 4897.453 ms
 
 This 20 result set is arguably fallout from removing the
 table-wide scanning front end code, but it is still less work overall:
 and its exposed the deeper problem of using |.
 
 Consider this instead:
 SELECT count(*) FROM Bug, BugTask ,
 ftq('(dependeclipserrorgetinstalunmet)|(errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)')
 as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND
 Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT
 BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE
 TeamParticipation.person = 2 AND BugSubscription.person =
 TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1);
  count
 ---
260
 (1 row)
 Time: 607.873 ms
 
 This hand crafted query looks for 'any four of the terms'  When ranked
 and limited its:
 (40 rows)
 Time: 564.161 ms
 
 Which is tolerable.
 
 Alternatively, we can just search for all the terms supplied:
 
 SELECT count(*) FROM Bug, BugTask ,
 ftq('(dependeclipserrorgetinstalunmet)') as query WHERE Bug.id =
 BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND
 (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM
 BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2
 AND BugSubscription.person = TeamParticipation.team AND
 BugSubscription.bug = Bug.id)) AND (1=1);
  count
 ---
  0
 (1 row)
 Time: 62.975 ms
 
 This is perhaps going to show 'nothing visible' when really its a near miss.
 
 So, I'm thinking that perhaps a better way - and this is a band aid,
 until we fully review search - is to progressively return less and
 less refined searches until we have our N(defaults to 10 )or so
 similar bugs.
 
 That is (psuedocode):
 for bug in find(dependeclipserrorgetinstalunmet): yield bug
 for bug in 
 find((errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)):
 yield bug
 for bug in find 
 
 We should be able to represent this in sql with some effort - a first
 approximate would be
 SELECT 1, bugtask.id FROM Bug, BugTask ,
 ftq('(dependeclipserrorgetinstalunmet)') as query WHERE Bug.id =
 BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND
 (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM
 BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2
 AND BugSubscription.person = TeamParticipation.team AND
 BugSubscription.bug = Bug.id)) AND (1=1) union SELECT 2, bugtask.id
 FROM Bug, BugTask ,
 ftq('(dependeclipserrorgetinstalunmet)|(errorgetinstalunmet)|(dependeclipsgetinstalunmet)|(dependeclipserrorinstalunmet)|(dependeclipserrorgetunmet)|(dependeclipserrorgetinstal)')
 as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND
 Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT
 BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE
 TeamParticipation.person = 2 AND BugSubscription.person =
 TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1)
 order by 1 LIMIT 40 OFFSET 0;
 
 Where each successive expansion gets a higher ordinal - some brief
 experiments suggest this doesn't short circuit though - even though
 its satisfied early on it keeps processing just incase one of the
 other statements in the union has a different first column.
 
 The theory being:
  - highly refined searches are very fast
  - we'll get reasonably relevant (most terms found - sadly not tf-idf,
 but I have ideas for that) first
  - we can stop far short of provoking 14-30 second 

[Launchpad-dev] APIs and len() of collections

2010-07-24 Thread Robert Collins
I know that there is some work underway at the moment to defer the
point where we call len() on collections. I'd like feedback on an even
more ambitious proposal:

 - not calling len() ever

I need input here - where do people use len(), why do they use len(),
what would the impact of nuking it be? We need this input to build
better interfaces - ones that scale and perform well.

Some inputs that lead me to proposing this goal:
 - len() is a precise interface

 - highly precise counting is extremely expensive.

 - the results of such counting are also stale almost immediately:
API's query in separate transactions each time

 - its not useful for users [20 open bugs vs 21 is a
near-valueless distinction]

I think that in an ideal world we'd just remove the facility in devel;
I'm sure there are places where something-like-it will be /needed/,
but I propose that we should make that be a rare exception, not the
common case.

A related issue is pagination in API's, which really doesn't make
sense, I'll pull on that separately if possible though.

So far, I've thought of two replacement interfaces:
 - estimate_size(collection) = {0..99, hundreds, thousands, millions...}
   This would be used for providing UI feedback on collections

 - closed_since|changed_since parameters on various searches, so that
the use of len() to generate trend lines is able to be done - we can
precisely identify recent work without precisely identifying total
unfiltered collection size.

What do you think?

-Rob

___
Mailing list: https://launchpad.net/~launchpad-dev
Post to : launchpad-dev@lists.launchpad.net
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp


Re: [Launchpad-dev] New feature in Launchpad TestCase base class

2010-07-24 Thread James Westby
Hi,

I've just landed a change to use the great testtools feature of result
attachments (addDetails()) to attach any generated oops to the result.

I did this after I broke XMLRPC and had several tests that just failed
with an OOPS id on the xmlrpc client side.

On Tue, 4 May 2010 10:38:11 +0100, Jonathan Lange j...@canonical.com wrote:
 Perhaps we should go a step further and fail tests if they generate
 any oopses? It would be easy enough to provide a method that flushes
 out the stored oopses list.

I think it would be good to do this as well. I'm unsure about a method
to clear self.oopses though, as it could mask issues.

How about

  def clearOopses(self, count=1):
  self.assertEqual(
  count, len(self.oopses), There were an unexpected number 
  of oops generated by this test: expected %d, got %d.
  % (count, len(self.oopses)))
  self.oopses = []

This will ensure that if you manage to trigger two oops when you expect
only one you will get a failure, and the code I just landed will make
sure that you can see them.

Thanks,

James

___
Mailing list: https://launchpad.net/~launchpad-dev
Post to : launchpad-dev@lists.launchpad.net
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp


Re: [Launchpad-dev] status of search performance - why do we use | not ?

2010-07-24 Thread Martin Pool
On 24 July 2010 09:47, Bryce Harrington br...@canonical.com wrote:
 Sort of along these lines, when doing advanced search if you specify a
 set of tags to search on, by default it selects combinator 'Any', where
 as if it defaulted to 'All' it would return a smaller set.

 In addition, as a user that comes closer to my expectations of what the
 default behavior would be.

+10

Out of hundreds of searches I don't think I've ever wanted 'or', and a
fairly popular bug on lp suggests others feel the same.   To me the
obvious next step is to just match only the bugs containing words that
were actually requested.  If there's a typo or a slightly incorrect
search, just ignoring that term is of questionable benefit.

-- 
Martin

___
Mailing list: https://launchpad.net/~launchpad-dev
Post to : launchpad-dev@lists.launchpad.net
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp


Re: [Launchpad-dev] status of search performance - why do we use | not ?

2010-07-24 Thread Martin Pool
On 24 July 2010 12:25, Martin Pool m...@canonical.com wrote:
 On 24 July 2010 09:47, Bryce Harrington br...@canonical.com wrote:
 Sort of along these lines, when doing advanced search if you specify a
 set of tags to search on, by default it selects combinator 'Any', where
 as if it defaulted to 'All' it would return a smaller set.

 In addition, as a user that comes closer to my expectations of what the
 default behavior would be.

 +10

 Out of hundreds of searches I don't think I've ever wanted 'or', and a
 fairly popular bug on lp suggests others feel the same.   To me the
 obvious next step is to just match only the bugs containing words that
 were actually requested.  If there's a typo or a slightly incorrect
 search, just ignoring that term is of questionable benefit.

ps, this would be a great place to try out feature flags - give it to
some users and see what they think.


-- 
Martin

___
Mailing list: https://launchpad.net/~launchpad-dev
Post to : launchpad-dev@lists.launchpad.net
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp