On Mon, Nov 29, 2010 at 12:27 PM, Ian Booth <[email protected]> wrote:
>
>> Actual query code should go in/under the persistence layer. I imagine
>> we'll have some general code and some code specific to the backend
>> stores that we have (which today is the three pg stores - session,
>> launchpad, launchpad_slave). I include in 'actual query code'
>> collection size estimates. It would be nice to enable systematic use
>> of size estimates in this layer, though it's not a deliberately scoped
>> task.
>
> Just to check I understand what you are saying - in the past, I've
> augmented collection queries with a batch size to reflect the required
> number of elements to be loaded per query, often reflecting, for example,
> the pagination size of the view or the processing batch size of some
> business logic operation; the idea being that there's a chance that not
> all of the collection will be required (e.g. if the user only views the
> first page of results), so why load what's not likely to be needed. Is
> this what you mean by "size estimates"?
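The load-per-batch idea Ian describes could be sketched roughly as follows. This is only an illustration, not Launchpad code: `fetch(offset, limit)` is a hypothetical callable standing in for the underlying collection query with LIMIT/OFFSET applied, so only the batches a caller actually consumes are ever queried.

```python
def iter_batches(fetch, batch_size):
    """Yield results batch by batch, one query per batch consumed.

    `fetch(offset, limit)` is a hypothetical hook that runs the real
    collection query with LIMIT/OFFSET; if the user only views the
    first page, only the first query ever executes.
    """
    offset = 0
    while True:
        batch = fetch(offset, batch_size)
        if not batch:
            return
        yield batch
        if len(batch) < batch_size:
            # A short batch means the collection is exhausted.
            return
        offset += batch_size

# A fake backend standing in for a database query:
data = list(range(10))
fetch = lambda offset, limit: data[offset:offset + limit]

pages = iter_batches(fetch, 4)
first = next(pages)  # only one "query" has run so far
```

The point is that the batch size comes from the view or business-logic layer (pagination size, processing chunk size), while the layer below decides how to turn it into queries.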
There are two places where we need counts (vs actual data):
 - collection sizing for batching, both web UI (first/prev/next/last) and API (472 items)
 - aggregates for side bars (50 open bugs)

In the former case, any figure we return is stale as soon as we return it: more objects can be added or deleted. In the latter case the figure is also stale, but unlike the former, no followup requests happen.

So, why does this matter? count(*) is rarely cheaper than executing the full query in PostgreSQL. If determining the ids for all open bugs in Ubuntu takes 5 seconds, determining the count will also take 5 seconds. There are many open bugs in Ubuntu :)

What can we do about it?
 - tune queries to be 'fast enough' that count(*) is ok
 - denormalise counts rather than computing them with count(*)
 - use the table stats in PostgreSQL to make educated guesses

I'm saying that this layer is the point at which we choose - code above it doesn't need to know /how/ (but perhaps may need to say 'accuracy is needed'); code below implements a chosen approach.

>> Relations that are not traversed are not queried; we can select down
>> to individual attributes in a similar fashion to the .filter attribute
>> - using a .get or .retrieve attribute.
>
> As well as not querying relations that are not required, it's also key
> to minimise the query count to get the data (attributes and collections)
> that is required, and execute the most efficient queries possible
> according to the underlying database's capabilities and quirks.

A small note: fewer queries != more efficient. More efficient is more efficient, and sometimes that means returning one table and then querying a related table directly. Query count should be /constant/ once efficiency tradeoffs are established; but less is not always more.

> One
> thing I don't think I have seen explicitly mentioned is the notion of an
> object query language (or maybe I missed it).
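To make the 'educated guesses' option above concrete: PostgreSQL's planner already computes a row estimate from table statistics, and it appears in EXPLAIN output, so an estimated count costs almost nothing compared with executing the full query. A minimal sketch of harvesting that estimate; the plan text here is a hard-coded sample, where real code would obtain it via the driver (e.g. `cursor.execute("EXPLAIN SELECT ...")`):

```python
import re

def estimated_count(explain_output):
    """Pull the planner's row estimate out of EXPLAIN output.

    PostgreSQL prints a cost line like
    'Seq Scan on bug  (cost=0.00..431.00 rows=20810 width=37)';
    the rows= figure is an educated guess derived from table
    statistics, not a scan of the data.
    """
    match = re.search(r'rows=(\d+)', explain_output)
    if match is None:
        raise ValueError('no row estimate found in plan')
    return int(match.group(1))

# Sample plan line; in real code this would come from the database.
plan = "Seq Scan on bug  (cost=0.00..431.00 rows=20810 width=37)"
```

Since any count we return is stale the moment we return it anyway, an estimate of this kind is often good enough for batching UIs, while denormalised counters suit the side-bar aggregates.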
> While conceivably a
> separate problem and out of scope to what's being discussed here, the
> type of high level constructs available tend to make it easier for
> developers to specify what they want in terms closer aligned to the end
> representation of the data, and help constrain the ways in which data is
> accessed, and hence improve the ability to optimise under the covers as
> part of the mapping from the object query language to sql.
>
> I think the pseudo code which I have snipped out reflects it, but in my
> view we also need to ensure there is a clear separation between the
> verbs/actions and the nouns/model, e.g. so the bugs collection class
> (whatever it is called - IBugCollection, IBugs, IBugManager) should have
> methods like findUnassignedBugs() or findBugAssignedTo(IPerson) rather
> than the apis just mentioned being on the IBug interface.

mmm, I'm not convinced either way here. I think we have room to play it a couple of ways.

> One extra point I would like to make in relation to the LEP:
>
> "Not requiring a cache in the layer"
>
> In my view, we need to distinguish the type of cache we are talking
> about. If we are talking about an L2-type cache with an object
> lifecycle/ttl which spans individual system interactions with the
> persistence layer, and which implies the need for replication in a
> clustered environment to maintain data consistency, then I agree that we
> should try and avoid the need for this. However, I think some form of
> caching within the bounds of a single interaction is useful and perhaps
> necessary to minimise unnecessary hits on the database. The cache is
> discarded when the interaction ends but allows objects already loaded
> (whether via a single getById type operation or as a result of a query)
> to be accessed from the cache if required. This is all done
> transparently by the implementation so no explicit user code is required
> to make it work. Hibernate uses this concept with its Session construct.
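The per-interaction cache Ian describes is essentially an identity map scoped to one request. A toy sketch, with entirely hypothetical names (`Interaction`, `load_from_db` stands in for a real getById query):

```python
class Interaction:
    """Hypothetical per-interaction identity map: it lives for a
    single request, so it needs no cross-process replication, and
    it is simply discarded when the interaction ends."""

    def __init__(self, load):
        self._load = load      # stand-in for the real database fetch
        self._cache = {}       # (class name, id) -> loaded object
        self.db_hits = 0

    def get(self, klass, obj_id):
        key = (klass, obj_id)
        if key not in self._cache:
            self.db_hits += 1
            self._cache[key] = self._load(klass, obj_id)
        return self._cache[key]

# Fake loader standing in for the database:
def load_from_db(klass, obj_id):
    return {'class': klass, 'id': obj_id}

interaction = Interaction(load_from_db)
bug = interaction.get('Bug', 1)
same_bug = interaction.get('Bug', 1)  # served from the cache, no second hit
```

Repeated lookups within the interaction return the same object without touching the database, which is the transparency Ian is after; whether that belongs in this layer or in an ORM above it is the open question.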
The hibernate Session lives at the ORM layer; this is above that layer, I think: there may well be an ORM layer, but let's not conflate responsibilities.

-Rob

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp

