Hi Robert,

I'm really happy you've done this :-) The principles you describe in
the email and LEP reflect the sort of thing I gave Tim an ear bashing
about when I first started, so I think it's great that we can look to
progress things in this area.
> One major principle I have is that on-demand loading is actively
> harmful in high performance software: while it's not as convenient
> for adhoc scripts, it's very hard to reliably avoid poor performance
> due to object traversal triggering expensive (e.g. 3-4ms) queries
> thousands of times in a single web query.

+1. I have also found that explicitly considering the data requirements
of the specific use case at hand, rather than wandering over the object
model resolving references as needed in an adhoc way, forces more
thought to be put into how best to efficiently load/query the
underlying data model. This is especially true when it comes to loading
one-many references and avoiding the N+1 select problem (I've put a
small sketch of what I mean below).

One statement in the LEP really highlights to me a major cause of the
problem: "much of the Zope machinery we use is hostile to that
structure : it assumes individual Python object and attribute access is
cheap or free". So we need to put in place a solution which mitigates
this issue, and I think the LEP is working towards a solution which
does that.

> Actual query code should go in/under the persistence layer. I imagine
> we'll have some general code and some code specific to the backend
> stores that we have (which today is the three pg stores - session,
> launchpad, launchpad_slave). I include in 'actual query code'
> collection size estimates. It would be nice to enable systematic use
> of size estimates in this layer, though it's not a deliberately
> scoped task.

Just to check I understand what you are saying: in the past, I've
augmented collection queries with a batch size to reflect the required
number of elements to be loaded per query, often reflecting, for
example, the pagination size of the view or the processing batch size
of some business logic operation. The idea is that there's a chance
that not all of the collection will be required (e.g. if the user only
views the first page of results), so why load what's not likely to be
needed? Is this what you mean by "size estimates"? (There's a sketch of
this below too.)

> Code that *requests* a partial object graph should become a consumer
> of the persistence layer.
>
> Code that works on objects must live above the persistence layer.

+1

<snip> pseudo code </snip>

> Relations that are not traversed are not queried; we can select down
> to individual attributes in a similar fashion to the .filter
> attribute - using a .get or .retrieve attribute.

As well as not querying relations that are not required, it's also key
to minimise the query count needed to get the data (attributes and
collections) that is required, and to execute the most efficient
queries possible according to the underlying database's capabilities
and quirks.

One thing I don't think I have seen explicitly mentioned is the notion
of an object query language (or maybe I missed it). While conceivably a
separate problem and out of scope for what's being discussed here, that
type of high level construct tends to make it easier for developers to
specify what they want in terms more closely aligned to the end
representation of the data, and helps constrain the ways in which data
is accessed, and hence improves the ability to optimise under the
covers as part of the mapping from the object query language to SQL.

I think the pseudo code which I have snipped out reflects it, but in my
view we also need to ensure there is a clear separation between the
verbs/actions and the nouns/model: e.g. the bugs collection class
(whatever it is called - IBugCollection, IBugs, IBugManager) should
have methods like findUnassignedBugs() or findBugAssignedTo(IPerson),
rather than those APIs being on the IBug interface.
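Something like this minimal zope.interface-style sketch (the names here
are just the illustrative ones from above, not a proposal for the
final API):

    from zope.interface import Interface

    class IBug(Interface):
        """A single bug: state and behaviour only, no query verbs."""

    class IBugCollection(Interface):
        """The 'noun': a queryable collection of bugs.

        The query 'verbs' live here rather than on IBug, so the
        persistence layer can plan an efficient query for each verb
        instead of reacting to adhoc attribute traversal.
        """

        def findUnassignedBugs():
            """Return the bugs that have no assignee."""

        def findBugAssignedTo(person):
            """Return the bugs assigned to the given IPerson."""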
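Going back to the N+1 select problem above, here's the shape of both
the problem and the fix, assuming a psycopg2-style DB-API cursor and an
illustrative bugtask(bug, assignee) table (neither is meant to match
our real schema):

    # The anti-pattern: one query per bug to fetch its assignee, so a
    # 100-row page costs 101 round trips at ~3-4ms each.
    def assignees_n_plus_one(cur, bug_ids):
        result = {}
        for bug_id in bug_ids:
            cur.execute(
                "SELECT assignee FROM bugtask WHERE bug = %s",
                (bug_id,))
            row = cur.fetchone()
            result[bug_id] = row[0] if row else None
        return result

    # The fix: state the use case's data requirements up front and
    # satisfy them with a single query.
    def assignees_single_query(cur, bug_ids):
        cur.execute(
            "SELECT bug, assignee FROM bugtask WHERE bug = ANY(%s)",
            (list(bug_ids),))
        return dict(cur.fetchall())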
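And for the batch size question, the sort of thing I've done before
(again, the names and schema are purely illustrative):

    def find_bugs_page(cur, page, page_size=50):
        """Load one page of bugs, fetching only page_size rows.

        If the user never pages past the first page, the remaining
        rows are never pulled from the database at all.
        """
        cur.execute(
            "SELECT id, title FROM bug ORDER BY id"
            " LIMIT %s OFFSET %s",
            (page_size, page * page_size))
        return cur.fetchall()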
One extra point I would like to make in relation to the LEP's "Not
requiring a cache in the layer": in my view, we need to distinguish the
type of cache we are talking about. If we are talking about an L2-style
cache, with an object lifecycle/TTL which spans individual system
interactions with the persistence layer and which implies the need for
replication in a clustered environment to maintain data consistency,
then I agree that we should try to avoid the need for it. However, I
think some form of caching within the bounds of a single interaction is
useful, and perhaps necessary, to minimise unnecessary hits on the
database. The cache is discarded when the interaction ends, but it
allows objects already loaded (whether via a single getById type
operation or as a result of a query) to be accessed from the cache if
required. This is all done transparently by the implementation, so no
explicit user code is required to make it work. Hibernate uses this
concept with its Session construct.
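To be concrete about the kind of per-interaction cache I mean, here's a
minimal identity-map sketch (nothing Launchpad- or Hibernate-specific,
just the shape of the idea):

    class Interaction:
        """Caches loaded objects for the life of one interaction.

        Repeated lookups of the same row hit the dict rather than the
        database, and the whole map is simply discarded when the
        interaction (web request, script run, ...) ends, so there is
        no cross-request state to replicate or invalidate.
        """

        def __init__(self, loader):
            self._loader = loader      # e.g. a function that runs the SQL
            self._identity_map = {}    # (class, id) -> object

        def get(self, cls, obj_id):
            key = (cls, obj_id)
            if key not in self._identity_map:
                self._identity_map[key] = self._loader(cls, obj_id)
            return self._identity_map[key]

        def close(self):
            self._identity_map.clear()

Objects returned by bulk queries would be registered in the same map,
so a later getById for one of them never touches the database.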
There, that's my 2c.
