On Tue, Dec 30, 2014 at 5:01 PM, Branko Čibej <br...@wandisco.com> wrote:
> On 30.12.2014 16:35, Ivan Zhakov wrote:
> > Just to make things clear. The thorough review was already provided.
> > In short, subpools are not a solution for the root cause of the
> > problem being discussed.
>
> The last bit is where we disagree.

I see more than one issue here:

1. There is an L1 DAG node cache.
   Problem: Added complexity.

2. It uses (now: used) ref counting to allow for copy-less access.
   Problems: Lock management adds complexity. Binding lock lifetime
   to pools triggered the memory consumption problem.

3. Poor isolation of API users from implementation details.
   This is what allowed problem (2) to manifest outside FSFS.

4. Having single-pool API functions.
   Problems: No clear distinction between data that needs to be
   preserved (results) and temporaries. Attempts to tighten pool
   usage will likely introduce bugs or break implicit / undocumented
   assumptions. It is also unclear - without reading the docstrings
   carefully - whether the caller gets a reference to e.g. something
   in svn_fs_t or an object with an independent lifetime.

Complexity in (1) is mitigated by having a narrow interface (few
callers) and is justified by various advantages over the main
membuffer cache.

Problems from (2) should be solved as of r1648538. While DAG nodes
are still heavy objects, they have shrunk and flattened since the
1.7 days when this code was written. Copying them is no longer much
of an overhead.

Issue (3) still stands. A possible solution was reverted in r1648538.

Number (4) is a temporary problem and possibly caused the crash
reported in the original post. Tightening pool usage is worthwhile
because almost no FS API function is inherently trivial: they have
to walk path hierarchies, read files, retry on alternative code
paths, etc. This ties back into issue (3), where traces of that work
leak to the API caller in the form of used pool memory. All the
caching makes memory consumption feel constant, but it is probably
possible to allocate a few MB with a seemingly harmless sequence of
FS calls.
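To make (4) more concrete, here is a minimal sketch of the
result_pool / scratch_pool split; the function name and its body are
made up for illustration, only the convention matters:

    #include <apr_pools.h>
    #include <apr_strings.h>

    /* Hypothetical two-pool FS API function.  Only *PATH_P may
     * occupy caller-visible memory; all intermediate work is
     * confined to SCRATCH_POOL. */
    static apr_status_t
    node_get_path(const char **path_p,
                  const char *node_id,
                  apr_pool_t *result_pool,
                  apr_pool_t *scratch_pool)
    {
      /* Temporaries live in SCRATCH_POOL and die with it. */
      const char *temp = apr_pstrcat(scratch_pool, "/nodes/",
                                     node_id, (char *)NULL);

      /* The result is the only thing allocated in RESULT_POOL. */
      *path_p = apr_pstrdup(result_pool, temp);

      return APR_SUCCESS;
    }

    /* Caller side: temporaries can be reclaimed immediately. */
    static apr_status_t
    caller(apr_pool_t *result_pool)
    {
      apr_pool_t *scratch_pool;
      const char *path;

      apr_pool_create(&scratch_pool, result_pool);
      node_get_path(&path, "42", result_pool, scratch_pool);
      apr_pool_destroy(scratch_pool);   /* PATH remains valid. */

      return APR_SUCCESS;
    }

The point being: the caller can destroy the scratch pool right away
and only pays for the data it actually asked for, instead of for
whatever the implementation happened to allocate along the way.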
> It strikes me that the root problem is amazingly similar to the issue we
> had with pool lifetimes and locking in the implementation of DB_REGISTER
> support in the BDB back-end (that one was my bug). At the time, I
> believe we found that no possible pool usage pattern could avoid bad
> interaction with reference-counted objects in pools, given the way pools
> currently work in APR.
>
> This is going to be a problem with any long-lived cache that depends on
> pools that are created outside our control, as is the case with
> HTTPd/mod_dav_svn, where pool lifetimes in general are completely
> unpredictable. Combined with unpredictable request patterns and even
> less predictable data patterns (since repository content is completely
> outside our control), this makes pools amazingly unsuitable for
> cross-request caches. Even the lifetime of global pools is unpredictable
> because we don't control global pool destruction order.
>
> Given that our API is what it is, I can see no alternative to fixing the
> way the caches work. Reference counting only works when you have
> complete control over the lifetime of both the cache and the referrer.

I'm not familiar with the DB_REGISTER issue, but I assume it somehow
bound itself to, or extended, the effective lifetime of an svn_fs_t.
The DAG node cache is different in that references to it never need
to survive past the boundary of any FS API call. They are merely
there to allow the equivalent of

    fs_api_func()
    {
      /* Careful: requesting one node must not invalidate
         the respective other. */
      do_func(get_node("A", lock_it),
              get_node("B", lock_it));

      /* Nobody cares about A and B any more at this point. */
    }

Local sub-pools would safely limit the lock lifetime and solve the
problem of leaking implementation details (see the sketch in the
P.S. below).

-- Stefan^2.
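P.S.: A rough sketch of that sub-pool idea. dag_node_t, get_node()
and do_func() are simplified stand-ins for the FSFS internals, not
the actual code:

    #include <stdio.h>
    #include <apr_pools.h>

    typedef struct dag_node_t { const char *id; } dag_node_t;

    /* Stand-in for the cache lookup: the real code would pin the
     * node in the L1 DAG node cache and register the lock release
     * as a cleanup on POOL. */
    static dag_node_t *
    get_node(const char *id, apr_pool_t *pool)
    {
      dag_node_t *node = apr_palloc(pool, sizeof(*node));
      node->id = id;
      return node;
    }

    static void
    do_func(dag_node_t *a, dag_node_t *b)
    {
      printf("operating on %s and %s\n", a->id, b->id);
    }

    static void
    fs_api_func(apr_pool_t *pool)
    {
      apr_pool_t *subpool;
      apr_pool_create(&subpool, pool);

      /* Requesting "B" cannot invalidate "A": both stay pinned
       * for exactly as long as SUBPOOL lives. */
      do_func(get_node("A", subpool), get_node("B", subpool));

      /* Nobody cares about A and B beyond this point.  Destroying
       * the sub-pool drops the locks and leaves no trace in the
       * caller's POOL. */
      apr_pool_destroy(subpool);
    }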