https://issues.apache.org/jira/browse/SVN-4667

I am currently contracting for WANdisco to help a customer whose merges use excessive amounts of RAM. One such merge will not complete with 4 GB of RAM available, but completes with 5 GB.

The branches involved have subtree mergeinfo on over 3500 files; each file's mergeinfo refers to about 350 branches on average, with just over one revision range per mergeinfo line. The average path length is under 100 bytes.

This already seems far too much memory for the size of the data set: 3500 files times ~350 lines is roughly 1.2 million mergeinfo lines, on the order of 150 MB as raw text, yet the merge needs more than 4 GB of RAM. And the data set is growing.

Issue #4667 is about reducing the amount of RAM Subversion uses on this data set. Another way to approach the problem is to reduce the amount of subtree mergeinfo by changing workflow practices; that approach is also being investigated but is out of scope for this issue, except to note that the tools "svn-mergeinfo-normalizer" and "svn-clean-mergeinfo.pl" both also fail to run to completion in the available RAM.

The reproduction recipe I'm using so far is attached to the issue. It generates a repository with N branches (N=300, for example), each changing a unique file, and merges each branch to trunk so that trunk ends up with N files carrying subtree mergeinfo, each referring to up to N branches (N/2 on average).

I can then run test merges, with debugging prints in them, to view the memory increase:

# this runs a merge from trunk to branch,
# with WC directory 'A' switched to a branch:
$ (cd obj-dir/subversion/tests/cmdline/svn-test-work/working_copies/mergeinfo_tests-14/ && \
  svn revert -q -R A/ && \
  svn merge -q ^/A A)
DBG: merge.c:12587: using 8+3 MB; increase +2 MB
DBG: merge.c:12418: using 8+25 MB; increase +21 MB
DBG: merge.c:12455: using 8+34 MB; increase +9 MB
DBG: merge.c:9378: using 8+37 MB; increase +3 MB
DBG: merge.c:9378: using 8+43 MB; increase +6 MB

I don't know how representative this repro-test is of the customer's use case, but it provides a starting point.

Monitoring the memory usage (RSS on Linux) of the 'svn' process (see the issue for code used), I find:

original: baseline 8 MB (after process started) + growth of 75 MB
after r1776742: baseline 8 MB + growth of 50 MB
after r1776788: baseline 8 MB + growth of 43 MB
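
For reference, here is a minimal sketch of how such a debugging print can sample RSS on Linux. The actual code used is attached to the issue, so take this only as an illustration; the function name is made up:

#include <stdio.h>

/* Return this process's resident set size in MB, by scanning the
   "VmRSS:" line of /proc/self/status, or -1 on failure. */
static long
get_rss_mb(void)
{
  FILE *fp = fopen("/proc/self/status", "r");
  char line[256];
  long rss_kb = -1;

  if (!fp)
    return -1;
  while (fgets(line, sizeof(line), fp))
    if (sscanf(line, "VmRSS: %ld kB", &rss_kb) == 1)
      break;
  fclose(fp);
  return rss_kb < 0 ? -1 : rss_kb / 1024;
}

A print such as

  fprintf(stderr, "DBG: %s:%d: using %ld MB\n", __FILE__, __LINE__, get_rss_mb());

then gives output along the lines of the "DBG:" lines shown above.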

Those two commits introduce subpools so that temporary mergeinfo is discarded after use. There are no doubt more opportunities to tighten memory usage with subpools. This approach might be very useful, but it seems unlikely to deliver the order-of-magnitude (or order-of-complexity) reduction that will probably be needed.
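
The general shape of those changes is Subversion's standard "iterpool" idiom. As a sketch of the pattern (not the actual diff of either revision, and with an invented function):

#include <apr_tables.h>
#include "svn_error.h"
#include "svn_pools.h"

/* Process each subtree's mergeinfo in a scratch subpool that is
   cleared on every iteration, so temporary parse results are
   discarded per item instead of accumulating for the whole merge. */
static svn_error_t *
process_subtrees(const apr_array_header_t *subtrees,
                 apr_pool_t *scratch_pool)
{
  apr_pool_t *iterpool = svn_pool_create(scratch_pool);
  int i;

  for (i = 0; i < subtrees->nelts; i++)
    {
      svn_pool_clear(iterpool);
      /* ... read, parse and use one subtree's mergeinfo, making all
         temporary allocations in ITERPOOL ... */
    }

  svn_pool_destroy(iterpool);
  return SVN_NO_ERROR;
}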

I would like to try a different approach. We currently read, parse and store all the mergeinfo, whereas I believe the merge algorithm is only interested in mergeinfo that refers to one of exactly two branches ('source' and 'target') in a typical merge; it never searches the 'graph' of merge ancestry beyond those two branches. We should be able to read, parse and store only the mergeinfo we need.
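
To illustrate, here is roughly what that could look like at the point where a subtree's svn:mergeinfo property is parsed. The helper is my invention, not existing API, and a real fix would ideally teach the parser itself to skip irrelevant lines rather than filtering after a full parse into a scratch pool:

#include <string.h>
#include <apr_hash.h>
#include <apr_strings.h>
#include "svn_error.h"
#include "svn_types.h"
#include "svn_mergeinfo.h"

/* Crude path-ancestry test, standing in for a proper fspath check:
   is PATH equal to PARENT, or a path below it? */
static svn_boolean_t
same_or_child(const char *parent, const char *path)
{
  size_t len = strlen(parent);
  return strncmp(path, parent, len) == 0
         && (path[len] == '\0' || path[len] == '/');
}

/* Parse PROPVAL, but keep (in RESULT_POOL) only the mergeinfo lines
   whose merge source lies on the merge's source or target branch.
   The full parse lives only in SCRATCH_POOL and is discarded. */
static svn_error_t *
parse_relevant_mergeinfo(svn_mergeinfo_t *filtered,
                         const char *propval,
                         const char *source_fspath,  /* e.g. "/A" */
                         const char *target_fspath,
                         apr_pool_t *result_pool,
                         apr_pool_t *scratch_pool)
{
  svn_mergeinfo_t full;
  apr_hash_index_t *hi;

  SVN_ERR(svn_mergeinfo_parse(&full, propval, scratch_pool));

  *filtered = apr_hash_make(result_pool);
  for (hi = apr_hash_first(scratch_pool, full); hi; hi = apr_hash_next(hi))
    {
      const void *key;
      void *val;
      const char *path;

      apr_hash_this(hi, &key, NULL, &val);
      path = key;
      if (same_or_child(source_fspath, path)
          || same_or_child(target_fspath, path))
        apr_hash_set(*filtered, apr_pstrdup(result_pool, path),
                     APR_HASH_KEY_STRING,
                     svn_rangelist_dup(val, result_pool));
    }

  return SVN_NO_ERROR;
}

With this shape, the peak cost is one property's full parse at a time, and the long-lived cost is only the lines the merge actually needs.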

Another possible approach could be to store subtree mergeinfo in a "delta" form relative to a parent path's mergeinfo.
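
As a sketch of that idea using the existing public mergeinfo operations (the struct and helpers are invented; I'm assuming the svn_mergeinfo_diff2/remove2/merge2 semantics from svn_mergeinfo.h):

#include "svn_error.h"
#include "svn_mergeinfo.h"

/* A subtree's mergeinfo stored as a delta against its parent's. */
typedef struct mergeinfo_delta_t
{
  svn_mergeinfo_t deleted;  /* present on the parent, absent here */
  svn_mergeinfo_t added;    /* present here, absent on the parent */
} mergeinfo_delta_t;

/* Store CHILD as a delta against PARENT. */
static svn_error_t *
delta_from_parent(mergeinfo_delta_t *delta,
                  svn_mergeinfo_t parent,
                  svn_mergeinfo_t child,
                  apr_pool_t *result_pool,
                  apr_pool_t *scratch_pool)
{
  return svn_mergeinfo_diff2(&delta->deleted, &delta->added,
                             parent, child, TRUE,
                             result_pool, scratch_pool);
}

/* Reconstruct a child's full mergeinfo: (PARENT - deleted) + added. */
static svn_error_t *
child_from_delta(svn_mergeinfo_t *child,
                 svn_mergeinfo_t parent,
                 const mergeinfo_delta_t *delta,
                 apr_pool_t *result_pool,
                 apr_pool_t *scratch_pool)
{
  SVN_ERR(svn_mergeinfo_remove2(child, delta->deleted, parent, TRUE,
                                result_pool, scratch_pool));
  return svn_mergeinfo_merge2(*child, delta->added,
                              result_pool, scratch_pool);
}

How much this would save depends on how similar a subtree's mergeinfo typically is to its parent's; if most subtree mergeinfo differs from its parent by only a few lines, the saving could be large.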

- Julian
