On 03.01.2017 15:58, Julian Foad wrote:
https://issues.apache.org/jira/browse/SVN-4667
I am currently contracting for WANdisco to help a customer whose merge is using
excessive RAM. The merge will not complete with 4 GB of RAM available, but will
complete with 5 GB.
The branches involved have subtree mergeinfo on over 3500 files, each referring
to roughly 350 branches on average, with just over one revision range per
mergeinfo line on average. The average path length is under 100 bytes.
What is the result of 'svn pg "svn:mergeinfo" -R | wc -c'?
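For a rough sense of scale from the numbers above: at roughly 100 bytes per
path plus a few bytes per revision range, that is about 3500 x 350 x ~110
bytes, or on the order of 135 MB of raw mergeinfo text.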
This already seems like far too much memory usage for the size of the data set,
and the data set is growing.
Issue #4667 is about reducing the amount of RAM Subversion uses on this data
set. Another way to approach the problem is to reduce the amount of subtree
mergeinfo by changing the workflow practices; that approach is also being
investigated but is out of scope for this issue, except to note that the
tools "svn-mergeinfo-normalizer" and "svn-clean-mergeinfo.pl" both also fail to
run within the available RAM.
You may run svn-mergeinfo-normalizer on arbitrary sub-trees.
A lot of memory will be used to hold that part of the repository
history that is relevant to the branches mentioned in the m/i.
This may easily grow to several GB if there have been tens of
millions of changes.
If the tool manages to read the mergeinfo, it will print m/i
stats before fetching the log. Does it get to this stage?
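For example, something like this (assuming the 'analyze' sub-command, which
should report the stats without modifying the working copy):

  $ svn-mergeinfo-normalizer analyze wc/some/subtree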
The reproduction recipe I'm using so far is attached to the issue. It generates
a repository with N=300 (for example) branches, each with a unique file changed,
and merged to trunk such that trunk gets N files with subtree mergeinfo, each
referring to up to N branches (half of N, on average).
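For N=300 that comes to roughly 300 x 150 = 45,000 branch references in
mergeinfo across the tree.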
I can then run test merges, with debugging prints in them, to view the memory
increase:
# this runs a merge from trunk to branch,
# with WC directory 'A' switched to a branch:
$ (cd obj-dir/subversion/tests/cmdline/svn-test-work/working_copies/mergeinfo_tests-14/ && \
   svn revert -q -R A/ && \
   svn merge -q ^/A A)
DBG: merge.c:12587: using 8+3 MB; increase +2 MB
DBG: merge.c:12418: using 8+25 MB; increase +21 MB
DBG: merge.c:12455: using 8+34 MB; increase +9 MB
DBG: merge.c:9378: using 8+37 MB; increase +3 MB
DBG: merge.c:9378: using 8+43 MB; increase +6 MB
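A minimal sketch of how such a DBG probe can be implemented on Linux, by
reading VmRSS from /proc/self/status (illustrative only; the actual
instrumentation is attached to the issue):

  #include <stdio.h>

  /* Return the current resident set size in kB, or -1 on failure,
     by scanning the VmRSS line of /proc/self/status (Linux only). */
  static long
  rss_kb(void)
  {
    FILE *f = fopen("/proc/self/status", "r");
    char line[128];
    long kb = -1;

    while (f && fgets(line, sizeof(line), f))
      if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
        break;
    if (f)
      fclose(f);
    return kb;
  }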
I don't know how representative this repro-test is of the customer's use case,
but it provides a starting point.
Monitoring the memory usage (RSS on Linux) of the 'svn' process (see the issue
for code used), I find:
original: baseline 8 MB (after process started) + growth of 75 MB
after r1776742: baseline 8 MB + growth of 50 MB
after r1776788: baseline 8 MB + growth of 43 MB
I noticed that the w/c context object seems to use a fluctuating
amount of memory, raising the baseline closer to 16 MB. IOW, your
relative savings may actually be larger.
Those two commits introduce subpools to discard temporary mergeinfo after use.
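The pattern is Subversion's standard iterpool idiom; a minimal sketch of its
shape (illustrative only, not the actual diff of r1776742/r1776788; it assumes
the usual svn_pool_* wrappers, a scratch_pool/result_pool pair, and the
CHILDREN_WITH_MERGEINFO array from merge.c):

  apr_pool_t *iterpool = svn_pool_create(scratch_pool);
  int i;

  for (i = 0; i < children_with_mergeinfo->nelts; i++)
    {
      /* Discard everything allocated for the previous child. */
      svn_pool_clear(iterpool);

      /* Parse and compare this child's mergeinfo in ITERPOOL,
         copying only long-lived results into RESULT_POOL. */
    }
  svn_pool_destroy(iterpool);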
There are no doubt more opportunities to tighten the memory usage with
subpools. This approach may well be useful, but it seems unlikely to deliver
the order-of-magnitude (or order-of-complexity) reduction that will probably
be needed.
I would like to try a different approach. We read, parse and store all the
mergeinfo, whereas I believe our merge algorithm is only interested in the
mergeinfo that refers to one of exactly two branches ('source' and 'target') in
a typical merge. The algorithm never searches the 'graph' of merge ancestry
beyond those two branches. We should be able to read, parse and store only the
mergeinfo we need.
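As a sketch of the filtering idea (hypothetical helper, not an existing
Subversion API): an svn:mergeinfo value is one "path:ranges" line per merge
source, so lines whose source path is irrelevant to the merge could be skipped
before any parsing:

  #include <stdio.h>
  #include <string.h>

  /* Hypothetical sketch: visit only the mergeinfo lines whose source
     path starts with RELEVANT_PREFIX; everything else is skipped
     without being parsed or stored.  RAW is an svn:mergeinfo property
     value, i.e. "path:ranges" lines separated by '\n'. */
  static void
  keep_relevant_mergeinfo(const char *raw, const char *relevant_prefix)
  {
    size_t plen = strlen(relevant_prefix);
    const char *line = raw;

    while (*line)
      {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len > plen && strncmp(line, relevant_prefix, plen) == 0)
          printf("%.*s\n", (int)len, line);  /* real code would parse it here */

        line += len + (eol ? 1 : 0);
      }
  }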
That seems to be the path to take. I would have assumed that we only
need the m/i for the source branch, as the target's m/i is implied:
it is all of the target's history.
Another possible approach could be to store subtree mergeinfo in a "delta" form
relative to a parent path's mergeinfo.
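Purely as an illustration (the notation here is made up, not a proposed
format): where today each subtree repeats full mergeinfo, e.g.

  /trunk         -> /branches/b1:100-200
  /trunk/sub/f.c -> /branches/b1/sub/f.c:100-200,250

a delta form might record only the difference from the parent, e.g.

  /trunk         -> /branches/b1:100-200
  /trunk/sub/f.c -> (as parent) + r250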
I can see two problems here. First, you can only use the new scheme
once all "relevant" clients, i.e. those that merge, have been upgraded.
More importantly, the in-memory data model would need to be something
delta-like. That sounds like a lot of code churn.
-- Stefan^2.