Re: RFE: API for an efficient retrieval of server-side mergeinfo data
Julian:

Given the required RA protocol changes, when could this change be shipped? What version of SVN?

Thank you.

Doug

On Wed, Feb 19, 2014 at 10:06 AM, Julian Foad julianf...@btopenworld.com wrote:

> Marc Strapetz wrote:
>> Julian Foad wrote:
>>> It looks like we have an agreement in principle. Would you like to file an enhancement issue?
>>
>> Great. I've filed an issue now:
>>
>> http://subversion.tigris.org/issues/show_bug.cgi?id=4469
>>
>> Would you please review the various attributes (Subcomponent, ...)?
>
> That's great, thanks. I added a reference to this email thread, added myself to the CC list, and tweaked the type from 'feature' to 'enhancement' (just my personal interpretation) and schedule from '---' to 'unscheduled' (which just indicates I've thought about it and am stating that it's not currently tied to any particular release, it doesn't mean it has to stay that way).
>
> I talked with Brane about this and we discussed how it might make more sense to do a higher level API. Instead of asking "what is the absolute difference in the mergeinfo representations?" it could ask "What merges and other interesting events have occurred in the lifetime of this path?".
>
> There are a couple of reasons.
>
> The API as sketched so far is pretty straightforward, but even so the effort needed to implement it is not trivial. It requires RA protocol changes as well as all the layers of API change.
>
> The mergeinfo representation is subject to change. It feels like a backward step to invest effort in adding more support that is tied specifically to the current format.
>
> SmartSVN and other front ends like to be able to draw a merge graph. Even the 'svn mergeinfo' command-line command now draws a little ASCII-art graph showing limited information about the most recent merge. At present they all have to interpret mergeinfo themselves, at a pretty low level, and the interpretation is subtle and poorly understood. (I don't understand the edge cases related to adds and deletes properly, and I've been working with it for years.)
>
> So it seems like a good idea to encapsulate the interpretation of mergeinfo a bit more, and expose data in a form that is geared specifically towards explaining the history in the way that users can understand it. Maybe think of it as an extended 'log' operation, adding a small number of new notification types such as:
>
> * there is a full merge into here, bringing in all the new changes from PATH up to REV;
> * there is a partial merge to here, bringing in some changes from PATH between REV1 and REV2;
>
> What do you think of that sort of interface? Does your code already calculate something like that?
>
> - Julian

--
Douglas B. Robinson | *Senior Product Manager*
WANdisco // *Non-Stop Data*
t. 925-396-1125
e. doug.robin...@wandisco.com
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
On 21.02.2014 15:50, Doug Robinson wrote:
> Julian:
>
> Given the required RA protocol changes, when could this change be shipped? What version of SVN?

We treat a protocol extension the same way as an API extension: new protocol-level features can only appear in minor version releases (e.g., 1.9.0 or 1.10.0), and they must be implemented in such a way that they do not affect older clients.

-- Brane

--
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
Doug Robinson wrote:
> Julian:
>
> Given the required RA protocol changes, when could this change be shipped? What version of SVN?

Hi Doug. A change like that could be shipped in a 1.x.0 version.

- Julian

Julian Foad wrote:
> Marc Strapetz wrote:
>> Julian Foad wrote:
>>> It looks like we have an agreement in principle. Would you like to file an enhancement issue?
>>
>> Great. I've filed an issue now:
>>
>> http://subversion.tigris.org/issues/show_bug.cgi?id=4469
>
> [...]
>
> I talked with Brane about this and we discussed how it might make more sense to do a higher level API. [...]
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
On 18.02.2014 15:26, Julian Foad wrote:
> Marc Strapetz wrote:
>> On 17.02.2014 18:36, Julian Foad wrote:
>>> Marc Strapetz wrote:
>>>> Hence an API like the following should work well for us:
>>>>
>>>>   interface MergeinfoDiffCallback {
>>>>     void mergeinfoDiff(int revision,
>>>>                        Map<String, Mergeinfo> pathToAddedMergeinfo,
>>>>                        Map<String, Mergeinfo> pathToRemovedMergeinfo);
>>>>   }
>>>>
>>>>   void getMergeinfoDiff(String rootPath, long fromRev, long toRev,
>>>>                         MergeinfoDiffCallback callback)
>>>>     throws ClientException;
>>>>
>>>> This should give us all mergeinfo which affects any path at or below rootPath.
>>>
>>> [...] let's use the simpler version that's sufficient for your use case.
>>
>> That will be fine.
>>
>> [...]
>> From cache perspective it's easier to build the cache starting at r0: [...]
>>
>> Anyway, I agree that receiving mergeinfo for more recent revisions first is reasonable as well. Hence if you say the effort is the same, then we could allow both: fromRev <= toRev, in which case we will receive mergeinfo in ascending order, and fromRev > toRev, in which case it will be descending order?
>
> Could do. It seems like a relatively minor decision.
>
>>>> [...] important that ranges for which no mergeinfo diff is present will be processed quickly on the server-side, otherwise we could run into some kind of endless loop, if the cache building process is shut down and resumed frequently.
>>>
>>> [...] There is a client-side work-around: request ranges of say a thousand revisions at a time, and then you can easily keep track of how many of these requests have been completed.
>>
>> OK, that will work.
>
> It looks like we have an agreement in principle. Would you like to file an enhancement issue?

Great. I've filed an issue now:

http://subversion.tigris.org/issues/show_bug.cgi?id=4469

Would you please review the various attributes (Subcomponent, ...)?

-Marc
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
Marc Strapetz wrote:
> Julian Foad wrote:
>> It looks like we have an agreement in principle. Would you like to file an enhancement issue?
>
> Great. I've filed an issue now:
>
> http://subversion.tigris.org/issues/show_bug.cgi?id=4469
>
> Would you please review the various attributes (Subcomponent, ...)?

That's great, thanks. I added a reference to this email thread, added myself to the CC list, and tweaked the type from 'feature' to 'enhancement' (just my personal interpretation) and schedule from '---' to 'unscheduled' (which just indicates I've thought about it and am stating that it's not currently tied to any particular release, it doesn't mean it has to stay that way).

I talked with Brane about this and we discussed how it might make more sense to do a higher level API. Instead of asking "what is the absolute difference in the mergeinfo representations?" it could ask "What merges and other interesting events have occurred in the lifetime of this path?".

There are a couple of reasons.

The API as sketched so far is pretty straightforward, but even so the effort needed to implement it is not trivial. It requires RA protocol changes as well as all the layers of API change.

The mergeinfo representation is subject to change. It feels like a backward step to invest effort in adding more support that is tied specifically to the current format.

SmartSVN and other front ends like to be able to draw a merge graph. Even the 'svn mergeinfo' command-line command now draws a little ASCII-art graph showing limited information about the most recent merge. At present they all have to interpret mergeinfo themselves, at a pretty low level, and the interpretation is subtle and poorly understood. (I don't understand the edge cases related to adds and deletes properly, and I've been working with it for years.)

So it seems like a good idea to encapsulate the interpretation of mergeinfo a bit more, and expose data in a form that is geared specifically towards explaining the history in the way that users can understand it. Maybe think of it as an extended 'log' operation, adding a small number of new notification types such as:

* there is a full merge into here, bringing in all the new changes from PATH up to REV;
* there is a partial merge to here, bringing in some changes from PATH between REV1 and REV2;

What do you think of that sort of interface? Does your code already calculate something like that?

- Julian
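Editorial sketch: the two notification types proposed above could be modelled roughly as follows. This is only an illustration of the idea, not anything that exists in Subversion or JavaHL; the class MergeEvent, its Kind values and its factory methods are all hypothetical names, and rev1 is simply set to 0 for a full merge.

```java
// Hypothetical event model for an "extended log" interface. None of these
// names exist in Subversion or JavaHL; they only illustrate the proposal.
final class MergeEvent {
    enum Kind { FULL_MERGE, PARTIAL_MERGE }

    final Kind kind;
    final long revision;     // revision in which the merge was committed
    final String sourcePath; // PATH the changes were merged from
    final long rev1;         // REV1 of a partial merge; 0 for a full merge
    final long rev2;         // REV of a full merge, or REV2 of a partial merge

    private MergeEvent(Kind kind, long revision, String sourcePath,
                       long rev1, long rev2) {
        this.kind = kind;
        this.revision = revision;
        this.sourcePath = sourcePath;
        this.rev1 = rev1;
        this.rev2 = rev2;
    }

    /** "A full merge into here, bringing in all the new changes from PATH up to REV." */
    static MergeEvent fullMerge(long revision, String sourcePath, long upToRev) {
        return new MergeEvent(Kind.FULL_MERGE, revision, sourcePath, 0, upToRev);
    }

    /** "A partial merge to here, bringing in some changes from PATH between REV1 and REV2." */
    static MergeEvent partialMerge(long revision, String sourcePath,
                                   long rev1, long rev2) {
        return new MergeEvent(Kind.PARTIAL_MERGE, revision, sourcePath, rev1, rev2);
    }

    /** Renders the event in the wording used in the proposal. */
    String describe() {
        if (kind == Kind.FULL_MERGE) {
            return "r" + revision
                + ": full merge into here, bringing in all the new changes from "
                + sourcePath + " up to r" + rev2;
        }
        return "r" + revision
            + ": partial merge to here, bringing in some changes from "
            + sourcePath + " between r" + rev1 + " and r" + rev2;
    }
}
```

A front end drawing merge arrows would consume a stream of such events per path instead of parsing svn:mergeinfo itself, which is the encapsulation the message argues for.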
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
On 19.02.2014 16:06, Julian Foad wrote:
> Marc Strapetz wrote:
>> Julian Foad wrote:
>>> It looks like we have an agreement in principle. Would you like to file an enhancement issue?
>>
>> Great. I've filed an issue now:
>>
>> http://subversion.tigris.org/issues/show_bug.cgi?id=4469
>>
>> Would you please review the various attributes (Subcomponent, ...)?
>
> [...]
>
> SmartSVN and other front ends like to be able to draw a merge graph. Even the 'svn mergeinfo' command-line command now draws a little ASCII-art graph showing limited information about the most recent merge. At present they all have to interpret mergeinfo themselves, at a pretty low level, and the interpretation is subtle and poorly understood. (I don't understand the edge cases related to adds and deletes properly, and I've been working with it for years.)
>
> So it seems like a good idea to encapsulate the interpretation of mergeinfo a bit more, and expose data in a form that is geared specifically towards explaining the history in the way that users can understand it. Maybe think of it as an extended 'log' operation, adding a small number of new notification types such as:
>
> * there is a full merge into here, bringing in all the new changes from PATH up to REV;
> * there is a partial merge to here, bringing in some changes from PATH between REV1 and REV2;
>
> What do you think of that sort of interface?

That definitely sounds good. Just to note that the extended-log-information should be easily receivable and cacheable for the entire repository and it must be rich enough to easily extract information for a specific path. Examples:

- allow to include/exclude subtree merges for merge arrows
- allow merge arrow display for sub-directories and individual files

Ultimately, when having received all extended-log-information for all revisions, one should be able to recreate raw svn:mergeinfo for all paths of all revisions. I think this will guarantee that we won't miss any possible use case when defining the protocol and data structures.

> Does your code already calculate something like that?

Yes, and I recall having a hard time when writing this code :)

-Marc
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
On 17.02.2014 18:36, Julian Foad wrote:
> Marc Strapetz wrote:
>>> ... I'll dig into the cache code ...
>>
>> I did that now and the storage is quite simple: we have a main file which contains the diff (added, removed) for every path in every revision and a revision-based index file with constant record length (to quickly locate entries in the main file). This storage allows to efficiently query for the mergeinfo diff for a path in a certain revision. That's sufficient to build the merge arrows. Assembling the complete mergeinfo for a certain revision is hard with this cache, but actually not necessary for our use case.
>>
>> Hence an API like the following should work well for us:
>>
>>   interface MergeinfoDiffCallback {
>>     void mergeinfoDiff(int revision,
>>                        Map<String, Mergeinfo> pathToAddedMergeinfo,
>>                        Map<String, Mergeinfo> pathToRemovedMergeinfo);
>>   }
>>
>>   void getMergeinfoDiff(String rootPath, long fromRev, long toRev,
>>                         MergeinfoDiffCallback callback)
>>     throws ClientException;
>>
>> This should give us all mergeinfo which affects any path at or below rootPath. When disregarding our particular use case, a more consistent API could be:
>>
>>   void getMergeinfoDiff(Iterable<String> paths, long fromRev, long toRev,
>>                         Mergeinfo.Inheritance inherit,
>>                         boolean includeDescendants,
>>                         MergeinfoDiffCallback callback)
>>     throws ClientException;
>
> I want to discourage callers from knowing or caring how the mergeinfo is stored, so I want to leave out the 'inherit' parameter. I also think it makes sense not to offer the options of ignoring descendants (that is, subtree mergeinfo), or specifying multiple paths. After all, this is not a low level API to be used for implementing the mergeinfo subsystem, it's a high level query. So let's use the simpler version that's sufficient for your use case.

That will be fine.

>> The mergeinfo diff should be received starting at fromRev and ending at toRev. No callback is expected if there is no mergeinfo diff for a certain revision. Depending on the server-side storage, we may require to always have fromRev <= toRev or always fromRev >= toRev. If it doesn't matter, better have always fromRev <= toRev (for reasons given below).
>
> The same procedure could work either forwards or backwards, it doesn't really matter as long as you know which way it is going. Often it is useful to know about the more recent changes first, and have the option to look back right to revision 0 if necessary.

From cache perspective it's easier to build the cache starting at r0: then cache files will contain information for older revisions at lower positions. This allows to crop files easily at a certain revision and rebuild them from there. That's something we do if a log message is modified from within the GUI (it might not play a role for mergeinfo, though).

Anyway, I agree that receiving mergeinfo for more recent revisions first is reasonable as well. Hence if you say the effort is the same, then we could allow both: fromRev <= toRev, in which case we will receive mergeinfo in ascending order, and fromRev > toRev, in which case it will be descending order?

>> Regarding the usage, let's assume always fromRev <= toRev, then we will invoke
>>
>>   getMergeinfoDiff(cacheRoot, 0, head, callback)
>>
>> This should start returning mergeinfo diff immediately, starting at revision 0, so we quickly make at least a bit of progress. Now, if the cache building process is shut down and restarted later, it will resume with the latest known revision:
>>
>>   getMergeinfoDiff(cacheRoot, latestKnownRevision, head, callback)
>>
>> This procedure will be performed until we have caught up with head. Note that the latestKnownRevision is the last revision for which we have received a callback. Depending on the server-side storage, this may be different from the revision which the server is currently processing at the time the cache building process is shut down. Hence it will be important that ranges for which no mergeinfo diff is present will be processed quickly on the server-side, otherwise we could run into some kind of endless loop, if the cache building process is shut down and resumed frequently.
>
> Yes -- if the server takes a long time to work its way through a large range of (say a million) revisions where there are no mergeinfo changes, there is no graceful way to stop the procedure part way through, and no way to discover how far it has searched when you kill it. Maybe that is not important. There is a client-side work-around: request ranges of say a thousand revisions at a time, and then you can easily keep track of how many of these requests have been completed.

OK, that will work.

-Marc
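Editorial sketch: the cache layout described in this message (a main file holding the per-revision diffs plus a revision-based index with constant record length) can be illustrated with in-memory buffers. This is an assumption-laden toy, not SmartSVN's actual code: the class name MergeinfoDiffCache, the 12-byte record layout (8-byte offset plus 4-byte length) and the use of plain strings for the diff payload are all made up for illustration, and a real cache would use files on disk.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical toy version of the described storage: one "main" stream with
// the serialized diffs and one index with a fixed-size record per revision.
final class MergeinfoDiffCache {
    // Assumed record layout: 8-byte offset into main + 4-byte payload length.
    private static final int RECORD_SIZE = Long.BYTES + Integer.BYTES;

    private final ByteArrayOutputStream main = new ByteArrayOutputStream();
    private final ByteArrayOutputStream index = new ByteArrayOutputStream();
    private long nextRevision = 0;

    /** Appends the diff for the next revision; pass "" if nothing changed. */
    void append(String diffText) {
        byte[] data = diffText.getBytes(StandardCharsets.UTF_8);
        ByteBuffer record = ByteBuffer.allocate(RECORD_SIZE);
        record.putLong(main.size()).putInt(data.length);
        index.write(record.array(), 0, RECORD_SIZE);
        main.write(data, 0, data.length);
        nextRevision++;
    }

    /** Looks up a revision by seeking to revision * RECORD_SIZE in the index. */
    String read(long revision) {
        if (revision < 0 || revision >= nextRevision) {
            throw new IllegalArgumentException("revision not cached: " + revision);
        }
        ByteBuffer record = ByteBuffer.wrap(
            index.toByteArray(), (int) (revision * RECORD_SIZE), RECORD_SIZE);
        long offset = record.getLong();
        int length = record.getInt();
        return new String(main.toByteArray(), (int) offset, length,
                          StandardCharsets.UTF_8);
    }
}
```

Because records are appended in revision order, older revisions sit at lower positions, which is what makes cropping the files at a given revision and rebuilding from there straightforward in the layout described above.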
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
Marc Strapetz wrote:
> On 17.02.2014 18:36, Julian Foad wrote:
>> Marc Strapetz wrote:
>>> Hence an API like the following should work well for us:
>>>
>>>   interface MergeinfoDiffCallback {
>>>     void mergeinfoDiff(int revision,
>>>                        Map<String, Mergeinfo> pathToAddedMergeinfo,
>>>                        Map<String, Mergeinfo> pathToRemovedMergeinfo);
>>>   }
>>>
>>>   void getMergeinfoDiff(String rootPath, long fromRev, long toRev,
>>>                         MergeinfoDiffCallback callback)
>>>     throws ClientException;
>>>
>>> This should give us all mergeinfo which affects any path at or below rootPath.
>>
>> [...] let's use the simpler version that's sufficient for your use case.
>
> That will be fine.
>
> [...]
> From cache perspective it's easier to build the cache starting at r0: [...]
>
> Anyway, I agree that receiving mergeinfo for more recent revisions first is reasonable as well. Hence if you say the effort is the same, then we could allow both: fromRev <= toRev, in which case we will receive mergeinfo in ascending order, and fromRev > toRev, in which case it will be descending order?

Could do. It seems like a relatively minor decision.

>>> [...] important that ranges for which no mergeinfo diff is present will be processed quickly on the server-side, otherwise we could run into some kind of endless loop, if the cache building process is shut down and resumed frequently.
>>
>> [...] There is a client-side work-around: request ranges of say a thousand revisions at a time, and then you can easily keep track of how many of these requests have been completed.
>
> OK, that will work.

It looks like we have an agreement in principle. Would you like to file an enhancement issue?

http://subversion.tigris.org/issues/

When you are logged in, that page includes links for filing a new issue.

Please note that filing an issue doesn't affect whether or when the work will be done, but it's useful as a central place to refer to the task. Do you have the resources to work on implementing this or are you looking for a volunteer?

- Julian
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
Marc Strapetz wrote:
>> ... I'll dig into the cache code ...
>
> I did that now and the storage is quite simple: we have a main file which contains the diff (added, removed) for every path in every revision and a revision-based index file with constant record length (to quickly locate entries in the main file). This storage allows to efficiently query for the mergeinfo diff for a path in a certain revision. That's sufficient to build the merge arrows. Assembling the complete mergeinfo for a certain revision is hard with this cache, but actually not necessary for our use case.
>
> Hence an API like the following should work well for us:
>
>   interface MergeinfoDiffCallback {
>     void mergeinfoDiff(int revision,
>                        Map<String, Mergeinfo> pathToAddedMergeinfo,
>                        Map<String, Mergeinfo> pathToRemovedMergeinfo);
>   }
>
>   void getMergeinfoDiff(String rootPath, long fromRev, long toRev,
>                         MergeinfoDiffCallback callback)
>     throws ClientException;
>
> This should give us all mergeinfo which affects any path at or below rootPath. When disregarding our particular use case, a more consistent API could be:
>
>   void getMergeinfoDiff(Iterable<String> paths, long fromRev, long toRev,
>                         Mergeinfo.Inheritance inherit,
>                         boolean includeDescendants,
>                         MergeinfoDiffCallback callback)
>     throws ClientException;

I want to discourage callers from knowing or caring how the mergeinfo is stored, so I want to leave out the 'inherit' parameter. I also think it makes sense not to offer the options of ignoring descendants (that is, subtree mergeinfo), or specifying multiple paths. After all, this is not a low level API to be used for implementing the mergeinfo subsystem, it's a high level query. So let's use the simpler version that's sufficient for your use case.

> The mergeinfo diff should be received starting at fromRev and ending at toRev. No callback is expected if there is no mergeinfo diff for a certain revision. Depending on the server-side storage, we may require to always have fromRev <= toRev or always fromRev >= toRev. If it doesn't matter, better have always fromRev <= toRev (for reasons given below).

The same procedure could work either forwards or backwards, it doesn't really matter as long as you know which way it is going. Often it is useful to know about the more recent changes first, and have the option to look back right to revision 0 if necessary.

> Regarding the usage, let's assume always fromRev <= toRev, then we will invoke
>
>   getMergeinfoDiff(cacheRoot, 0, head, callback)
>
> This should start returning mergeinfo diff immediately, starting at revision 0, so we quickly make at least a bit of progress. Now, if the cache building process is shut down and restarted later, it will resume with the latest known revision:
>
>   getMergeinfoDiff(cacheRoot, latestKnownRevision, head, callback)
>
> This procedure will be performed until we have caught up with head. Note that the latestKnownRevision is the last revision for which we have received a callback. Depending on the server-side storage, this may be different from the revision which the server is currently processing at the time the cache building process is shut down. Hence it will be important that ranges for which no mergeinfo diff is present will be processed quickly on the server-side, otherwise we could run into some kind of endless loop, if the cache building process is shut down and resumed frequently.

Yes -- if the server takes a long time to work its way through a large range of (say a million) revisions where there are no mergeinfo changes, there is no graceful way to stop the procedure part way through, and no way to discover how far it has searched when you kill it. Maybe that is not important. There is a client-side work-around: request ranges of say a thousand revisions at a time, and then you can easily keep track of how many of these requests have been completed.

OK, that sounds good enough.

- Julian
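Editorial sketch: the client-side work-around mentioned at the end of this message (request a bounded range of revisions at a time and remember which ranges completed) might look roughly like this. The proposed getMergeinfoDiff() call does not exist yet, so it is represented by a hypothetical RangeFetcher callback; ChunkedFetch, split and fetchAll are likewise illustrative names, and only ascending ranges (fromRev <= toRev) are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for fetching mergeinfo diffs in bounded chunks.
final class ChunkedFetch {
    /** Stand-in for one call to the proposed getMergeinfoDiff(root, from, to, cb). */
    interface RangeFetcher {
        void fetch(long fromRev, long toRev) throws Exception;
    }

    /** Splits [fromRev, toRev] into ascending ranges of at most chunkSize revisions. */
    static List<long[]> split(long fromRev, long toRev, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long start = fromRev; start <= toRev; start += chunkSize) {
            ranges.add(new long[] { start, Math.min(start + chunkSize - 1, toRev) });
        }
        return ranges;
    }

    /**
     * Fetches chunk by chunk and returns the last revision of the last chunk
     * that completed, or fromRev - 1 if none did. A later run can resume at
     * the returned value + 1, even if the skipped ranges held no mergeinfo.
     */
    static long fetchAll(long fromRev, long head, long chunkSize, RangeFetcher fetcher) {
        long lastCompleted = fromRev - 1;
        for (long[] range : split(fromRev, head, chunkSize)) {
            try {
                fetcher.fetch(range[0], range[1]);
            } catch (Exception e) {
                break; // interrupted or failed: stop and report progress so far
            }
            lastCompleted = range[1];
        }
        return lastCompleted;
    }
}
```

The point of tracking completed chunks rather than the last callback is that progress is recorded even across long stretches of revisions with no mergeinfo changes, which avoids the restart loop discussed above.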
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
I took a stab at writing the JavaHL boiler-plate code for this, as attached, though I'm unfamiliar with JavaHL. It seems to require modifying 5 java files and creating 3 new ones. Is that right, JavaHL experts? It seems a lot.

The implementation in the core library is empty, as yet, in the attached patch.

- Julian

> interface MergeinfoDiffCallback {
>   void mergeinfoDiff(int revision,
>                      Map<String, Mergeinfo> pathToAddedMergeinfo,
>                      Map<String, Mergeinfo> pathToRemovedMergeinfo);
> }
>
> void getMergeinfoDiff(String rootPath, long fromRev, long toRev,
>                       MergeinfoDiffCallback callback)
>   throws ClientException;

Add boiler-plate code in JavaHL for a new API to get per-revision mergeinfo diffs.

Suggested by: Marc Strapetz marc.strapetz{_AT_}syntevo.com

* subversion/include/svn_ra.h,
  subversion/libsvn_ra/ra_loader.c
  (svn_ra_get_mergeinfo): New function, with an empty implementation.

In subversion/bindings/javahl/:

* native/MergeinfoDiffCallback.h,
  native/MergeinfoDiffCallback.cpp
  New files, copied from LogMessageCallback.* and adjusted.

* native/org_apache_subversion_javahl_remote_RemoteSession.cpp
  (Java_org_apache_subversion_javahl_remote_RemoteSession_getMergeinfoDiffs): New function.

* native/RemoteSession.h,
  native/RemoteSession.cpp
  (getMergeinfoDiffs): New method.

* src/org/apache/subversion/javahl/callback/MergeinfoDiffCallback.java
  New file, copied from LogMessageCallback.java and adjusted.

* src/org/apache/subversion/javahl/ISVNRemote.java
  (getMergeinfoDiffs): New method.

* src/org/apache/subversion/javahl/remote/RemoteSession.java
  (svn_ra_get_mergeinfo_diffs): New function.
--This line, and those below, will be ignored--

Index: subversion/bindings/javahl/native/MergeinfoDiffCallback.cpp
===
--- subversion/bindings/javahl/native/MergeinfoDiffCallback.cpp (revision 1568992)
+++ subversion/bindings/javahl/native/MergeinfoDiffCallback.cpp (working copy)
@@ -17,60 +17,65 @@
  *    KIND, either express or implied.  See the License for the
  *    specific language governing permissions and limitations
  *    under the License.
  *
  * @endcopyright
  *
- * @file LogMessageCallback.cpp
- * @brief Implementation of the class LogMessageCallback
+ * @file MergeinfoDiffCallback.cpp
+ * @brief Implementation of the class MergeinfoDiffCallback
  */
 
-#include "LogMessageCallback.h"
+#include "MergeinfoDiffCallback.h"
 #include "CreateJ.h"
 #include "EnumMapper.h"
 #include "JNIUtil.h"
 #include "svn_time.h"
 #include "svn_sorts.h"
 #include "svn_compat.h"
 
 /**
- * Create a LogMessageCallback object
+ * Create a MergeinfoDiffCallback object
  * @param jcallback the Java callback object.
  */
-LogMessageCallback::LogMessageCallback(jobject jcallback)
+MergeinfoDiffCallback::MergeinfoDiffCallback(jobject jcallback)
 {
   m_callback = jcallback;
 }
 
 /**
- * Destroy a LogMessageCallback object
+ * Destroy a MergeinfoDiffCallback object
  */
-LogMessageCallback::~LogMessageCallback()
+MergeinfoDiffCallback::~MergeinfoDiffCallback()
 {
   // The m_callback does not need to be destroyed because it is the
-  // passed in parameter to the Java SVNClientInterface.logMessages
+  // passed in parameter to the Java ISVNRemote.getMergeinfoDiffs
   // method.
 }
 
 svn_error_t *
-LogMessageCallback::callback(void *baton,
-                             svn_log_entry_t *log_entry,
-                             apr_pool_t *pool)
+MergeinfoDiffCallback::callback(void *baton,
+                                svn_revnum_t revision,
+                                svn_mergeinfo_t *added_mergeinfo,
+                                svn_mergeinfo_t *deleted_mergeinfo,
+                                apr_pool_t *pool)
 {
   if (baton)
-    return static_cast<LogMessageCallback *>(baton)->singleMessage(
-            log_entry, pool);
+    return static_cast<MergeinfoDiffCallback *>(baton)->singleMessage(
+            revision, added_mergeinfo, deleted_mergeinfo, pool);
 
   return SVN_NO_ERROR;
 }
 
 /**
- * Callback called for a single log message
+ * Callback called for a single mergeinfo diff
  */
 svn_error_t *
-LogMessageCallback::singleMessage(svn_log_entry_t *log_entry, apr_pool_t *pool)
+MergeinfoDiffCallback::singleMessage(svn_revnum_t revision,
+                                     svn_mergeinfo_t *added_mergeinfo,
+                                     svn_mergeinfo_t *deleted_mergeinfo,
+                                     apr_pool_t *pool)
 {
   JNIEnv *env = JNIUtil::getEnv();
 
   // Create a local frame for our references
   env->PushLocalFrame(LOCAL_FRAME_SIZE);
   if (JNIUtil::isJavaExceptionThrown())
@@ -78,55 +83,41 @@ LogMessageCallback::singleMessage(svn_lo
   // The method id will not change during the time this library is
   // loaded, so it can be cached.
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
On 17.02.2014 22:25, Julian Foad wrote:
> I took a stab at writing the JavaHL boiler-plate code for this, as attached, though I'm unfamiliar with JavaHL. It seems to require modifying 5 java files and creating 3 new ones. Is that right, JavaHL experts? It seems a lot.

It's about right. Welcome to Java and JNI.

If this were a real attempt, we'd want to use the new jniwrapper for the native code; see, for example, NativeStream.hpp/.cpp.

-- Brane

--
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
Marc Strapetz wrote:
> For SmartSVN we are optionally displaying merge arrows in the Revision Graph. Here is a sample image, how this looks like:
>
> http://imgur.com/MzrLq00
>
> From the JavaHL sources I understand that there is currently only one method to retrieve server-side mergeinfo and this one works on a single revision only:
>
>   Map<String, Mergeinfo> getMergeinfo(Iterable<String> paths,
>                                       long revision,
>                                       Mergeinfo.Inheritance inherit,
>                                       boolean includeDescendants)

Right. This is a wrapper around the core library function svn_ra_get_mergeinfo().

> This makes the Merge Arrow feature practically unusable for larger graphs. To improve performance, in earlier versions we were using a client-side mergeinfo cache (similar as the main log-cache, which TSVN is using as well). However, populating this cache (i.e. querying for mergeinfo for *every* revision of the repository) often resulted in bringing the entire Apache server down, especially if many users were building their log cache at the same time.
>
> To address these problems, it would be great to have a more powerful API, which allows either to retrieve all mergeinfo for a *revision range* or for a *set of revisions*.

The request for a more powerful API certainly makes sense, but what form of API?

In the Subversion project source code:

  # How many lines/bytes of mergeinfo in trunk, right now?
  $ svn pg -R svn:mergeinfo | wc -lc
  245 24063

  # How many branches and tags?
  $ svn ls ^/subversion/tags/ ^/subversion/branches/ | wc -l
  288

  # Approx. total lines/bytes mergeinfo per revision?
  $ echo $((245 * 289)) $((24063 * 289))
  70805 6954207

So in each revision there are roughly 70,000 lines of mergeinfo, occupying 7 MB in plain text representation.

The mergeinfo properties change whenever a merge is done. All other commits leave all the mergeinfo unchanged. So mergeinfo is unchanged in, what, 99% of revisions?

It doesn't seem logical to simply request all the mergeinfo for each revision in turn, and return it all in raw form.

Can we think of a better way to design the API so that it returns the interesting data without all the redundancy?

Basically I think we want to describe changes to mergeinfo, rather than raw mergeinfo.

- Julian

> Querying a set of revisions would be more flexible and would allow to generate merge arrows on the fly. On the other hand, to alleviate the server, it's desirable to cache retrieved mergeinfo on the client-side anyway, hence a range query would be fine as well.
>
> -Marc
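Editorial sketch: to make "describe changes to mergeinfo, rather than raw mergeinfo" concrete, here is a deliberately simplified model in which the mergeinfo of one path is a map from merge-source path to the set of merged revision numbers. Real svn:mergeinfo uses revision ranges and has further subtleties (inheritance, non-inheritable ranges) that this ignores; MergeinfoDelta and its methods are hypothetical names, not part of Subversion or JavaHL.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical, simplified mergeinfo diff: source path -> merged revisions.
final class MergeinfoDelta {
    /** Revisions present in 'a' but not in 'b', per source path. */
    static Map<String, Set<Long>> minus(Map<String, Set<Long>> a,
                                        Map<String, Set<Long>> b) {
        Map<String, Set<Long>> result = new TreeMap<>();
        for (Map.Entry<String, Set<Long>> entry : a.entrySet()) {
            Set<Long> revs = new TreeSet<>(entry.getValue());
            Set<Long> other = b.get(entry.getKey());
            if (other != null) {
                revs.removeAll(other);
            }
            if (!revs.isEmpty()) {
                result.put(entry.getKey(), revs);
            }
        }
        return result;
    }

    /** Mergeinfo added between an older and a newer revision. */
    static Map<String, Set<Long>> added(Map<String, Set<Long>> older,
                                        Map<String, Set<Long>> newer) {
        return minus(newer, older);
    }

    /** Mergeinfo removed between an older and a newer revision. */
    static Map<String, Set<Long>> removed(Map<String, Set<Long>> older,
                                          Map<String, Set<Long>> newer) {
        return minus(older, newer);
    }
}
```

For the large majority of revisions, where no merge happened, both results are empty, so nothing needs to be sent at all; that is the redundancy the message wants the API to avoid.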
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
On 14.02.2014 11:38, Julian Foad wrote:
> Marc Strapetz wrote:
>> For SmartSVN we are optionally displaying merge arrows in the Revision Graph. Here is a sample image, how this looks like:
>>
>> http://imgur.com/MzrLq00
>>
>> From the JavaHL sources I understand that there is currently only one method to retrieve server-side mergeinfo and this one works on a single revision only:
>>
>>   Map<String, Mergeinfo> getMergeinfo(Iterable<String> paths,
>>                                       long revision,
>>                                       Mergeinfo.Inheritance inherit,
>>                                       boolean includeDescendants)
>
> Right. This is a wrapper around the core library function svn_ra_get_mergeinfo().
>
>> This makes the Merge Arrow feature practically unusable for larger graphs. To improve performance, in earlier versions we were using a client-side mergeinfo cache (similar as the main log-cache, which TSVN is using as well). However, populating this cache (i.e. querying for mergeinfo for *every* revision of the repository) often resulted in bringing the entire Apache server down, especially if many users were building their log cache at the same time.
>>
>> To address these problems, it would be great to have a more powerful API, which allows either to retrieve all mergeinfo for a *revision range* or for a *set of revisions*.
>
> The request for a more powerful API certainly makes sense, but what form of API?
>
> In the Subversion project source code:
>
>   # How many lines/bytes of mergeinfo in trunk, right now?
>   $ svn pg -R svn:mergeinfo | wc -lc
>   245 24063
>
>   # How many branches and tags?
>   $ svn ls ^/subversion/tags/ ^/subversion/branches/ | wc -l
>   288
>
>   # Approx. total lines/bytes mergeinfo per revision?
>   $ echo $((245 * 289)) $((24063 * 289))
>   70805 6954207
>
> So in each revision there are roughly 70,000 lines of mergeinfo, occupying 7 MB in plain text representation.
>
> The mergeinfo properties change whenever a merge is done. All other commits leave all the mergeinfo unchanged. So mergeinfo is unchanged in, what, 99% of revisions?
>
> It doesn't seem logical to simply request all the mergeinfo for each revision in turn, and return it all in raw form.
>
> Can we think of a better way to design the API so that it returns the interesting data without all the redundancy?
>
> Basically I think we want to describe changes to mergeinfo, rather than raw mergeinfo.

I wonder, Julian, could something like this be useful for improving merge in general? We know that clients can cache most of the mergeinfo in the repository, if they want to; I just don't have any feeling for how much sense it would make to maintain such a cache, and if it can be made smart enough to speed up merging significantly.

-- Brane

--
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. br...@wandisco.com
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
I (Julian Foad) wrote: Can we think of a better way to design the API so that it returns the interesting data without all the redundancy? Basically I think we want to describe changes to mergeinfo, rather than raw mergeinfo. Marc, Perhaps a better way to ask the question is: Can I encourage you to write the API that you want? You already designed a cache for the data. What is the shape of the data in your cache, and can the API get the data you want in the form you want it, directly? We'd be glad to help implement it. Even if you start with an API which simply iterates over a range of revisions, at least that would allow for the possibility of improving the efficiency internally at various layers. - Julian
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
On 14.02.2014 11:38, Julian Foad wrote:
It doesn't seem logical to simply request all the mergeinfo for each revision in turn, and return it all in raw form. Can we think of a better way to design the API so that it returns the interesting data without all the redundancy? Basically I think we want to describe changes to mergeinfo, rather than raw mergeinfo.

True, on the client side we are actually interested in the diff anyway. So some kind of callback:

    interface MergeInfoDiffCallback {
        void mergeInfoDiff(int revision, Mergeinfo added, Mergeinfo removed);
    }

would be convenient. This would work for revision ranges as well as for a set of revisions.

-Marc
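A minimal sketch of what "describing changes to mergeinfo" could mean for one commit, assuming a deliberately simplified model in which mergeinfo is a map from merge source path to a set of merged revision numbers. The class and method names here (MergeinfoDiffSketch, Diff, diff) are hypothetical and are not part of JavaHL; the real Mergeinfo type works with revision ranges rather than plain sets.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Simplified stand-in for mergeinfo: merge source path -> merged revisions. */
final class MergeinfoDiffSketch {

    /** Hypothetical result holder: what one commit added to and removed from mergeinfo. */
    static final class Diff {
        final Map<String, Set<Long>> added = new TreeMap<>();
        final Map<String, Set<Long>> removed = new TreeMap<>();

        boolean isEmpty() {
            return added.isEmpty() && removed.isEmpty();
        }
    }

    /** Compares the mergeinfo before and after a commit, per merge source. */
    static Diff diff(Map<String, Set<Long>> before, Map<String, Set<Long>> after) {
        Diff result = new Diff();
        Set<String> sources = new TreeSet<>(before.keySet());
        sources.addAll(after.keySet());
        for (String source : sources) {
            Set<Long> oldRevs = before.getOrDefault(source, Collections.emptySet());
            Set<Long> newRevs = after.getOrDefault(source, Collections.emptySet());

            // Revisions that appear only in the newer mergeinfo were added.
            Set<Long> addedRevs = new TreeSet<>(newRevs);
            addedRevs.removeAll(oldRevs);
            if (!addedRevs.isEmpty()) {
                result.added.put(source, addedRevs);
            }

            // Revisions that appear only in the older mergeinfo were removed.
            Set<Long> removedRevs = new TreeSet<>(oldRevs);
            removedRevs.removeAll(newRevs);
            if (!removedRevs.isEmpty()) {
                result.removed.put(source, removedRevs);
            }
        }
        return result;
    }
}
```

Under this model a commit that is not a merge yields an empty Diff, so a server that reports only non-empty diffs would stay silent for the large majority of revisions.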
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
On 14.02.2014 14:09, Julian Foad wrote:
Can we think of a better way to design the API so that it returns the interesting data without all the redundancy? Basically I think we want to describe changes to mergeinfo, rather than raw mergeinfo. Marc, Perhaps a better way to ask the question is: Can I encourage you to write the API that you want? You already designed a cache for the data. What is the shape of the data in your cache, and can the API get the data you want in the form you want it, directly? We'd be glad to help implement it. Even if you start with an API which simply iterates over a range of revisions, at least that would allow for the possibility of improving the efficiency internally at various layers.

Looks like our emails have crossed :) I'll dig into the cache code and will try to come back with a more detailed API suggestion soon.

-Marc
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
Marc Strapetz wrote:
Looks like our emails have crossed :) I'll dig into the cache code and will try to come back with a more detailed API suggestion soon.

Excellent! Thanks.

- Julian
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
Branko Čibej wrote:
On 14.02.2014 11:38, Julian Foad wrote:
Can we think of a better way to design the API so that it returns the interesting data without all the redundancy? Basically I think we want to describe changes to mergeinfo, rather than raw mergeinfo.

I wonder, Julian, could something like this be useful for merge in general? We know that clients can cache most of the mergeinfo in the repository, if they want to; I just don't have any feeling for how much sense it would make to maintain such a cache, and if it can be made smart enough to speed up merging significantly.

I wasn't sure how much mergeinfo we fetch in a typical merge so I tried some merges with current svn branches. They all fetched mergeinfo either two or three times, all at the head revision, and the time taken to fetch it was not a substantial portion of the overall merge time. So I think the answer is we wouldn't currently benefit from this within the scope of one merge. (A persistent cache on the client machine is a different matter.)

- Julian
Re: RFE: API for an efficient retrieval of server-side mergeinfo data
On 14.02.2014 14:18, Marc Strapetz wrote:
Looks like our emails have crossed :) I'll dig into the cache code and will try to come back with a more detailed API suggestion soon.

I did that now and the storage is quite simple: we have a main file which contains the diff (added, removed) for every path in every revision, and a revision-based index file with constant record length (to quickly locate entries in the main file). This storage allows us to efficiently query the mergeinfo diff for a path in a certain revision. That's sufficient to build the merge arrows. Assembling the complete mergeinfo for a certain revision is hard with this cache, but actually not necessary for our use case. Hence an API like the following should work well for us:

    interface MergeinfoDiffCallback {
        void mergeinfoDiff(int revision, Map<String, Mergeinfo> pathToAddedMergeinfo, Map<String, Mergeinfo> pathToRemovedMergeinfo);
    }

    void getMergeinfoDiff(String rootPath, long fromRev, long toRev, MergeinfoDiffCallback callback) throws ClientException;

This should give us all mergeinfo which affects any path at or below rootPath.
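The cache layout described above (variable-length diff data plus an index with constant record length) can be sketched roughly as follows. This is an in-memory illustration only, not SmartSVN's actual format: the class name RevisionIndexSketch, the 12-byte record (8-byte offset plus 4-byte length) and the use of strings for serialized diffs are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** In-memory sketch: variable-length "main file" plus fixed-length index records. */
final class RevisionIndexSketch {

    /** Assumed record layout: one long offset followed by one int length. */
    static final int RECORD_LENGTH = Long.BYTES + Integer.BYTES;

    private final ByteBuffer index;      // stands in for the index file
    private byte[] data = new byte[0];   // stands in for the main file

    RevisionIndexSketch(int maxRevisions) {
        // A freshly allocated buffer is zero-filled, so length 0 means "no entry".
        index = ByteBuffer.allocate(maxRevisions * RECORD_LENGTH);
    }

    /** Appends the serialized diff for a revision and records where it lives. */
    void put(int revision, String serializedDiff) {
        byte[] bytes = serializedDiff.getBytes(StandardCharsets.UTF_8);
        int offset = data.length;
        data = Arrays.copyOf(data, offset + bytes.length);
        System.arraycopy(bytes, 0, data, offset, bytes.length);
        index.putLong(revision * RECORD_LENGTH, offset);
        index.putInt(revision * RECORD_LENGTH + Long.BYTES, bytes.length);
    }

    /** Finds a revision's diff with a single index read; null if none is stored. */
    String get(int revision) {
        int length = index.getInt(revision * RECORD_LENGTH + Long.BYTES);
        if (length == 0) {
            return null;
        }
        long offset = index.getLong(revision * RECORD_LENGTH);
        return new String(data, (int) offset, length, StandardCharsets.UTF_8);
    }
}
```

Because every index record has the same size, the position of the record for revision N is simply N * RECORD_LENGTH, which is what makes the per-revision lookup cheap. It also shows why assembling the complete mergeinfo for one revision is awkward with such a cache: that would require replaying every earlier diff.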
When disregarding our particular use case, a more consistent API could be:

    void getMergeinfoDiff(Iterable<String> paths, long fromRev, long toRev, Mergeinfo.Inheritance inherit, boolean includeDescendants, MergeinfoDiffCallback callback) throws ClientException;

The mergeinfo diff should be received starting at fromRev and ending at toRev. No callback is expected if there is no mergeinfo diff for a certain revision. Depending on the server-side storage, we may need to require always fromRev <= toRev or always fromRev >= toRev. If it doesn't matter, better to always have fromRev <= toRev (for reasons given below).

Regarding the usage, let's assume always fromRev <= toRev; then we will invoke:

    getMergeinfoDiff(cacheRoot, 0, head, callback)

This should start returning mergeinfo diffs immediately, starting at revision 0, so we quickly make at least a bit of progress. Now, if the cache building process is shut down and restarted later, it will resume with the latest known revision:

    getMergeinfoDiff(cacheRoot, latestKnownRevision, head, callback)

This procedure will be performed until we have caught up with head. Note that latestKnownRevision is the last revision for which we have received a callback. Depending on the server-side storage, this may differ from the revision the server is processing at the time the cache building process is shut down. Hence it will be important that ranges for which no mergeinfo diff is present are processed quickly on the server side; otherwise we could run into some kind of endless loop if the cache building process is shut down and resumed frequently.

-Marc
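The resume procedure can be illustrated with a small simulation. Everything here is hypothetical: the callback is reduced to the revision number, the static getMergeinfoDiff below is a local stand-in for the proposed server call (it just walks a sorted set of revisions that changed mergeinfo), and a shutdown is simulated by throwing from the callback after a fixed number of deliveries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Simulates building the cache across several interrupted sessions. */
final class ResumableCacheBuildSketch {

    /** Hypothetical callback, reduced to the revision number for brevity. */
    interface DiffCallback {
        void mergeinfoDiff(long revision);
    }

    /** Thrown by the callback to simulate the client being shut down. */
    static final class Shutdown extends RuntimeException {
    }

    /** Stand-in for the proposed call: only revisions with a diff trigger a callback. */
    static void getMergeinfoDiff(SortedSet<Long> changedRevisions,
                                 long fromRev, long toRev, DiffCallback callback) {
        for (long revision : changedRevisions.subSet(fromRev, toRev + 1)) {
            callback.mergeinfoDiff(revision);
        }
    }

    long latestKnownRevision = -1;               // would be persisted with the cache
    final List<Long> cached = new ArrayList<>(); // revisions whose diff has been stored

    /** Runs one session; returns true once the cache has caught up with head. */
    boolean runSession(SortedSet<Long> server, long head, int maxCallbacks) {
        int[] remaining = {maxCallbacks};
        try {
            getMergeinfoDiff(server, Math.max(0, latestKnownRevision), head, revision -> {
                if (revision <= latestKnownRevision) {
                    return; // resuming at latestKnownRevision re-delivers it; skip
                }
                if (remaining[0]-- == 0) {
                    throw new Shutdown();
                }
                cached.add(revision);
                latestKnownRevision = revision;
            });
            return true;
        } catch (Shutdown stop) {
            return false;
        }
    }
}
```

Progress is recorded only when a callback arrives. If a session were interrupted while the server was still scanning a long stretch of revisions without mergeinfo changes, the next session would start again from the same latestKnownRevision, which is the loop the message warns about.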