Re: svn commit: r982057 - in /subversion/branches/performance/subversion/svnserve: main.c server.h

2010-08-04 Thread Johan Corveleyn
On Wed, Aug 4, 2010 at 12:46 AM,  stef...@apache.org wrote:
 Author: stefan2
 Date: Tue Aug  3 22:46:10 2010
 New Revision: 982057

 URL: http://svn.apache.org/viewvc?rev=982057&view=rev
 Log:
 Add compression, memory-cache-size and open-file-count
 command line parameters to svnserve. The latter two are only
 available (on the CL) if FSFS is supported.

Just wondering: are these features only relevant for svnserve, or
could they also be applicable (or a variation) to mod_dav_svn? If the
latter, do you plan to add a way to configure those options for
mod_dav_svn as well? Or are they available in some other way?

It would be nice if mod_dav_svn users could enjoy the same performance
improvements as svnserve users (unless it's not possible for some
technical reason of course).
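
For reference: judging from the short options used later in these
threads (-c for compression, -M for the memory cache size in MB, -F for
the open-file-count), a hypothetical svnserve invocation with the new
parameters could look like this (values made up):

[[[
$ svnserve -d -r /var/svn/repos -c1 -M1024 -F2000
]]]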

Cheers,
-- 
Johan


Re: Looking to improve performance of svn annotate

2010-08-10 Thread Johan Corveleyn
Hi all,

Other priorities have unfortunately kept me from focusing on the
project of speeding up blame. But recently I've spent some time
thinking about it, reading the other mail threads, studying the code
and profiling a little bit. I hope I can still do something useful for
blame, whether it be sooner or later.

I will first give some updates, info and discussion, and end up with a
couple of questions. Any input is greatly appreciated. A lot of this
stuff is still very fuzzy to me :-).

First, some profiling info (after sprinkling some printf's and
time-counters into blame.c):

- Most of the client-side blame-time (~90%) is spent on producing the
diffs (more specifically, the call to svn_diff_file_diff_2 in the
add_file_blame function in blame.c).

- Applying the deltas to produce full-texts in the first place accounts for ~5%.

- Time spent on producing the blame info (inserting/deleting blame
chunks) is negligible (< 0.5%).


Coming back with this information to the threads that Mark Phippard suggested:

On Mon, Mar 22, 2010 at 11:08 PM, Mark Phippard markp...@gmail.com wrote:
 If you haven't, you should review some of the old threads on speeding
 up blame.  Dan Berlin made a lot of improvements in the SVN 1.2
 timeframe and learned a lot about what does NOT speed it up.

 http://svn.haxx.se/dev/archive-2005-02/0275.shtml

This thread mainly ended up focusing on the delta combiner, with the
conclusion of replacing vdelta with xdelta. I think this was a very
good optimization of which we now reap the benefits. If I understand
my profiling numbers correctly, the delta combining stuff is no longer
the biggest bottleneck (at least not on the client-side).

There is however another interesting point/question from this thread, see below.

 You should really search around for all of the threads he commented on
 to get the complete picture.  I think he also provided ideas on what
 more could potentially be done in the future.  Such as this one that I
 do not recall was ever done.

 http://svn.haxx.se/dev/archive-2007-09/0459.shtml

 Maybe the thread will explain why.

The thread didn't really explain why it wasn't implemented, but the
poster said he had little time to work on it further. The suggestions
were improvements in the processing of the blame info (optimizations
in iterating through the linked list of blame chunks, and discussion
about the most optimal data structure to represent these blame
chunks). The thread sort of ended with the question whether this would
be a significant optimization:
On Tue, Jan 11, 2005, Mark Benedetto King wrote:
 I'd also be interested in profiling output from your 1750-rev file to
 see how much time is spent in the blame_* functions themselves vs
 the svn_diff_* functions, so that we can at least give a nod towards
 Amdahl's Law before we run off optimizing the wrong thing.

From my profiling data, it seems optimizing those blame_* functions
would indeed not be significant. Most of the time is spent in the
svn_diff_* functions.
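
To put rough numbers on Amdahl's Law here: the overall speedup is
1 / ((1 - p) + p/s), with p the fraction of time spent in the part being
optimized and s its speedup. With the blame_* functions at well under
0.5% of the time, even an infinite speedup there buys less than 0.5%
overall; the diff calls at ~90% are where any real gain has to come
from.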


Which brings us back to the idea of the binary blame, and Julian's
description of the algorithm. Now, some questions:

1) Dan Berlin's thread prompted an interesting response from Branko
Čibej (http://svn.haxx.se/dev/archive-2005-02/0270.shtml):
 If by diff format you mean the binary delta, then... There was a time
 when I thought this would be possible. Now I'm not so sure. The trouble
 is that the vdelta doesn't generate an edit stream, it generates a
 compressed block-copy stream. Which means that you can never be 100% sure,
 just from looking at the delta, which bytes in the target are really new
 and which are just (offset) copies from the source. The only blocks you
 can really be sure about are those that are represented by new data in
 the delta (either NEW blocks or target copies that take data from NEW
 blocks). This problem is made worse by our use of skip deltas in (at
 least) the BDB back-end.

 I agree that it would be nice if the server could generate some sort of
 byte-range oriented info, but I don't think it can be done just by
 looking at the deltas. It's sad, I know...

What? So this idea was considered before, but rejected? Is this
argument still valid (e.g. with xdelta instead of vdelta)? Is there no way
around that, does it mean this simply cannot work? Maybe Branko can
comment on that? An example or some more detailed explanation would be
very welcome.
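
For what it's worth, here is my understanding of the problem, expressed
in terms of the public delta window ops (svn_txdelta_op_t and its action
codes in svn_delta.h are real; the window contents below are made up for
illustration):

[[[
#include <stdio.h>
#include <svn_delta.h>

/* Walk a (made-up) delta window and report which target bytes are
 * provably new, i.e. produced by svn_txdelta_new ops.  Copy ops
 * (svn_txdelta_source / svn_txdelta_target) only say "taken from
 * elsewhere", which is exactly why the delta alone cannot tell whether
 * those target bytes should be attributed to this revision. */
static void
report_new_ranges(const svn_txdelta_op_t *ops, int num_ops)
{
  apr_size_t target_pos = 0;
  int i;

  for (i = 0; i < num_ops; i++)
    {
      if (ops[i].action_code == svn_txdelta_new)
        printf("target bytes [%lu, %lu) are new in this revision\n",
               (unsigned long)target_pos,
               (unsigned long)(target_pos + ops[i].length));
      /* source/target copies: origin unknown from the delta alone */
      target_pos += ops[i].length;
    }
}

int main(void)
{
  /* Hypothetical window: copy 100 bytes from the source, then 20 new bytes. */
  svn_txdelta_op_t ops[] = {
    { svn_txdelta_source, 40, 100 },
    { svn_txdelta_new,     0,  20 }
  };

  report_new_ranges(ops, 2);
  return 0;
}
]]]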

2) A concrete problem with the algorithm as described by Julian would
be the deletion of bytes within a line. E.g. suppose one deletes a
single character in r60 (oops, a boolean test was inverted, let's
delete that stray '!'). That deleted character wouldn't be
represented in the binary blame structure (it's simply cut out), and
it's obviously not present in the final text. But the line on which it
occurred was now last edited in r60.

Current blame (diff-based) has no problem with this situation. It
correctly shows r60 as the last revision for that line. But the binary

Re: Looking to improve performance of svn annotate

2010-08-17 Thread Johan Corveleyn
On Thu, Aug 12, 2010 at 5:30 PM, Greg Hudson ghud...@mit.edu wrote:
 On Thu, 2010-08-12 at 10:57 -0400, Julian Foad wrote:
 I'm wary of embedding any client functionality in the server, but I
 guess it's worth considering if it would be that useful.  If so, let's
 take great care to ensure it's only lightly coupled to the core server
 logic.

 Again, it's possible that binary diffs between sequential revisions
 could be used for blame purposes (not the binary deltas we have now, but
 edit-stream-style binary diffs), which would decouple the
 line-processing logic from the server.

 (But again, I haven't thought through the problem in enough detail to be
 certain.)

If such edit-stream-style binary diffs could do the job, and they are
fast enough (I'm guessing that line based vs. binary wouldn't make
that much of a difference for the eventual blame processing), it seems
like a good compromise: we get the performance benefits of
blame-oriented deltas (supposedly fast and easy to calculate blame
info from), possibly cached on the server, while still not introducing
unnecessary coupling of the server to line-processing logic.

Greg, could you explain a bit more what you mean by
edit-stream-style binary diffs, as opposed to the binary deltas we have now?
Could you perhaps give an example similar to Julian's? Wouldn't you
have the same problem with pieces of the source text being copied
out-of-order (100 bytes from the end/middle of the source being copied
to the beginning of the target, followed by the rest of the source)?
Wouldn't you also have to do the work of discovering the largest
contiguous block of source text as the main stream, so as to determine
that those first 100 bytes are to be interpreted as new bytes, etc.?
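
My current mental model of the distinction, which may well be wrong: a
binary delta says "build the target from these copy ops plus this new
data", while an edit-stream-style diff would say, in original order,
"keep this range, delete that range, insert these new bytes", so each
byte range can be attributed directly. Purely as an illustration of the
latter (not an existing Subversion type):

[[[
#include <apr.h>

/* Hypothetical edit-stream op -- not an existing Subversion type. */
typedef enum { ES_KEEP, ES_DELETE, ES_INSERT } es_action_t;

typedef struct es_op_t
{
  es_action_t action;   /* keep/delete refer to the original; insert adds new bytes */
  apr_size_t length;    /* number of bytes affected */
} es_op_t;

/* For blame, every ES_INSERT range would belong to the revision that
 * produced this diff; ES_KEEP ranges inherit their attribution from the
 * original. */
]]]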

Caching this stuff on the server would of course be ideal. Whether it
be post-commit or on-demand (first guy requesting the blame takes
the hit), both approaches seem good to me. Working on that would be
severely out of my league though :-). At least for now.


Another thing that occurred to me: since most time of the current
blame implementation is spent on diff (svn_diff_file_diff_2), maybe
a quick win could be to simply (?) optimize the diff code? Or write a
specialized faster version for blame.

In my tests with a 1.5 MB file (61000 lines), svn diffing it takes
about 500 ms on my machine. GNU diff is much faster (300 ms for the
first run, 72 ms on following runs). This seems to indicate that there
is much room for optimization of svn diff. Or is there something extra
that svn diff does, necessary in the svn context?

I have looked a little bit at the svn diff code, and saw that most of
the time is spent in the while loop inside svn_diff__get_tokens in
token.c, presumably extracting the tokens (lines) from the file(s).
Haven't looked any further/deeper. Anybody have any brilliant
ideas/suggestions? Or is this a bad idea, not worthy of further
exploration :-) ?

BTW, I also tested with Stefan Fuhrmann's performance bra...@r985697,
just for kicks (had some trouble building it on Windows, but
eventually managed to get an svn.exe out of it). The timing of svn
diff of such a large file was about the same, so that didn't help. But
maybe the branch isn't ready for prime time just yet ...

Cheers,
-- 
Johan


svn diff optimization to make blame faster?

2010-08-17 Thread Johan Corveleyn
Hi devs,

While "Looking to improve performance of svn annotate" [1], I found
that the current blame algorithm is mainly client-side bound, and that
most of its time is spent on svn diff (calls to svn_diff_file_diff_2
from add_file_blame in blame.c). Apart from avoiding building the
full-texts and diffing them altogether (which is the subject of further
discussion in [1]), I'm wondering if optimization of svn diff
wouldn't also be an interesting way to improve the speed of blame.

So the main question is: is it worth it to spend time to analyze this
further and try to improve performance? Or has this already been
optimized in the past, or is it simply already as optimal as it can
get? I have no idea really, so if anyone can shed some light ...

Gut feeling tells me that there must be room for optimization, since
GNU diff seems a lot faster than svn diff for the same large file
(with one line changed) on my machine [1]. But maybe svn's diff
algorithm is purposefully different (better? more accurate? ...) than
GNU's, or there are specific things in the svn context so svn diff has
to do more work.

Any thoughts?

-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-08/0408.shtml


Re: Performance branch ready for review

2010-08-18 Thread Johan Corveleyn
On Wed, Aug 18, 2010 at 9:14 PM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 Hi @all,

 I just finished my porting work; the performance branch
 is now fully synchronized with my prototype code.
 From my point of view, review can start now.

 According to my measurements, the code is now faster
 than the original prototype. Large caches provided, a
 single multi-threaded svnserve instance on a modern
 quad-core machine should be able to saturate a 10Gb
 server connection.

 Open issues / things still to do

 * there is an issue with log triggering an assertion()
  - I will investigate that next
 * test mod_web_dav and add FSFS cache configuration
  parameters to it
 * tune membuffer cache eviction strategy such that even
  small caches can have a large impact
 * add tests for the new APIs
 * provide APR patches.

 There are many things I would like to do but they may
 better be deferred to 1.8.

I tried compiling your branch on Windows (XP) with Visual C++ Express
2008 (which I also use successfully to build trunk). I had a couple of
issues. FWIW I'm listing them here. I'm not an expert, just
pragmatically trying to get the thing to build, so some of these
things may be user error. Eventually, I was able to build svn.exe,
but svnadmin.exe and svnserve.exe still fail for me.

1) In build.conf, I added private\svn_temp_serializer.h and
private\svn_file_handle_cache.h to the list of msvc-export of
libsvn_subr. Otherwise, the linker gave problems:

libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
symbol _svn_file_handle_cache__has_file referenced in function
_svn_fs_fs__path_rev_absolute
libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
symbol _svn_stream__from_cached_file_handle referenced in function
_get_node_revision_body
libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
symbol _svn_file_handle_cache__open referenced in function
_get_node_revision_body
libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
symbol _svn_file_handle_cache__get_apr_handle referenced in function
_open_pack_or_rev_file
libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
symbol _svn_file_handle_cache__flush referenced in function
_sync_file_handle_cache
libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
symbol _svn_file_handle_cache__close referenced in function
_svn_fs_fs__rev_get_root
libsvn_fs_fs-1.lib(caching.obj) : error LNK2019: unresolved external
symbol _svn_file_handle_cache__create_cache referenced in function
_get_global_file_handle_cache
libsvn_fs_fs-1.lib(id.obj) : error LNK2019: unresolved external symbol
_svn_temp_serializer__pop referenced in function
_svn_fs_fs__id_serialize
libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2001: unresolved
external symbol _svn_temp_serializer__pop
libsvn_fs_fs-1.lib(id.obj) : error LNK2019: unresolved external symbol
_svn_temp_serializer__push referenced in function
_svn_fs_fs__id_serialize
libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2001: unresolved
external symbol _svn_temp_serializer__push
libsvn_fs_fs-1.lib(id.obj) : error LNK2019: unresolved external symbol
_svn_temp_serializer__add_string referenced in function
_serialize_id_private
libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2001: unresolved
external symbol _svn_temp_serializer__add_string
libsvn_fs_fs-1.lib(dag.obj) : error LNK2001: unresolved external
symbol _svn_temp_serializer__add_string
libsvn_fs_fs-1.lib(id.obj) : error LNK2019: unresolved external symbol
_svn_temp_deserializer__resolve referenced in function
_svn_fs_fs__id_deserialize
libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2001: unresolved
external symbol _svn_temp_deserializer__resolve
libsvn_fs_fs-1.lib(dag.obj) : error LNK2001: unresolved external
symbol _svn_temp_deserializer__resolve
libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2019: unresolved
external symbol _svn_temp_serializer__get referenced in function
_svn_fs_fs__serialize_txdelta_window
libsvn_fs_fs-1.lib(dag.obj) : error LNK2001: unresolved external
symbol _svn_temp_serializer__get
libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2019: unresolved
external symbol _svn_temp_serializer__init referenced in function
_svn_fs_fs__serialize_txdelta_window
libsvn_fs_fs-1.lib(dag.obj) : error LNK2001: unresolved external
symbol _svn_temp_serializer__init
libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2019: unresolved
external symbol _svn_temp_deserializer__ptr referenced in function
_svn_fs_fs__extract_dir_entry
libsvn_fs_fs-1.lib(dag.obj) : error LNK2019: unresolved external
symbol _svn_temp_serializer__set_null referenced in function
_svn_fs_fs__dag_serialize
..\..\..\Debug\subversion\libsvn_fs\libsvn_fs-1.dll : fatal error
LNK1120: 15 unresolved externals

Here's a patch that solves this for me:
[[[
Index: build.conf
===
--- build.conf  (revision 986928)
+++ build.conf  (working copy)
@@ -322,6 +322,7 @@

Re: svn diff optimization to make blame faster?

2010-08-19 Thread Johan Corveleyn
On Wed, Aug 18, 2010 at 3:44 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Wed, Aug 18, 2010 at 12:49 PM, Stefan Sperling s...@elego.de wrote:
 Can you show a profiler run that illustrates where the client is
 spending most of its time during diff? That would probably help with
 getting opinions from people, because it saves them from spending time
 doing this research themselves.
 You've already hinted at svn_diff__get_tokens() in another mail, but
 a real profiler run would show more candidates.

 Sorry, I'm still very much a beginner in C (I've been programming for
 10 years, but mainly in Java). Especially the tooling is a steep part
 of the learning curve :-), so I don't know (how to use) a profiler
 yet. Any suggestions? I'm on Windows (XP), using VCE 2008, and used
 cygwin to compare with GNU diff.

I googled around a bit for C profilers on Windows. I found several
which seem useful:
- very sleepy (http://www.codersnotes.com/sleepy/sleepy)
- Shiny (http://sourceforge.net/projects/shinyprofiler/)
- AMD CodeAnalyst (http://developer.amd.com/cpu/CodeAnalyst/Pages/default.aspx)
- AQTime - not free, but has a trial of 30 days
(http://www.automatedqa.com/products/aqtime/)

Before I dive in and try them out: any preference/favorites from the
windows devs on this list? Or other suggestions?

Further, some additional context to the manual-profiling numbers: see below.

 For the time being, I've helped myself using apr_time_now() from
 apr_time.h and printf statements. Not terribly accurate and probably
 somewhat overheadish, but it gives me some hints about the
 bottlenecks.

 FWIW, below is the output of a recent run with my instrumented trunk
 build. I've instrumented svn_diff_diff in diff.c and
 svn_diff__get_tokens in token.c. I checked out bigfile.xml from a
 repository, and edited a single line of it in my working copy. The two
 statements Starting diff and Diff finished are the first and last
 statements inside the svn_diff_diff function. These are numbers from
 the second run (any following runs show approximately the same
 numbers).

 $ time svn diff bigfile.xml
 Starting diff ... (entered svn_diff_diff in diff.c)
  - calling svn_diff__get_tokens for svn_diff_datasource_original
      == svn_diff__get_tokens datasource_open: 0 usec
      == svn_diff__get_tokens while loop: 265625 usec
         calls to datasource_get_next_token: 62500 usec
                svn_diff__tree_insert_token: 171875 usec
                                       rest: 15625 usec
  - done: 281250 usec
  - calling svn_diff__get_tokens for svn_diff_datasource_modified
      == svn_diff__get_tokens datasource_open: 234375 usec
      == svn_diff__get_tokens while loop: 312500 usec
         calls to datasource_get_next_token: 62500 usec
                svn_diff__tree_insert_token: 171875 usec
                                       rest: 46875 usec
  - done: 562500 usec
  - calling svn_diff__lcs
  - done: 0 usec
  - calling svn_diff__diff
  - done: 0 usec
 Diff finished in 875000 usec (875 millis)
 Index: bigfile.xml
 ===
 [snip]

 real    0m1.266s
 user    0m0.015s
 sys     0m0.031s

For comparison: a debug build from trunk, with only one
instrumentation spot (at start and end of svn_diff_diff):
$ time svn diff bigfile.xml
Starting diff ... (entered svn_diff_diff in diff.c)
Diff finished in 703125 usec (703 millis)
[snip]
real    0m1.109s
user    0m0.015s
sys     0m0.031s

So the instrumentation in the critical loop probably costs me around
150-200 ms.
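
(For anyone wanting to reproduce this: the instrumentation is nothing
more than apr_time_now() pairs around the interesting calls, roughly
like the sketch below; the real changes are just scattered through
diff.c and token.c.)

[[[
#include <stdio.h>
#include <apr_time.h>

/* Minimal sketch of the manual instrumentation: wrap the call being
 * measured in a pair of apr_time_now() calls and print the elapsed
 * wall-clock time in microseconds.  interesting_call() stands in for
 * e.g. the svn_diff__get_tokens() calls. */
static void
timed_call(void (*interesting_call)(void), const char *label)
{
  apr_time_t start = apr_time_now();

  interesting_call();

  printf("  == %s: %" APR_TIME_T_FMT " usec\n",
         label, apr_time_now() - start);
}
]]]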

A release build will also probably be faster, but I have no time to
make one and test it now. If I time my 1.6.5 build from tigris.org,
that's still a lot faster:
$ time svn diff bigfile.xml
[snip]
real    0m0.468s
user    0m0.015s
sys     0m0.031s

Maybe that can be totally attributed to the build (release vs. debug),
or maybe 1.6.5 was faster at svn diff than trunk is?

 Some observations:
 - svn_diff__get_tokens takes most of the time
 - for some reason, in the case of datasource_modified, the call to
 datasource_open takes a long time (234 ms). In case of
 datasource_original, it's instantaneous. Maybe because of translation,
 ... of the pristine file? But I'd think that would be the original??
 - apart from that, most of the time goes to the while loop in
 svn_diff__get_tokens.
 - Inside the while loop, most time is spent on calls to
 svn_diff__tree_insert_token (which compares tokens (=lines) to insert
 them into some tree structure). Calls to datasource_get_next_token
 also take some time.

 I'm not too sure of the accuracy of those last numbers with my simple
 profiling method, because it's the accumulated time of calls inside a
 while loop (with 61000 iterations).


 For completeness, the same diff with /usr/bin/diff (under cygwin), of
 the edited bigfile.xml vs. the original, as of the second diff run:

 $ ls -l settings.xml
 -rwxr-xr-x+ 1 User None 1447693 2010-08-17 14:20 bigfile.xml
 $ time diff

Re: svn diff optimization to make blame faster?

2010-08-20 Thread Johan Corveleyn
On Thu, Aug 19, 2010 at 11:38 PM, Johan Corveleyn jcor...@gmail.com wrote:
 Feh, I just redid my apr_time_now+printf profiling with a release
 build (of trunk), instead of a debug build, and that makes a *big*
 difference. Total time of the svn_diff_diff call is now down to ~300
 ms. Still a lot slower than GNU diff (78 ms), but a lot better than
 with the debug build. That will teach me not to do performance tests
 with a debug build.

 Still, the challenge holds: why is svn diff ~4 times slower than GNU diff?

 Also, still puzzling to me: why does datasource_open take so long in
 the modified case? Now it takes twice as long as the while loop ...

@#...@#!!!
/me bangs head against wall

The slowness of datasource_open was because of my antivirus (AVG free
edition). After disabling it, it's completely gone.

Running a release build, with antivirus disabled, and minimal
instrumentation brings svn diff performance (after a couple of runs)
more or less on par with GNU diff:

$ time svn diff bigfile.xml
Starting diff ... (entered svn_diff_diff in diff.c)
Diff finished in 78125 usec (78 millis)
[snip]
real    0m0.359s
user    0m0.015s
sys     0m0.031s

I guess this closes this thread ... on to more useful stuff. There is
no point in trying to optimize svn diff, so I guess blame can only be
sped up by more fundamental changes (i.e. avoiding diffs, which still
amount to ~90% of blame time on the client side).

Sorry for any wasted time.

(on the bright side, I learned quite a lot about accurately measuring
performance)

Cheers,
-- 
Johan


Re: Looking to improve performance of svn annotate

2010-08-20 Thread Johan Corveleyn
On Tue, Aug 17, 2010 at 3:26 PM, Johan Corveleyn jcor...@gmail.com wrote:
 Another thing that occurred to me: since most time of the current
 blame implementation is spent on diff (svn_diff_file_diff_2), maybe
 a quick win could be to simply (?) optimize the diff code? Or write a
 specialized faster version for blame.

 In my tests with a 1.5 MB file (61000 lines), svn diffing it takes
 about 500 ms on my machine. GNU diff is much faster (300 ms for the
 first run, 72 ms on following runs). This seems to indicate that there
 is much room for optimization of svn diff. Or is there something extra
 that svn diff does, necessary in the svn context?

 I have looked a little bit at the svn diff code, and saw that most of
 the time is spent in the while loop inside svn_diff__get_tokens in
 token.c, presumably extracting the tokens (lines) from the file(s).
 Haven't looked any further/deeper. Anybody have any brilliant
 ideas/suggestions? Or is this a bad idea, not worthy of further
 exploration :-) ?

As noted in http://svn.haxx.se/dev/archive-2010-08/0517.shtml, my
performance measurements of svn diff were quite inaccurate, to say the
least.

After eliminating antivirus, and running with a release build instead
of a debug build, svn diff is just about on par with GNU diff. So this
eliminates the option of optimizing diff ...

Cheers,
-- 
Johan


Re: Looking to improve performance of svn annotate

2010-08-20 Thread Johan Corveleyn
On Fri, Aug 20, 2010 at 1:40 PM, Johan Corveleyn jcor...@gmail.com wrote:
 After eliminating antivirus, and running with a release build instead
 of a debug build, svn diff is just about on par with GNU diff. So this
 eliminates the option of optimizing diff ...

Unless ...

For every diff during blame calculation, tokens (lines) are extracted
and processed each time for the original and the modified. This
takes up most of the time of svn diff. However, the file that is
playing the modified role in one step, will be the original in the
next step of blame processing. So in theory we already have those
tokens from the previous step, and don't have to extract/compare/...
them again.

If one could find a way to recycle the tokens from the modified of
the previous diff into the tokens of the original of the next diff,
that would probably make the diff part of the blame operation almost
twice as fast. And since diffing still accounts for ~90% of blame time
on the client, that should speed it up considerably.

Sounds like a plan?

I'll try to write some sort of POC for this idea soon, unless someone
tells me it's a stupid idea :-).
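
Very roughly, the shape I have in mind (hypothetical names, not the real
blame.c or libsvn_diff interfaces):

[[[
#include <svn_error.h>
#include <svn_pools.h>

/* Hypothetical sketch of the recycling idea -- none of these names
 * exist today. */
struct recycled_tokens_baton
{
  /* Token data of the file that played the "modified" role in the
   * previous diff; it becomes the "original" of the next diff. */
  void *prev_modified_tokens;
};

static svn_error_t *
diff_with_recycled_tokens(struct recycled_tokens_baton *baton,
                          const char *original_path,
                          const char *modified_path,
                          apr_pool_t *pool)
{
  /* 1. If baton->prev_modified_tokens is set, reuse it as the
   *    "original" side instead of re-reading and re-tokenizing
   *    original_path.
   * 2. Tokenize modified_path as usual and run the diff.
   * 3. Stash the "modified" tokens into baton->prev_modified_tokens
   *    for the next revision's diff. */
  return SVN_NO_ERROR;
}
]]]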

Cheers,
-- 
Johan


Re: Performance branch ready for review

2010-08-23 Thread Johan Corveleyn
On Sun, Aug 22, 2010 at 2:56 PM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 Johan Corveleyn wrote:

 On Wed, Aug 18, 2010 at 9:14 PM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:


 Hi @all,

 I just finished my porting work; the performance branch
 is now fully synchronized with my prototype code.
 From my point of view, review can start now.

 According to my measurements, the code is now faster
 than the original prototype. Large caches provided, a
 single multi-threaded svnserve instance on a modern
 quad-core machine should be able to saturate a 10Gb
 server connection.

 Open issues / things still to do

 * there is an issue with log triggering an assertion()
  - I will investigate that next
 * test mod_web_dav and add FSFS cache configuration
  parameters to it
 * tune membuffer cache eviction strategy such that even
  small caches can have a large impact
 * add tests for the new APIs
 * provide APR patches.

 There are many things I would like to do but they may
 better be deferred to 1.8.


 I tried compiling your branch on Windows (XP) with Visual C++ Express
 2008 (which I also use successfully to build trunk). I had a couple of
 issues. FWIW I'm listing them here. I'm not an expert, just
 pragmatically trying to get the thing to build, so some of these
 things may be user error. Eventually, I was able to build svn.exe,
 but svnadmin.exe and svnserve.exe still fail for me.

 1) In build.conf, I added private\svn_temp_serializer.h and
 private\svn_file_handle_cache.h to the list of msvc-export of
 libsvn_subr. Otherwise, the linker gave problems:

 libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
 symbol _svn_file_handle_cache__has_file referenced in function
 _svn_fs_fs__path_rev_absolute
 libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
 symbol _svn_stream__from_cached_file_handle referenced in function
 _get_node_revision_body
 libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
 symbol _svn_file_handle_cache__open referenced in function
 _get_node_revision_body
 libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
 symbol _svn_file_handle_cache__get_apr_handle referenced in function
 _open_pack_or_rev_file
 libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
 symbol _svn_file_handle_cache__flush referenced in function
 _sync_file_handle_cache
 libsvn_fs_fs-1.lib(fs_fs.obj) : error LNK2019: unresolved external
 symbol _svn_file_handle_cache__close referenced in function
 _svn_fs_fs__rev_get_root
 libsvn_fs_fs-1.lib(caching.obj) : error LNK2019: unresolved external
 symbol _svn_file_handle_cache__create_cache referenced in function
 _get_global_file_handle_cache
 libsvn_fs_fs-1.lib(id.obj) : error LNK2019: unresolved external symbol
 _svn_temp_serializer__pop referenced in function
 _svn_fs_fs__id_serialize
 libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2001: unresolved
 external symbol _svn_temp_serializer__pop
 libsvn_fs_fs-1.lib(id.obj) : error LNK2019: unresolved external symbol
 _svn_temp_serializer__push referenced in function
 _svn_fs_fs__id_serialize
 libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2001: unresolved
 external symbol _svn_temp_serializer__push
 libsvn_fs_fs-1.lib(id.obj) : error LNK2019: unresolved external symbol
 _svn_temp_serializer__add_string referenced in function
 _serialize_id_private
 libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2001: unresolved
 external symbol _svn_temp_serializer__add_string
 libsvn_fs_fs-1.lib(dag.obj) : error LNK2001: unresolved external
 symbol _svn_temp_serializer__add_string
 libsvn_fs_fs-1.lib(id.obj) : error LNK2019: unresolved external symbol
 _svn_temp_deserializer__resolve referenced in function
 _svn_fs_fs__id_deserialize
 libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2001: unresolved
 external symbol _svn_temp_deserializer__resolve
 libsvn_fs_fs-1.lib(dag.obj) : error LNK2001: unresolved external
 symbol _svn_temp_deserializer__resolve
 libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2019: unresolved
 external symbol _svn_temp_serializer__get referenced in function
 _svn_fs_fs__serialize_txdelta_window
 libsvn_fs_fs-1.lib(dag.obj) : error LNK2001: unresolved external
 symbol _svn_temp_serializer__get
 libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2019: unresolved
 external symbol _svn_temp_serializer__init referenced in function
 _svn_fs_fs__serialize_txdelta_window
 libsvn_fs_fs-1.lib(dag.obj) : error LNK2001: unresolved external
 symbol _svn_temp_serializer__init
 libsvn_fs_fs-1.lib(temp_serializer.obj) : error LNK2019: unresolved
 external symbol _svn_temp_deserializer__ptr referenced in function
 _svn_fs_fs__extract_dir_entry
 libsvn_fs_fs-1.lib(dag.obj) : error LNK2019: unresolved external
 symbol _svn_temp_serializer__set_null referenced in function
 _svn_fs_fs__dag_serialize
 ..\..\..\Debug\subversion\libsvn_fs\libsvn_fs-1.dll : fatal error
 LNK1120: 15 unresolved externals

 Here's a patch that solves

Performance branch - svnserve crash in fs_history_prev

2010-08-23 Thread Johan Corveleyn
Hi,

I've taken the performance branch for a spin. Some of the performance
increases are awesome (svn log is ~4 times faster on my machine
(tested with a file with 300 revisions)).

However, I also experienced a crash of svnserve, for both svn log
and svn blame of a big file with 2000 revisions (so this is quite
an extreme case). See both .log files in attachment. Something goes
wrong in svn_fs_fs__id_deserialize, during processing of
fs_history_prev.

I also have crash dump files of the crashes. If you want those, let me
know and I'll send them via private email.

This was on Windows XP, a release build with VCE 2008.

Cheers,
-- 
Johan


Re: Performance branch - svnserve crash in fs_history_prev

2010-08-23 Thread Johan Corveleyn
Trying again with .txt extension added.

Johan

On Mon, Aug 23, 2010 at 11:40 AM, Julian Foad julian.f...@wandisco.com wrote:
 On Mon, 2010-08-23, Lieven Govaerts wrote:
 Either you forgot the attachments, or they were dropped by our mailing
 sw. Try adding a .txt extension.

 AFAIK, the mailing list strips attachments of MIME type
 'application/octet-stream' and a bunch of other types - see
 https://issues.apache.org/jira/browse/INFRA-1194.  Regrettably, it
 doesn't inform either the sender or the recipients that it's done so.

 - Julian




Process info:
Cmd line: 
C:\research\svn\client_build\svn_branch_performance\dist\bin\svnserve.exe -d -r 
c:/research/svn/experiment/repos
Working Dir: C:\research\svn\experiment\repos
Version:  1.7.0 (dev build), compiled Aug 22 2010, 21:57:27
Platform: Windows OS version 5.1 build 2600 Service Pack 3

Exception: ACCESS_VIOLATION

Registers:
eax=75262f72 ebx=00c2ecc8 ecx=00c2ee6c edx=00c2ee68 esi=00c2ee6c edi=019f
eip=003a4dbf esp=00d7fb5c ebp=0001 efl=00010206
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=

Stacktrace:
#1  0x003a4dbf in svn_fs_fs__id_deserialize()
buffer = 0x003a7031
id = 
#2  0x003a7031 in svn_fs_fs__dag_deserialize()
out = 
data = 
data_len = c2ecc8
pool = 
#3  0x00426d83 in membuffer_cache_get()
#4  0x00426f70 in svn_membuffer_cache_get()
value_p = 
found = 0x00d7fc30
cache_void = 0x00d7fc34
key = 
pool = 
#5  0x0042737a in svn_cache__get()
value_p = 
found = 0x00d7fc30
cache = 
key = 
pool = 
#6  0x0039fde5 in dag_node_cache_get()
node_p = 
#7  0x003a20c6 in open_path()
parent_path_p = 
root = 
path = 
flags = 12991688
txn_id = 
pool = 
fs = 
parent_path = 
path_so_far = 
here = 
entry = 0x00bb2148
next = 
child = 
inherit = 
copy_path = 
cached_node = 
#8  0x003a3d05 in history_prev()
#9  0x003a3ff5 in fs_history_prev()
prev_history_p = 
history = 
cross_copies = 12991760
pool = 
prev_history = 
args = 
#10  0x003d54fd in get_history()
#11  0x003d6708 in do_logs()
fs = 
paths = 
hist_start = c1e110
hist_end = 0
limit = 96305
discover_changed_paths = 0
strict_node_history = 0
include_merged_revisions = 0
revprops = 
descending_order = 12706024
receiver = 
receiver_baton = 0x00405140
authz_read_func = 
authz_read_baton = 0x00402480
pool = 
iterpool = 
any_histories_left = 0
histories = 
rev_mergeinfo = 
revs = 
i = 12934224
current = 17831
send_count = 12706696
mergeinfo = 
has_children = 0
mergeinfo = 
#12  0x003d6cc9 in svn_repos_get_logs4()
repos = 
paths = 
start = c1e110
end = 0
limit = 96305
discover_changed_paths = 0
strict_node_history = 0
include_merged_revisions = 0
revprops = 
authz_read_func = 
authz_read_baton = 0x00402480
receiver = 
receiver_baton = 0x00405140
pool = 
head = b9a0b0
fs = 
descending_order = 96305
send_count = C1E27000BB2148
#13  0x0040559c in log_cmd()
conn = 
pool = 
params = 
baton = 0x00c1dc80
lb = 
full_paths = 
include_merged_revisions = 12704984
limit = C1DEF8
paths = 
revprop_items = 
changed_paths = 0
include_merged_revs_param = 0
start_rev = 0
end_rev = 0
strict_node = 12706064
#14  0x0040962b in svn_ra_svn_handle_commands2()
conn = 
pool = 
commands = 
baton = 0x0040e320
error_on_disconnect = 14155532
subpool = 
cmdname = 
params = 
cmd_hash = 
#15  0x0040781f in serve()
conn = 
params = 
pool = 
warn_baton = 
uuid = 
client_string = 
cap_log = 
client_url = 
ra_client_string = 
b = 
ver = 200C15508
cap_words = 
caplist = 
supports_mergeinfo = 12173800
#16  0x00401325 in serve_thread()
tid = 
data = 0x00b9c1c0
#17  0x6eed47b0 in dummy_worker()
opaque = 0x78543433
#18  0x78543433 in endthreadex()
#19  0x785434c7 in endthreadex()
#20  0x7c80b729 in GetModuleFileNameA()


Loaded modules:
0x0040  
C:\research\svn\client_build\svn_branch_performance\dist\bin\svnserve.exe 
(1.7.0.0, 94208 bytes)
0x7c90  C:\Windows\system32\ntdll.dll (5.1.2600.5755, 729088 bytes)
0x7c80  C:\Windows\system32\kernel32.dll (5.1.2600.5781, 1007616 bytes)
0x77dd  C:\Windows\system32\advapi32.dll (5.1.2600.5755, 634880 bytes)

Re: Performance branch ready for review

2010-08-23 Thread Johan Corveleyn
On Tue, Aug 24, 2010 at 12:22 AM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 Daniel Shahaf wrote:

 Stefan, you did mention "Patch by" for Johan's patches which you
 committed, did you intend to mention "Found by" or "Suggested by" for
 the other two (quoted below)?


 http://subversion.apache.org/docs/community-guide/conventions.html#crediting

 Thanks,

 Daniel


 Oh, I just was not aware that there are tons of "... by" schemes.
 r987868 and r987869 now rightfully mention Johan.

Thanks. Not terribly important to me, but nice anyway.

I just hope some of your performance-work makes it into 1.7. So
anything I can do to help ...

Cheers,
-- 
Johan


Re: svn diff optimization to make blame faster?

2010-08-24 Thread Johan Corveleyn
On Sun, Aug 22, 2010 at 4:02 PM, Branko Čibej br...@xbc.nu wrote:
 On 18.08.2010 00:59, Johan Corveleyn wrote:
 Hi devs,

 While "Looking to improve performance of svn annotate" [1], I found
 that the current blame algorithm is mainly client-side bound, and that
 most of its time is spent on svn diff (calls to svn_diff_file_diff_2
 from add_file_blame in blame.c). Apart from avoiding building the
 full-texts and diffing them altogether (which is the subject of further
 discussion in [1]), I'm wondering if optimization of svn diff
 wouldn't also be an interesting way to improve the speed of blame.

 So the main question is: is it worth it to spend time to analyze this
 further and try to improve performance? Or has this already been
 optimized in the past, or is it simply already as optimal as it can
 get? I have no idea really, so if anyone can shed some light ...

 Gut feeling tells me that there must be room for optimization, since
 GNU diff seems a lot faster than svn diff for the same large file
 (with one line changed) on my machine [1]. But maybe svn's diff
 algorithm is purposefully different (better? more accurate? ...) than
 GNU's, or there are specific things in the svn context so svn diff has
 to do more work.

 Any thoughts?


 svn_diff uses basically the same algorithm as GNU diff but implemented
 slightly differently and IIRC it doesn't have some of GNU diff's
 optimizations. I'm sure it can be speeded up, but haven't a clue about
 how much.

Ok, thanks. In the meantime I saw that there is not that much
difference anymore between GNU diff and svn_diff, after running the
latter from a release build, and disabling my anti-virus (which makes
me wonder why my anti-virus slows down svn_diff (impact when opening
the modified datasource), but not GNU diff).

There may still be some slight speed difference (enough to be
significant for a blame operation doing 100's or 1000's of diffs), but
not as much as I thought at first. So I don't think I'm going to
spend more time on trying to speed up svn_diff (also, I'm not really
an expert at optimizing C code, so ... I'll leave that to others :-)).

Cheers,
-- 
Johan


Re: Looking to improve performance of svn annotate

2010-08-24 Thread Johan Corveleyn
On Fri, Aug 20, 2010 at 9:11 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Fri, Aug 20, 2010 at 1:40 PM, Johan Corveleyn jcor...@gmail.com wrote:
 After eliminating antivirus, and running with a release build instead
 of a debug build, svn diff is just about on par with GNU diff. So this
 eliminates the option of optimizing diff ...

 Unless ...

 For every diff during blame calculation, tokens (lines) are extracted
 and processed each time for the original and the modified. This
 takes up most of the time of svn diff. However, the file that is
 playing the modified role in one step, will be the original in the
 next step of blame processing. So in theory we already have those
 tokens from the previous step, and don't have to extract/compare/...
 them again.

 If one could find a way to recycle the tokens from the modified of
 the previous diff into the tokens of the original of the next diff,
 that would probably make the diff part of the blame operation almost
 twice as fast. And since diffing still accounts for ~90% of blame time
 on the client, that should speed it up considerably.

 Sounds like a plan?

 I'll try to write some sort of POC for this idea soon, unless someone
 tells me it's a stupid idea :-).

Hm, after several attempts to get this working, it seems to be a dead
end. Whatever I do, I can't seem to avoid lifetime issues, dangling
pointers, tokens referring to files which are no longer there, ...

I first tried just to pass through the position_list[1] (position_list
from the modified file) to the next diff run as the new
position_list[0]. I did this by returning the position_list[1] in an
output parameter, up to the call in blame.c!add_file_blame, and then
reusing that as input parameter for the next diff call (by stashing it
away in the file_rev_baton, until the next run). That works, but it
doesn't help, because the real power of the algorithm is in the tree,
in which all the tokens from both original and modified are sorted and
made unique (i.e. each unique token is represented only once in the
tree). After the tree is correctly set up, the rest of the algo is
very fast, because it only needs to compare token references to find
the longest common subsequence.
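
(For readers who haven't looked at the libsvn_diff internals, the
structures involved look roughly like this; heavily simplified, with
approximate names:)

[[[
#include <apr.h>

/* One node per *unique* token (line); identical lines from the original
 * and the modified file end up pointing at the same node. */
struct node_sketch
{
  struct node_sketch *left, *right;   /* binary tree links */
  void *token;                        /* the line data, owned elsewhere */
};

/* One entry per line *occurrence*, in file order.  The LCS phase only
 * compares node pointers, never the line contents themselves. */
struct position_sketch
{
  struct position_sketch *next;
  struct node_sketch *node;
  apr_off_t offset;                   /* position of the line in the file */
};
]]]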

At the end of each run the tree is filled with a combination of the
tokens of original and modified, so I can't just reuse/recycle that,
because that would lead to a major memory leak (building up the tree
with all the tokens from all revisions of the entire blame operation).
And refilling a clean tree with the tokens already present in the
passed-on-position_list also doesn't work, because the token content
really needs to still be accessible (either in memory, or readable
from the original file), so that token_compare can work when adding
the new tokens from the new modified file ...

Anyway, I don't know if this still makes sense. I'm just noting it
down to order my thoughts, and maybe help someone else thinking about
this. I'm gonna let it rest now, because I seem to have hit a brick
wall with this.

Will focus my efforts on other approaches to speed up blame, such as
the fundamental changes of binary blaming, avoiding diffs, ... but I'm
not sure if I'll be able to (have the time to) climb the learning
curve for that.

Cheers,
-- 
Johan


Re: Performance branch - svnserve crash in fs_history_prev

2010-08-25 Thread Johan Corveleyn
On Wed, Aug 25, 2010 at 11:34 AM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 Johan Corveleyn wrote:
 However, I also experienced a crash of svnserve, for both svn log
 and svn blame of a big file with 2000 revisions (so this is quite
 an extreme case). See both .log files in attachment. Something goes
 wrong in svn_fs_fs__id_deserialize, during processing of
 fs_history_prev.


 Hm, I couldn't reproduce that problem so far even with large
 repositories. Could you rerun your tests in debug mode?
 The stack trace should be more reliable then (e.g. data_len
 on level #3 is either bogus or the reason for the crash).

Ok, here they are in attachment (one for blame and one for log).
Reproduced with a debug build (of branches/performa...@now), on the
same machine.

Crashed with both blame and log of large file with large number of
revisions (but I don't know if that's actually relevant, maybe it's
not related). With blame it crashed much sooner than my release build did.

Cheers,
-- 
Johan

Process info:
Cmd line: 
C:\research\svn\client_build\svn_branch_performance\dist\bin\svnserve.exe -d -r 
c:/research/svn/experiment/repos
Working Dir: C:\research\svn
Version:  1.7.0 (dev build), compiled Aug 19 2010, 00:50:40
Platform: Windows OS version 5.1 build 2600 Service Pack 3

Exception: ACCESS_VIOLATION

Registers:
eax=dc307ce7 ebx=00d31370 ecx=00db54eb edx=00db54eb esi=00f0f8d0 edi=00f0f930
eip=003bb02c esp=00f0f8b4 ebp=00f0f8b4 efl=00010216
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=

Stacktrace:
#1  0x003bb02c in svn_fs_fs__id_deserialize()
buffer = 0x00db54e7
id = 
#2  0x003bb5ca in svn_fs_fs__noderev_deserialize()
buffer = 0x00db54e8
noderev_p = 
noderev = 
#3  0x003bfaa9 in svn_fs_fs__dag_deserialize()
out = 
data = 
data_len = 1e0
pool = 
node = 
#4  0x004bcdfe in membuffer_cache_get()
cache = 
key = 
key_len = 31
item = 
deserializer = 
pool = 
buffer = 0x00db5308 
group_index = faa
to_find = 
entry = 
#5  0x004bcb64 in svn_membuffer_cache_get()
svn_err__temp = 
value_p = 
found = 0x00f0f9e8
cache_void = 0x00d6bc40
key = 
pool = 
full_key = 0x00db52b8
cache = 
full_key_len = 31
#6  0x004bdc63 in svn_cache__get()
value_p = 
found = 0x00f0f9e8
cache = 
key = 
pool = 
#7  0x003b1188 in dag_node_cache_get()
svn_err__temp = 
node_p = 
root = 
path = 
pool = 
node = 
cache = 
key = 
found = 0
#8  0x003b1436 in open_path()
svn_err__temp = 
inherit = 
err = 
copy_path = 
cached_node = 
entry = 0x00db5258 s9server
next = 
child = 
parent_path_p = 
root = 
path = 
flags = 0
txn_id = 
pool = 
canon_path = 
fs = 
parent_path = 
rest = 
path_so_far = 
here = 
#9  0x003b85b3 in history_prev()
svn_err__temp = 
baton = 0x00f0fbb4
pool = 
node = 
path = 
commit_rev = 
retpool = 
retry = 0
copyroot_path = 
fhd = 
history = 
revision = a1d1
fs = 
src_path = 
parent_path = 
copyroot_rev = 
commit_path = 
src_rev = 
dst_rev = 
root = 
reported = 1
prev_history = 
args = 
#10  0x003b83d0 in fs_history_prev()
svn_err__temp = 
args = 
prev_history_p = 
history = 
cross_copies = 1
pool = 
fhd = 
fs = 
prev_history = 
#11  0x00395c26 in svn_fs_history_prev()
prev_history_p = 
history = 
cross_copies = 1
pool = 
#12  0x00631c86 in find_interesting_revisions()
svn_err__temp = 
tmp_pool = 
path_rev = 
path_revisions = 
repos = 
path = 
start = 0
end = 17831
include_merged_revisions = 0
mark_as_merged = 0
duplicate_path_revs = 
authz_read_func = 
authz_read_baton = 0x00f0fe9c
pool = 
history = 
iter_pool = 
last_pool = 
root = 
kind = 
#13  0x00631706 in svn_repos_get_file_revs2()
svn_err__temp = 
repos = 
path = 
start = 0
end = 17831
include_merged_revisions = 0
authz_read_func = 
authz_read_baton = 0x00f0fe9c
handler = 
handler_baton = 0x00f0fdb0
pool = 
duplicate_path_revs = 
mainline_pos = -858993460
merged_path_revisions = 
merged_pos = -858993460
sb = 
mainline_path_revisions = 
#14  0x0040d45a in get_file_revs()
conn

Re: svn commit: r990537 - /subversion/branches/performance/subversion/libsvn_subr/svn_temp_serializer.c

2010-08-29 Thread Johan Corveleyn
Hi Stefan,

On Sun, Aug 29, 2010 at 12:32 PM,  stef...@apache.org wrote:
 Author: stefan2
 Date: Sun Aug 29 10:32:08 2010
 New Revision: 990537

 URL: http://svn.apache.org/viewvc?rev=990537&view=rev
 Log:
 Looking for the cause of Johan Corveleyn's crash (see
 http://svn.haxx.se/dev/archive-2010-08/0652.shtml), it
 seems that wrong / corrupted data contains backward
 pointers, i.e. negative offsets. That cannot happen if
 everything works as intended.

I've just retried my test after this change (actually with
performance-bra...@990579, so updated just 10 minutes ago). Now I get
the assertion error, after running log or blame on that particular
file:

[[[
$ svnserve -d -r c:/research/svn/experiment/repos
Assertion failed: *ptr > buffer, file
..\..\..\subversion\libsvn_subr\svn_temp_serializer.c, line 282

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
]]]

Is there any way I can find more information about this failure, so I
can help you diagnose the problem?

Just to be clear: the very same repos does not have this problem when
accessed by a trunk svnserve.

-- 
Johan


Re: svn commit: r990537 - /subversion/branches/performance/subversion/libsvn_subr/svn_temp_ser ializer.c

2010-08-29 Thread Johan Corveleyn
On Sun, Aug 29, 2010 at 4:36 PM, Daniel Shahaf d...@daniel.shahaf.name wrote:

 Johan Corveleyn wrote on Sun, Aug 29, 2010 at 16:24:24 +0200:
  Hi Stefan,
 
  On Sun, Aug 29, 2010 at 12:32 PM,  stef...@apache.org wrote:
   Author: stefan2
   Date: Sun Aug 29 10:32:08 2010
   New Revision: 990537
  
    URL: http://svn.apache.org/viewvc?rev=990537&view=rev
   Log:
   Looking for the cause of Johan Corveleyn's crash (see
   http://svn.haxx.se/dev/archive-2010-08/0652.shtml), it
   seems that wrong / corrupted data contains backward
   pointers, i.e. negative offsets. That cannot happen if
   everything works as intended.
 
  I've just retried my test after this change (actually with
  performance-bra...@990579, so updated just 10 minutes ago). Now I get
  the assertion error, after running log or blame on that particular
  file:
 
  [[[
  $ svnserve -d -r c:/research/svn/experiment/repos
  Assertion failed: *ptr > buffer, file
  ..\..\..\subversion\libsvn_subr\svn_temp_serializer.c, line 282
 
  This application has requested the Runtime to terminate it in an unusual
 way.
  Please contact the application's support team for more information.
  ]]]
 
  Is there any way I can find more information about this failure, so I
  can help you diagnose the problem?
 

 s/assert/SVN_ERR_MALFUNCTION_NO_RETURN/



I think you meant SVN_ERR_ASSERT_NO_RETURN.
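
I.e., roughly this substitution in svn_temp_serializer.c (sketch only):

[[[
-  assert(*ptr > buffer);
+  SVN_ERR_ASSERT_NO_RETURN(*ptr > buffer);
]]]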

If I do that, I get:

[[[
$ svnserve -d -r c:/research/svn/experiment/repos
..\..\..\subversion\libsvn_fs_fs\fs_fs.c:2363: (apr_err=235000)
svn: In file '..\..\..\subversion\libsvn_subr\svn_temp_serializer.c' line
282: assertion failed (*ptr > buffer)

This application has requested the Runtime to terminate it in an unusual
way.
Please contact the application's support team for more information.

This application has requested the Runtime to terminate it in an unusual
way.
Please contact the application's support team for more information.
]]]


-- 
Johan


Re: svn commit: r990537 - /subversion/branches/performance/subversion/libsvn_subr/svn_temp_serializer.c

2010-08-30 Thread Johan Corveleyn
On Mon, Aug 30, 2010 at 12:05 PM, Stefan Fuhrmann 
stefanfuhrm...@alice-dsl.de wrote:

 Johan Corveleyn wrote:

 On Sun, Aug 29, 2010 at 12:32 PM, stefan2_at_apache.org wrote:
  Author: stefan2
  Date: Sun Aug 29 10:32:08 2010
  New Revision: 990537

  URL: http://svn.apache.org/viewvc?rev=990537&view=rev

  Log:
  Looking for the cause of Johan Corveleyn's crash (see
  http://svn.haxx.se/dev/archive-2010-08/0652.shtml), it
  seems that wrong / corrupted data contains backward
  pointers, i.e. negative offsets. That cannot happen if
  everything works as intended.

 I've just retried my test after this change (actually with
 performance-branch@990579, so updated just 10 minutes ago). Now I get
 the assertion error, after running log or blame on that particular
 file:

 [[[
 $ svnserve -d -r c:/research/svn/experiment/repos
 Assertion failed: *ptr > buffer, file
 ..\..\..\subversion\libsvn_subr\svn_temp_serializer.c, line 282

 This application has requested the Runtime to terminate it in an unusual
 way.
 Please contact the application's support team for more information.
 ]]]

  That is what I expected looking at the call stacks you posted.
 My preliminary analysis goes as follows:

 * The error seems to be limited to relatively rare occasions.
  That sufficiently excludes alignment issues and plainly wrong
  parameters / function calls.

 * It might be a (still rare) 32-bit-only issue.

 * There seems to be no miscast of types, i.e. the DAG node
  being read and causing the PF is actually a DAG node. Even
  if conflicting keys were used, the structure could still be read
  from the cache and would lead to some logic failure elsewhere.

 What else could it be? Most of the following are rather

 * concurrency issue
 * data corruption within the cache itself
 * some strange serialization issue that needs very specific data
  and / or 32 bit pointers to show up


  Is there any way I can find more information about this failure, so I
 can help you diagnose the problem?

  In fact there is. Just some questions:

 * You are the only one accessing the server and you use
  a single client process?


Yes. All on the same machine actually (my laptop). Accessing the server with
svn://localhost.


 * Does it happen if you log / blame the file for the first time
  and no other requests have been made to the server before?


Yes


 * Does a command line svn log produce some output
  before the crash? If so, is there something unusual happening
  around these revisions (branch replacement or so)?


Yes. Running svn log svn://localhost/trunk/some/path/bigfile.xml yields 969
of the 2279 log entries. From r95849 (last change to this file) down to
r42100. Then it suddenly stops.

I've checked r42100 with log -v, and it only mentions text modification of
bigfile.xml. Same goes for the previous and next revisions in which
bigfile.xml was affected (r42104 and r42042).




 Also, please verify that the crash gets triggered if the server is started
 with the following extra parameters:

 * -c0 -M0 -F0


No crash


 * -c0 -M0


No crash


 * -c0 -M1500 -F0


Crash (actually I did it with -M1000, because -M1500 would give me an Out of
memory immediately).


 * -c0 -M1500


Crash (with -M1000 that is)





 Just to be clear: the very same repos does not have this problem when
 accessed by a trunk svnserve.

 I thought so ;) To narrow down the nature of the problem,
 I added some checks that  should be able to discern plain
 data corruption from (de-)serialization issues. Please apply
 either the patch or replace the original files with the versions
 in the .zip file.

 A debug build should then, hopefully, trigger a different
 and more specific assertion.


Ok, will try that now.

-- 
Johan


Re: svn commit: r990537 - /subversion/branches/performance/subversion/libsvn_subr/svn_temp_serializer.c

2010-08-30 Thread Johan Corveleyn
On Mon, Aug 30, 2010 at 9:32 PM, Johan Corveleyn jcor...@gmail.com wrote:

 On Mon, Aug 30, 2010 at 12:05 PM, Stefan Fuhrmann 
 stefanfuhrm...@alice-dsl.de wrote:

 I thought so ;) To narrow down the nature of the problem,
 I added some checks that  should be able to discern plain
 data corruption from (de-)serialization issues. Please apply
 either the patch or replace the original files with the versions
 in the .zip file.

 A debug build should then, hopefully, trigger a different
 and more specific assertion.


 Ok, will try that now.


:-(, I updated to r990759 and applied your attached debug.patch, but I just
get the same assertion (offset by 4 lines), with no additional information:

[[[
$ svnserve -d -r c:/research/svn/experiment/repos
Assertion failed: *ptr > buffer, file
..\..\..\subversion\libsvn_subr\svn_temp_serializer.c, line 286
]]]

Anything else I can try?

FWIW, I do see a lot of warnings during compilation of cache-membuffer.c
(this is after applying your patch, so line-numbers should be interpreted
accordingly):

[[[
..\..\..\subversion\libsvn_subr\cache-membuffer.c(353): warning C4245: '=' :
conversion from 'int' to 'apr_uint64_t', signed/unsigned mismatch
..\..\..\subversion\libsvn_subr\cache-membuffer.c(384): warning C4245: '=' :
conversion from 'int' to 'apr_uint32_t', signed/unsigned mismatch
..\..\..\subversion\libsvn_subr\cache-membuffer.c(433): warning C4245:
'return' : conversion from 'int' to 'apr_uint32_t', signed/unsigned mismatch
..\..\..\subversion\libsvn_subr\cache-membuffer.c(546): warning C4244:
'initializing' : conversion from 'apr_uint64_t' to 'apr_size_t', possible
loss of data
..\..\..\subversion\libsvn_subr\cache-membuffer.c(656): warning C4244:
'initializing' : conversion from 'apr_uint64_t' to 'int', possible loss of
data
..\..\..\subversion\libsvn_subr\cache-membuffer.c(663): warning C4018: '='
: signed/unsigned mismatch
..\..\..\subversion\libsvn_subr\cache-membuffer.c(731): warning C4245: '=' :
conversion from 'int' to 'apr_uint32_t', signed/unsigned mismatch
..\..\..\subversion\libsvn_subr\cache-membuffer.c(732): warning C4245: '=' :
conversion from 'int' to 'apr_uint32_t', signed/unsigned mismatch
..\..\..\subversion\libsvn_subr\cache-membuffer.c(733): warning C4245: '=' :
conversion from 'int' to 'apr_uint32_t', signed/unsigned mismatch
..\..\..\subversion\libsvn_subr\cache-membuffer.c(736): warning C4244:
'function' : conversion from 'apr_uint64_t' to 'apr_size_t', possible loss
of data
..\..\..\subversion\libsvn_subr\cache-membuffer.c(737): warning C4305: 'type
cast' : truncation from 'apr_uint64_t' to 'unsigned char *'
..\..\..\subversion\libsvn_subr\cache-membuffer.c(771): warning C4018: '' :
signed/unsigned mismatch
..\..\..\subversion\libsvn_subr\cache-membuffer.c(773): warning C4245: '=' :
conversion from 'int' to 'apr_uint64_t', signed/unsigned mismatch
..\..\..\subversion\libsvn_subr\cache-membuffer.c(944): warning C4244: '=' :
conversion from 'apr_uint64_t' to 'apr_size_t', possible loss of data
..\..\..\subversion\libsvn_subr\cache-membuffer.c(946): warning C4305: 'type
cast' : truncation from 'apr_uint64_t' to 'char *'
]]]
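
As an aside: the last two warnings (truncating an apr_uint64_t to a
pointer type) are exactly the kind of thing that can only bite on a
32-bit build. Illustration only, not the actual cache-membuffer.c code:

[[[
#include <apr.h>

/* On a 32-bit build a pointer is 4 bytes wide, so this cast (what
 * warning C4305 complains about) silently drops any upper 32 bits of
 * the offset; on a 64-bit build it is harmless. */
static unsigned char *
offset_to_ptr(apr_uint64_t offset)
{
  return (unsigned char *)offset;
}
]]]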

Cheers,
-- 
Johan


Re: svn commit: r990537 - /subversion/branches/performance/subversion/libsvn_subr/svn_temp_serializer.c

2010-08-31 Thread Johan Corveleyn
On Tue, Aug 31, 2010 at 12:37 AM, Stefan Fuhrmann 
stefanfuhrm...@alice-dsl.de wrote:

 Johan Corveleyn wrote:

 On Mon, Aug 30, 2010 at 9:32 PM, Johan Corveleyn jcor...@gmail.com wrote:

 On Mon, Aug 30, 2010 at 12:05 PM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:

I thought so ;) To narrow down the nature of the problem,
I added some checks that  should be able to discern plain
data corruption from (de-)serialization issues. Please apply
either the patch or replace the original files with the versions
in the .zip file.

A debug build should then, hopefully, trigger a different
and more specific assertion.


Ok, will try that now.


 :-(, I updated to r990759 and applied your attached debug.patch, but I
 just get the same assertion (offset by 4 lines), with no additional
 information:

 [[[
 $ svnserve -d -r c:/research/svn/experiment/repos
 Assertion failed: *ptr > buffer, file
 ..\..\..\subversion\libsvn_subr\svn_temp_serializer.c, line 286
 ]]]

 Anything else I can try?

  The failure was expected. With the patched code, I just
 hope to get more context information.

 Could you send me the call stack dump? My suspicion is
 that the DAG node is already (partially) garbage before it
 actually gets added to the cache.


Ok, here it is in attachment (I removed the assert that you added in this
commit, so that the app would really crash; otherwise I couldn't get a crash
dump (or can I?)).

(I hope the attachments (file svn-crash...log.txt and svn-crash...dmp) get
through; if not, let me know and I'll zip them or something)

Some additional info:
- I couldn't reproduce the crash with a narrow range. Not even 9:0 would
crash it (right after startup).
- BUT: if after 9:0 I run log without -r arguments, I get an error on
the client side:
[[[
..\..\..\subversion\svn\log-cmd.c:746: (apr_err=160013)
..\..\..\subversion\libsvn_client\log.c:606: (apr_err=160013)
..\..\..\subversion\libsvn_repos\log.c:1474: (apr_err=160013)
svn: File not found: revision 90799, path '?\239'
..\..\..\subversion\libsvn_repos\log.c:372: (apr_err=160013)
svn: File not found: revision 90799, path '?\239'
..\..\..\subversion\libsvn_fs_fs\tree.c:3313: (apr_err=160013)
svn: File not found: revision 90799, path '?\239'
..\..\..\subversion\libsvn_fs_fs\tree.c:3313: (apr_err=160013)
svn: File not found: revision 90799, path '?\239'
..\..\..\subversion\libsvn_fs_fs\tree.c:3159: (apr_err=160013)
svn: File not found: revision 90799, path '?\239'
..\..\..\subversion\libsvn_fs_fs\tree.c:668: (apr_err=160013)
svn: File not found: revision 90799, path '?\239'
]]]
- This also happens when the first run is 6:0 or even 42000:0. If the
first run is 41000:0, then the second run doesn't get the client-side error,
but the server crashes on the expected spot (after rev 42100).
- The above client-side error also happens if the second run is 96000:9
instead of a log without -r argument.
- However, if I run log -r96000:9 right after startup, no problem.
- Other than that, it crashes reproducibly after 42100 if I run log with no
-r arguments right after startup.

Hmm, I hope you can figure this out ...

-- 
Johan

Process info:
Cmd line: 
C:\research\svn\client_build\svn_branch_performance\dist\bin\svnserve.exe -d -r 
c:/research/svn/experiment/repos
Working Dir: C:\research\svn\experiment\repos
Version:  1.7.0 (dev build), compiled Aug 19 2010, 00:50:40
Platform: Windows OS version 5.1 build 2600 Service Pack 3

Exception: ACCESS_VIOLATION

Registers:
eax=de12e56f ebx=00d31370 ecx=00dd3573 edx=00dd3573 esi=00f0f7c4 edi=00f0f810
eip=003baffc esp=00f0f784 ebp=00f0f784 efl=00010286
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=

Stacktrace:
#1  0x003baffc in svn_fs_fs__id_deserialize()
buffer = 0x00dd356f
id = 
#2  0x003bb598 in svn_fs_fs__noderev_deserialize()
buffer = 0x00dd3570
noderev_p = 
noderev = 
#3  0x003bfa4e in svn_fs_fs__dag_deserialize()
out = 
data = 
data_len = 1e0
pool = 
node = 
#4  0x004bc6fa in membuffer_cache_get()
cache = 
key = 
key_len = 31
item = 
deserializer = 
pool = 
buffer = 0x00dd3570 
group_index = 6587
to_find = 
entry = 
size = 1e0
#5  0x004bc3b4 in svn_membuffer_cache_get()
svn_err__temp = 
value_p = 
found = 0x00f0f8c8
cache_void = 0x00d6bc40
key = 
pool = 
full_key = 0x00dd3520
cache = 
full_key_len = 31
#6  0x004bdf93 in svn_cache__get()
value_p = 
found = 0x00f0f8c8
cache = 
key = 
pool = 
#7  0x003b1158 in dag_node_cache_get()
svn_err__temp = 
node_p = 
root = 
path = 
pool = 
node = 
cache = 
key = 
found = 0
#8

Re: svn commit: r990537 - /subversion/branches/performance/subversion/libsvn_subr/svn_temp_serializer.c

2010-08-31 Thread Johan Corveleyn
Stefan,

On Tue, Aug 31, 2010 at 10:07 PM, Johan Corveleyn jcor...@gmail.com wrote:
 Some additional info:
 - I couldn't reproduce the crash with a narrow range. Not even 9:0 would
 crash it (right after startup).
 - BUT: if after 9:0 I run log without -r arguments, I get an error on
 the client side:
 [[[
 ..\..\..\subversion\svn\log-cmd.c:746: (apr_err=160013)
 ..\..\..\subversion\libsvn_client\log.c:606: (apr_err=160013)
 ..\..\..\subversion\libsvn_repos\log.c:1474: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ..\..\..\subversion\libsvn_repos\log.c:372: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ..\..\..\subversion\libsvn_fs_fs\tree.c:3313: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ..\..\..\subversion\libsvn_fs_fs\tree.c:3313: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ..\..\..\subversion\libsvn_fs_fs\tree.c:3159: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ..\..\..\subversion\libsvn_fs_fs\tree.c:668: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ]]]
 - This also happens when the first run is 6:0 or even 42000:0. If the
 first run is 41000:0, then the second run doesn't get the client-side error,
 but the server crashes on the expected spot (after rev 42100).
 - The above client-side error also happens if the second run is 96000:9
 instead of a log without -r argument.
 - However, if I run log -r96000:9 right after startup, no problem.
 - Other than that, it crashes reproducibly after 42100 if I run log with no
 -r arguments right after startup.

I experimented some more. There must be something strange with that
revision 90799 that also causes the client-side error.

Some log runs immediately after startup:
- svn log -r90798:0 svn://localhost/path/bigfile.xml: no crash
- svn log -r90799:0 svn://localhost/path/bigfile.xml: crash (last
log entry: 42104 (one before 42100 of the regular crash))
- svn log -r90921:0 svn://localhost/path/bigfile.xml: crash (last
log entry: 42130 (two before 42100 of the regular crash)). r90921 is
one before 90799.
- svn log -r90998:0 svn://localhost/path/bigfile.xml: crash (last
log entry: 42149 (three before 42100 of the regular crash)). r90998
is two before 90799.
- svn log svn://localhost/path/bigfile.xml: still crashes
consistently with last log entry 42100.

Still r90799 itself seems a very normal commit, with only text
modifications to bigfile.xml.

One more note: the repository was once created by converting our CVS
repository with cvs2svn (it's an old conversion that we did as an
experiment, after which we did the real conversion; but I still use
the old converted repo to test things). I just now notice that we did
that old conversion with the cvs-revnum option, i.e. updating the
cvs2svn:cvs-rev property on every commit, to make it contain the cvs
revision number of the file. So every commit also contains prop
changes. Probably not relevant, but you never know :-).

Cheers,
-- 
Johan


Worried about single-db performance

2010-09-03 Thread Johan Corveleyn
Hi devs,

From what I understand about the performance problems of WC-1 vs.
WC-NG, and what I'm reading on this list, I expect(ed) a huge
performance boost from WC-NG for certain client operations (especially
on Windows, where the locking of WC-1 is quite problematic). Also, I
knew I had to wait for single-db to see any real performance benefits.
So after you guys switched on single-db, I eagerly gave trunk a
spin... Now I'm a little worried, because I don't see any great speed
increases (quite the contrary). Some details below.

Maybe it's just too early to be looking at this (maybe it's a simple
matter of optimizing the data model, adding indexes, optimizing some
code paths, ...). So it's fine if you guys say: chill, just wait a
couple more weeks. I just need to know whether I should be worried or
not :-).

Some details ...

Setup:
- Win XP 32-bit client machine, with antivirus switched off.
- Single-db client: trunk@992042 (yesterday), release build with VCE 2008
- 1.6 client: 1.6.5 binary from tigris.org that I still had lying around.
- Medium size working copy (944 dirs, 10286 files), once checked out
with the 1.6 client (WC-1), once checked out with the trunk-single-db
client.
- 1st run means after reboot, 2nd run means immediately after 1st run.

Numbers:

1) Status (svn st)

1.6 client 1st run:
real0m41.593s
user0m0.015s
sys 0m0.015s

1.6 client 2nd run:
real0m1.203s
user0m0.015s
sys 0m0.031s

Single-db client 1st run:
real0m34.984s
user0m0.015s
sys 0m0.031s

Single-db client 2nd run:
real0m6.938s
user0m0.015s
sys 0m0.031s


2) Update (no changes, wc already up to date) (svn up)

1.6 client 1st run:
real0m38.484s
user0m0.015s
sys 0m0.015s

1.6 client 2nd run:
real0m1.141s
user0m0.015s
sys 0m0.015s

Single-db client 1st run:
real0m31.375s
user0m0.015s
sys 0m0.031s

Single-db client 2nd run:
real0m5.468s
user0m0.031s
sys 0m0.015s


Anyone able to take away my worries :-) ?

Cheers,
-- 
Johan


Re: svn commit: r990537 - /subversion/branches/performance/subversion/libsvn_subr/svn_temp_serializer.c

2010-09-04 Thread Johan Corveleyn
On Tue, Aug 31, 2010 at 11:35 PM, Johan Corveleyn jcor...@gmail.com wrote:
 Stefan,

 On Tue, Aug 31, 2010 at 10:07 PM, Johan Corveleyn jcor...@gmail.com wrote:
 Some additional info:
 - I couldn't reproduce the crash with a narrow range. Not even 9:0 would
 crash it (right after startup).
 - BUT: if after 9:0 I run log without -r arguments, I get an error on
 the client side:
 [[[
 ..\..\..\subversion\svn\log-cmd.c:746: (apr_err=160013)
 ..\..\..\subversion\libsvn_client\log.c:606: (apr_err=160013)
 ..\..\..\subversion\libsvn_repos\log.c:1474: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ..\..\..\subversion\libsvn_repos\log.c:372: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ..\..\..\subversion\libsvn_fs_fs\tree.c:3313: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ..\..\..\subversion\libsvn_fs_fs\tree.c:3313: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ..\..\..\subversion\libsvn_fs_fs\tree.c:3159: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ..\..\..\subversion\libsvn_fs_fs\tree.c:668: (apr_err=160013)
 svn: File not found: revision 90799, path '?\239'
 ]]]
 - This also happens when the first run is 6:0 or even 42000:0. If the
 first run is 41000:0, then the second run doesn't get the client-side error,
 but the server crashes on the expected spot (after rev 42100).
 - The above client-side error also happens if the second run is 96000:9
 instead of a log without -r argument.
 - However, if I run log -r96000:9 right after startup, no problem.
 - Other than that, it crashes reproducibly after 42100 if I run log with no
 -r arguments right after startup.

 I experimented some more. There must be something strange with that
 revision 90799 that also causes the client-side error.

 Some log runs immediately after startup:
 - svn log -r90798:0 svn://localhost/path/bigfile.xml: no crash
 - svn log -r90799:0 svn://localhost/path/bigfile.xml: crash (last
 log entry: 42104 (one before 42100 of the regular crash))
 - svn log -r90921:0 svn://localhost/path/bigfile.xml: crash (last
 log entry: 42130 (two before 42100 of the regular crash)). r90921 is
 one before 90799.
 - svn log -r90998:0 svn://localhost/path/bigfile.xml: crash (last
 log entry: 42149 (three before 42100 of the regular crash)). r90998
 is two before 90799.
 - svn log svn://localhost/path/bigfile.xml: still crashes
 consistently with last log entry 42100.

 Still r90799 itself seems a very normal commit, with only text
 modifications to bigfile.xml.

Stefan,

If needed, I could send some individual rev files (like r90799 and
42100) via private email. If that would help in finding the problem
...

Cheers,
-- 
Johan


Re: svn commit: r990537 - /subversion/branches/performance/subversion/libsvn_subr/svn_temp_serializer.c

2010-09-08 Thread Johan Corveleyn
On Wed, Sep 8, 2010 at 12:41 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Wed, Sep 8, 2010 at 12:00 PM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:
 Johan Corveleyn wrote:

 On Mon, Sep 6, 2010 at 2:10 AM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:


 The only way to get closer to understanding the nature of the
 problem: I've added extensive logging code to the cache.
 Please run svnserve -M1000 -d --foreground -r ... > svnserve.log,
 zip the log file after the program crashed & got terminated and
 send it to me directly.


 Ok, here it is in attachment, zipped together with the .log and .dmp
 files from the crash.

 The exact log command I executed was:
 $ svn log
 svn://localhost/trunk/sourcesEJB/uz/s9server/services/toegang/driver/settings.xml

 Cheers,


 Thanks Johan!

 I finally found the problem and it actually was more likely
 to hit 32 bit systems than 64 bit ones. The relevant part
 from the log is this:

 create cache 1 prefix =
 fsfs:8d52877e-3a3d-4b59-a976-c8364f526eee/C:/research/svn/experiment/repos/db:RRI
 create cache 2 prefix =
 fsfs:8d52877e-3a3d-4b59-a976-c8364f526eee/C:/research/svn/experiment/repos/db:DAG
 create cache 3 prefix =
 fsfs:8d52877e-3a3d-4b59-a976-c8364f526eee/C:/research/svn/experiment/repos/db:DIR
 create cache 4 prefix =
 fsfs:8d52877e-3a3d-4b59-a976-c8364f526eee/C:/research/svn/experiment/repos/db:PACK-MANIFEST
 create cache 5 prefix =
 fsfs:8d52877e-3a3d-4b59-a976-c8364f526eee/C:/research/svn/experiment/repos/db:TEXT
 create cache 6 prefix =
 fsfs:8d52877e-3a3d-4b59-a976-c8364f526eee/C:/research/svn/experiment/repos/db:NODEREVS
 get 6: key=P+, Y3 MD5=52408b02cc58866e204010717f31c0ef hash=42680402
 (0,257607) not found
 set 6: key=P+, Y3 MD5=52408b02cc58866e204010717f31c0ef hash=42680402
 (0,257607) 4...@2113312
 get 6: key=P+, Y3 MD5=52408b02cc58866e204010717f31c0ef hash=42680402
 (0,257607) 4...@2113312
 get 2: key=2( /trunk/sourcesEJB/uz/s9server
 MD5=52408b029f15368c2de222f27705f679 hash=42680402 (0,257607) 4...@2113312
 crash

 As you can see, bucket (0,257607) contains a NODEREV, but
 the last get will read it as a DAG node. It only tested the first
 4 bytes of the MD5 key hash because the to_find variable
 in find_entry used to be a local array and later became a
 pointer. sizeof(to_find) was now 4 .. 8 instead of 16.

 r994956 should fix the problem.

 That's great news! I'll give it a spin tonight.

Yes! My tests pass now, there are no more crashes. It all works fine.
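
For anyone skimming this thread later: the pitfall Stefan describes is the
classic one where sizeof() silently changes meaning once an array parameter
decays to a pointer. A minimal stand-alone illustration (not the actual
cache code):

[[[
#include <stdio.h>

/* In a parameter list, "unsigned char to_find[16]" is really just
   "unsigned char *to_find", so sizeof yields the pointer size. */
static void as_parameter(const unsigned char to_find[16])
{
  printf("sizeof inside function: %u\n", (unsigned)sizeof(to_find)); /* 4 or 8 */
}

int main(void)
{
  unsigned char to_find[16];  /* e.g. an MD5 digest */

  printf("sizeof local array:     %u\n", (unsigned)sizeof(to_find)); /* 16 */
  as_parameter(to_find);
  return 0;
}
]]]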

Cheers,
-- 
Johan


Re: [Issue 3474] making a new subdir, moving files into it and then renaming the subdir, breaks history of the moved files

2010-09-10 Thread Johan Corveleyn
On Fri, Sep 10, 2010 at 11:45 PM,  joha...@tigris.org wrote:
 http://subversion.tigris.org/issues/show_bug.cgi?id=3474

 --- Additional comments from joha...@tigris.org Fri Sep 10 14:45:17 -0700 
 2010 ---
 This issue seems to be fixed on trunk. The described scenario now goes as 
 follows:

 (assuming we're in a working copy with a versioned file a.txt)
 [[[
 $ svn mkdir subdir
 A         subdir             # same as in 1.6

 $ svn mv a.txt subdir
 A         subdir\a.txt
 D         a.txt              # same as in 1.6

 $ svn st
 A       subdir
 A  +    subdir\a.txt
 D       a.txt                # same as in 1.6

 $ svn mv subdir subdir2
 A         subdir2
 D         subdir\a.txt
 D         subdir             # different! will ask on dev list about this.

Is the above output an expected change of behavior? Previously (in
1.6) the above action generated the following output (as can be seen
in the original description of issue 3474):

[[[
$ svn mv subdir/ subdir2
A subdir2
A subdir2\a.txt
D subdir
]]]

It's no problem for me (the result is perfectly fine), just wondering ...

In fact, I never quite understood the output in 1.6 (why are all
children of a moved directory shown as deleted, and not as added in
the new dir?). But I don't understand it now either (I can reverse the
question). Is there any rational explanation, any underlying
reasoning, principles, ... ?

Cheers,
--
Johan


Wrong reply-to address in mails from issue tracker

2010-09-10 Thread Johan Corveleyn
Hi,

The mails coming from the SVN issue tracker (on the issues mailing
list) still contain the old d...@s.t.o address as reply-to. Could
someone change this to the d...@s.a.o address?

Cheers,
-- 
Johan


Re: svn diff optimization to make blame faster?

2010-09-15 Thread Johan Corveleyn
On Sun, Sep 5, 2010 at 1:53 AM, Johan Corveleyn jcor...@gmail.com wrote:
 On Tue, Aug 24, 2010 at 11:11 AM, Philip Martin
 philip.mar...@wandisco.com wrote:
 Johan Corveleyn jcor...@gmail.com writes:
 On Sun, Aug 22, 2010 at 4:02 PM, Branko Čibej br...@xbc.nu wrote:
 svn_diff uses basically the same algorithm as GNU diff but implemented
 slightly differently and IIRC it doesn't have some of GNU diff's
 optimizations. I'm sure it can be speeded up, but haven't a clue about
 how much.

 Ok, thanks. In the meantime I saw that there is not that much
 difference anymore between GNU diff and svn_diff, after running the
 latter from a release build, and disabling my anti-virus (which makes
 me wonder why my anti-virus slows down svn_diff (impact when opening
 the modified datasource), but not on GNU diff).

 The big difference between Subversion's diff and GNU diff is that GNU
 uses heuristics to cut short the diff algorithm whereas Subversion
 grinds on to the end.  Certain files trigger the heuristics and then
 GNU diff is much faster than Subversion.  Typically the file has a
 large number of matches and non-matches repeated through the file,
 machine generated files sometimes fit this pattern.

 GNU diff's heuristics work well so when they trigger the resulting
 diff is usually good enough.  They can be disabled using the --minimal
 option and using that makes GNU diff performance much more like
 Subversion.

 Hmm, I thought harder about this. If I try --minimal with GNU diff,
 it's just as fast, still significantly faster than svn diff.

 Then I read a little bit more about diff algorithms, like the blog
 entry that Greg Hudson suggested in the other thread [1], about the
 Patience Diff algorithm [2]. In there the following two lines caught
 my eye, first steps in the diff algo:

 [[[
 1) Match the first lines of both if they're identical, then match the
 second, third, etc. until a pair doesn't match.
 2) Match the last lines of both if they're identical, then match the
 next to last, second to last, etc. until a pair doesn't match.
 ]]]
 (then the longest common subsequence algorithm kicks in)

 Now, these steps are not specific to Patience Diff, they are common to
 most diff algorithms. And I bet it's much more efficient to exclude
 the common prefix/suffix this way than to include them in the
 entire lcs algorithm.

 Right now, svn diff doesn't do this AFAICS. However, GNU diff does,
 and I think that's where most of the performance difference comes
 from. For big files, where a couple of lines are modified somewhere in
 the middle (and the first 3 and last 3 lines are identical), I
 think this can make a big difference.

 Thoughts?

 If no-one else fancies this, I think I'll take a stab at implementing
 this optimization (first with a POC to see if it's significant).
 Unless feedback tells me it's not worth it, not a good idea, ...

Some update on this: I have implemented this for svn_diff (excluding
the identical prefix and suffix of both files, and only then starting
to fill up the token tree and let the lcs-algorithm do its thing). It
makes a *huge* difference. On my bigfile.xml (1.5 Mb) with only one
line changed, the call to svn_diff_diff is ~10 times faster (15-20 ms
vs. 150-170 ms).
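
To make the idea a bit more concrete, the core of the optimization is
roughly the following (a simplified sketch on in-memory buffers, with
made-up names; the real code in diff_file.c works on chunked reads and
also counts newlines so that only whole lines get stripped):

[[[
#include <apr.h>

/* Simplified sketch: find the length of the identical prefix and of the
   identical suffix of two buffers. The suffix scan is not allowed to
   overlap the prefix. Line-boundary handling is omitted here. */
static void
scan_common_affixes(const char *a, apr_size_t a_len,
                    const char *b, apr_size_t b_len,
                    apr_size_t *prefix_len, apr_size_t *suffix_len)
{
  apr_size_t min_len = a_len < b_len ? a_len : b_len;
  apr_size_t p = 0, s = 0;

  /* Identical prefix: scan forward until the first mismatching byte. */
  while (p < min_len && a[p] == b[p])
    p++;

  /* Identical suffix: scan backward from both ends. */
  while (s < min_len - p && a[a_len - 1 - s] == b[b_len - 1 - s])
    s++;

  *prefix_len = p;
  *suffix_len = s;
}
]]]

Only the part between prefix and suffix then has to be tokenized and run
through the LCS algorithm.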

This means blame is much faster for such big files as well, because
diff accounts for 90% of the client-side work of blame. In most cases
the client cpu then won't be the bottleneck anymore; blame becomes more
constrained by the server side and the network. In my tests, blame
was about twice as fast for bigfile.xml (since all my tests were done
with client and server on the same machine, I'm guessing the server,
plus the fact that both run on the same machine, is now the bottleneck).
Same factor of performance increase for a medium-sized file (a java
source file of 300 Kb).

The patch is still very crude, with lots of code duplication, and
everything stuffed in one big function (and I may have overlooked some
edge conditions), so I don't feel it's ready to post yet (don't want
to waste people's time :)). I'll try to clean it up as soon as I can
and post a reasonable version to the list.

Some notes already:
- I basically wrote a new function datasources_open in diff_file.c
(similar to datasource_open), that opens both files (original and
modified), and compares bytes from the beginning until the first
nonmatching byte (counting newlines when it sees them), and backwards
from the end to find the first (or last, depending how you look at it)
nonmatching byte. The number of prefix_lines is then used to determine
the starting offset of the tokens (lines) that will later be
extracted.
- I think I should generalize this function to take an array of
datasources, and not just two, so it can be used by diff3 and diff4 as
well. But for now I didn't want to take on that extra complication.
- The code is complicated by the fact that the svn_diff code
(diff_file.c) works with chunks of data (doesn't read the entire
file(s

Re: svn diff optimization to make blame faster?

2010-09-20 Thread Johan Corveleyn
On Mon, Sep 20, 2010 at 11:52 AM, Branko Čibej br...@xbc.nu wrote:
  On 15.09.2010 14:20, Johan Corveleyn wrote:
 Some update on this: I have implemented this for svn_diff (excluding
 the identical prefix and suffix of both files, and only then starting
 to fill up the token tree and let the lcs-algorithm do its thing). It
 makes a *huge* difference. On my bigfile.xml (1.5 Mb) with only one
 line changed, the call to svn_diff_diff is ~10 times faster (15-20 ms
 vs. 150-170 ms).


 Hmmm ... looks to me like test data tailored to the optimization. :)

Nope, that's real data from a real repository, with a normal kind of
change that happens here every day.

Of course this optimization is most effective if there are a lot of
common prefix/suffix lines. If there is a single change in the first
line, and a single change in the last one, this optimization will do
nothing but introduce a little bit of extra overhead. And it will
obviously make the most impact on large files (in fact it's just
relative to the ratio of the number of common prefix/suffix lines to
the number of lines in between).

I'm sorry it takes me longer than expected to post a version of this
to the list, but I'm still having some problems with a couple of edge
conditions (I'm learning C as I go, and I'm struggling with a couple
of pointer calculations/comparisons). I plan to post something during
this week...

Cheers,
-- 
Johan


[WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-09-26 Thread Johan Corveleyn
Hi devs,

As discussed in [1], here is a patch that makes svn_diff_diff
(libsvn_diff/diff.c) skip the identical prefix and suffix of the
original and modified files, before starting the LCS (longest common
subsequence) algorithm on the non-matching part.

This makes diff a lot faster (especially for large files), thereby
also speeding up blame in environments where the client is the
bottleneck (with a fast enough server and network, blame is
constrained by the speed of diff on the client side).

This patch still has some rough edges (see below), hence the WIP
marker. Nevertheless, it works most of the time, and it can
demonstrate the speed improvement that can be expected. I will
continue working on the rough edges, but in the meantime any
feedback/review/thoughts are very much appreciated.

The rough edges:
1) While scanning through the identical prefix, I have to count the
number of lines (to have correct line offsets for the rest of the
algorithm). To do that, I currently count the number of \n's, so it
won't work for files with \r eol-style. Not so difficult to fix
(similar to how diff_file.c#datasource_get_next_token does it), but I
haven't done it yet.

2) Two tests still fail:
- blame_tests.py 2: annotate a binary file
- blame_tests.py 7: blame with different eol styles
Maybe this is because of 1), I'll have to investigate.

3) It's only implemented for 2 files. I'd like to generalize this for
an array of datasources, so it can also be used for diff3 and diff4.

4) As a small hack, I had to add a flag datasource_opened to
token.c#svn_diff__get_tokens, so I could have different behavior for
regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
no longer needed.

5) I've struggled with making the code readable/understandable. It's
likely that there is still a lot of room for improvement. I also
probably need to document it some more.

6) I've not paid too much attention to low level optimizations, so
here also there's probably room for improvement, which may be
significant for the critical loops.

7) There are two warnings left: conversion from 'apr_off_t' to 'int',
in diff_file.c#find_identical_suffix. To completely eliminate this, I
should probably make all chunks of type apr_off_t instead of int
(but I have not done that yet, because the original implementation
also used int for the chunk in the svn_diff__file_baton_t struct).
Should I do this (also in the baton struct)? Or should I use an
explicit cast somewhere?

8) A bigger problem: the output of diff/blame is sometimes different
from the original implementation. It's always syntactically correct,
but sometimes the less unique lines are taken differently (like when
a new function is added, and diff thinks the new block starts from the
closing brace of the previous function, ... stuff like that). This
happens only because of the identical-suffix-scanning (it doesn't
happen if I only enable the identical-prefix-scanning).

I think I know the cause of this change in behavior: I completely
eliminate the suffix from the LCS-building algorithm. And that
probably causes it to sometimes come up with another longest common
subsequence. Right now I don't know how to solve this completely. A
workaround might be to add a certain number of suffix-lines to the
non-matching-block, so they can be part of the LCS-searching. But
probably one can always come up with examples where the number of
suffix-lines is not enough. I'll have to look into this some more.
Ideas welcome :-).
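
As a side note on rough edge 1): counting line endings in an eol-agnostic
way boils down to something like this (a hypothetical helper for
illustration, not the actual diff_file.c code):

[[[
#include <apr.h>

/* Count line endings in BUF of length LEN, treating "\r\n", a lone
   "\r" and a lone "\n" each as exactly one line ending. */
static apr_off_t
count_line_endings(const char *buf, apr_size_t len)
{
  apr_off_t lines = 0;
  apr_size_t i;

  for (i = 0; i < len; i++)
    {
      if (buf[i] == '\n')
        lines++;
      else if (buf[i] == '\r')
        {
          lines++;
          if (i + 1 < len && buf[i + 1] == '\n')
            i++;  /* consume the '\n' of a "\r\n" pair */
        }
    }
  return lines;
}
]]]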

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument datasource_opened, to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument datasource_opened. Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of 

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-09-28 Thread Johan Corveleyn
Hi Daniel,

Thanks for the feedback.

On Tue, Sep 28, 2010 at 4:11 PM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 Index: subversion/include/svn_diff.h
 ===
 --- subversion/include/svn_diff.h     (revision 1001548)
 +++ subversion/include/svn_diff.h     (working copy)
 @@ -112,6 +112,11 @@
 (personally I prefer 'svn diff -x-p' to show the function name here)

Ok, will do next time.

    svn_error_t *(*datasource_open)(void *diff_baton,
                                    svn_diff_datasource_e datasource);

 +  /** Open the datasources of type @a datasources. */
 +  svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines,
 +                                   svn_diff_datasource_e datasource0,
 +                                   svn_diff_datasource_e datasource1);
 +

 So, you're extending the svn_diff_fns_t struct, which is defined in
 a public header.

 It's a public struct with no constructor function, so I believe you have
 to revv it (into svn_diff_fns2_t) in order to extend it (for binary
 compatibility: people allocating this struct and then using a newer
 Subversion library at runtime).

 If it did have a constructor function, you'd have to extend it only at
 the end, and even then make sure that if the added elements are NULL (eg
 because an old caller didn't know to fill them) then everything still
 worked.

 Probably more important to get the logic right than to revv it right
 away, though; the latter is a SMOP.

Doh! I'm sure that observation was in the back of my head somewhere,
but I forgot about it while working on the solution. Anyway, you're
right: there is first some more work to get the algorithm right.

I've had some progress:
- The blame_tests.py now all pass (tests 2 and 7 provoked a core
dump). That makes all tests pass. However, although I fixed the
coredump, I don't fully understand the root cause (why they ended up
in the situation where they ended up). So I'm going to study that
first a bit more.
- I have now included support for files with \r eol-style.

I'll send a new version of the patch shortly.

I'm also thinking that I could try to take advantage of -x-b/-x-w or
-x--ignore-eol-style options to make it even faster (right now the
prefix/suffix scanning will stop at the first difference, regardless
if it's a whitespace or eol difference that could/should be ignored).

However, I'm not sure if I should implement this myself, as part of
the find_identical_prefix and find_identical_suffix functions, or
switch to the usage of datasource_get_next_token (which is the
function that is used by the real diff algorithm to extract the
lines, and which uses util.c#svn_diff__normalize_buffer to normalize
whitespace and eol's if needed).

Right now, I don't read entire lines (tokens) but compare each byte as
I go. But I could do it line-based as well (extract line from file1,
extract line from file2, memcmp lines). I would have to make the
calculation of the adler32 hash in datasource_get_next_token
conditional on some boolean, or factor out the part of the function
that's useful to me into a new static function.

There is one caveat to this approach: I'm not sure if it would work
backwards (for suffix scanning). Well, the normalization function
wouldn't have to be changed, but the extraction of lines would have to
go backward. Surely it's possible, but I have no idea how much I'd
have to change the code in get_next_token to get lines backwards...

I'm also not sure if one would be (significantly) faster than the
other: comparing byte-by-byte while going through both files, or
extracting entire lines and then comparing the lines.
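
For what it's worth, the backward extraction itself doesn't look like the
hard part; finding where the previous line starts is roughly this
(illustrative sketch, not the token.c code):

[[[
#include <apr.h>

/* Given BUF and an offset END just past the eol marker of some line,
   return the offset at which that line starts: one past the previous
   line ending, or 0 at the start of the buffer. */
static apr_size_t
line_start_before(const char *buf, apr_size_t end)
{
  /* Skip the eol marker of the line we are about to extract. */
  if (end > 0 && buf[end - 1] == '\n')
    end--;
  if (end > 0 && buf[end - 1] == '\r')
    end--;

  /* Scan back to the previous line ending. */
  while (end > 0 && buf[end - 1] != '\n' && buf[end - 1] != '\r')
    end--;
  return end;
}
]]]

The extracted range could then be normalized and memcmp'd against its
counterpart from the other file.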

Thoughts?

Cheers,
-- 
Johan


Re: Interrupting an update after change of externals causes corrupt working copy

2010-09-30 Thread Johan Corveleyn
Hi devs,

As per Daniel Shahaf's suggestion, I'm continuing this discussion on
the dev list. See the previous mails in this thread on the users list
for some context (or
http://svn.haxx.se/users/archive-2010-09/0406.shtml). I'll summarize
below.

This issue reproduces with 1.6.12 as well as with trunk. It would be
interesting to know the WC-NG folks take on this.

Note: the subject line may be slightly exaggerated, as there seems to
be a good workaround to repair things (at least in case of
intra-repository externals).

Summary
---

Steps to reproduce:
1) Change a directory external definition.

2) svn update

3) Interrupt (Ctrl-C) while it is processing the switch of the
external. I did this after some files of the external were already
Updated.

Result:
svn status now shows a lot of files and directories as S (switched),
and some directories missing. Running svn info on the switched
files/dirs shows that they still point to the old external location.

Running svn update again fixes the missing dirs, but not the
switched nodes. svn revert does nothing (which is logical: there are
no local changes). svn cleanup does nothing.

Workaround:
Running svn switch $new_url $external_dir brings everything back to normal.


So:
- Wouldn't it be better if svn update could repair this (as in
resuming the interrupted update from before)?
- Should I file an issue for this?
- I'm not sure how all this plays out with externals from remote
repositories (haven't tested this). With or without a change of the
remote location. Would an svn switch --relocate fix things in the
same way then?

Note: this can be easily reproduced with any large (public) repository
(I think). Just define some externals in working copy only (no need to
commit), update, change external definition (no need to commit),
update, interrupt.

Cheers,
-- 
Johan


trunk failing tests on Windows XP (32 bit): prop-tests.py 33, stat-tests.py 5, upgrade-tests.py 11

2010-09-30 Thread Johan Corveleyn
Hi devs,

The following tests fail on my machine (Windows XP 32-bit, built with
VCE 2008, ra_local):

- prop-tests.py 33: test properties of obstructed subdirectories
svn: Can't open directory
'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdline\svn-test-work\working_copies\prop_tests-33\A\C':
The directory name is invalid.

- stat-tests.py 5: status on versioned items whose type has changed
svn: Can't open directory
'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdline\svn-test-work\working_copies\stat_tests-5\A':
The directory name is invalid.

- upgrade_tests.py 11: missing directories and obstructing files
svn: Can't open directory
'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdline\svn-test-work\working_copies\upgrade_tests-11\A\D':
The directory name is invalid.

They all seem to try to open some path as a directory, but it isn't a
directory (but an empty file or something).

Am I the only one seeing this? Is this known/normal?

Please find the fail log files in attachment.

Cheers,
-- 
Johan
[[[
CMD: svnadmin.exe create svn-test-work\repositories\prop_tests-33 
--bdb-txn-nosync --fs-type=fsfs
TIME = 0.14
CMD: svnadmin.exe dump svn-test-work\local_tmp\repos | svnadmin.exe load 
svn-test-work\repositories\prop_tests-33 --ignore-uuid
TIME = 0.016000
CMD: svn.exe co 
file:///C:/research/svn/client_build/svn_trunk/Release/subversion/tests/cmdline/svn-test-work/repositories/prop_tests-33
 svn-test-work\working_copies\prop_tests-33 --config-dir 
C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdline\svn-test-work\local_tmp\config
 --password rayjandom --no-auth-cache --username jrandom
TIME = 0.734000
A    svn-test-work\working_copies\prop_tests-33\A
A    svn-test-work\working_copies\prop_tests-33\A\B
A    svn-test-work\working_copies\prop_tests-33\A\B\lambda
A    svn-test-work\working_copies\prop_tests-33\A\B\E
A    svn-test-work\working_copies\prop_tests-33\A\B\E\alpha
A    svn-test-work\working_copies\prop_tests-33\A\B\E\beta
A    svn-test-work\working_copies\prop_tests-33\A\B\F
A    svn-test-work\working_copies\prop_tests-33\A\mu
A    svn-test-work\working_copies\prop_tests-33\A\C
A    svn-test-work\working_copies\prop_tests-33\A\D
A    svn-test-work\working_copies\prop_tests-33\A\D\gamma
A    svn-test-work\working_copies\prop_tests-33\A\D\G
A    svn-test-work\working_copies\prop_tests-33\A\D\G\pi
A    svn-test-work\working_copies\prop_tests-33\A\D\G\rho
A    svn-test-work\working_copies\prop_tests-33\A\D\G\tau
A    svn-test-work\working_copies\prop_tests-33\A\D\H
A    svn-test-work\working_copies\prop_tests-33\A\D\H\chi
A    svn-test-work\working_copies\prop_tests-33\A\D\H\omega
A    svn-test-work\working_copies\prop_tests-33\A\D\H\psi
A    svn-test-work\working_copies\prop_tests-33\iota
Checked out revision 1.
CMD: svn.exe propset red blue svn-test-work\working_copies\prop_tests-33\A\C 
--config-dir 
C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdline\svn-test-work\local_tmp\config
 --password rayjandom --no-auth-cache --username jrandom
TIME = 0.11
property 'red' set on 'svn-test-work\working_copies\prop_tests-33\A\C'
CMD: svn.exe proplist --verbose --xml 
svn-test-work\working_copies\prop_tests-33\A 
svn-test-work\working_copies\prop_tests-33\A\B\E\beta 
svn-test-work\working_copies\prop_tests-33\A\D 
svn-test-work\working_copies\prop_tests-33\A\C 
svn-test-work\working_copies\prop_tests-33\A\B 
svn-test-work\working_copies\prop_tests-33\A\D\G\tau 
svn-test-work\working_copies\prop_tests-33\A\D\G\rho 
svn-test-work\working_copies\prop_tests-33\A\D\G\pi 
svn-test-work\working_copies\prop_tests-33\A\D\H\psi 
svn-test-work\working_copies\prop_tests-33\A\mu 
svn-test-work\working_copies\prop_tests-33\A\D\G 
svn-test-work\working_copies\prop_tests-33\A\D\H\omega 
svn-test-work\working_copies\prop_tests-33\A\D\H\chi 
svn-test-work\working_copies\prop_tests-33\A\B\F 
svn-test-work\working_copies\prop_tests-33\A\B\E 
svn-test-work\working_copies\prop_tests-33\A\B\E\alpha 
svn-test-work\working_copies\prop_tests-33\A\B\lambda 
svn-test-work\working_copies\prop_tests-33\A\D\H 
svn-test-work\working_copies\prop_tests-33\iota 
svn-test-work\working_copies\prop_tests-33\A\D\gamma 
svn-test-work\working_copies\prop_tests-33 --config-dir 
C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdline\svn-test-work\local_tmp\config
 --password rayjandom --no-auth-cache --username jrandom
TIME = 0.125000
<?xml version="1.0"?>
<properties>
<target
   path="svn-test-work\working_copies\prop_tests-33\A\C">
<property
   name="red">blue</property>
</target>
</properties>
CMD: svn.exe proplist --verbose --xml 
svn-test-work\working_copies\prop_tests-33\A 
svn-test-work\working_copies\prop_tests-33\A\B\E\beta 
svn-test-work\working_copies\prop_tests-33\A\D 
svn-test-work\working_copies\prop_tests-33\A\B\E\alpha 
svn-test-work\working_copies\prop_tests-33\A\B 
svn-test-work\working_copies\prop_tests-33\A\D\G\tau 

Re: trunk failing tests on Windows XP (32 bit): prop-tests.py 33, stat-tests.py 5, upgrade-tests.py 11

2010-10-01 Thread Johan Corveleyn
On Fri, Oct 1, 2010 at 3:45 PM, Paul Burba ptbu...@gmail.com wrote:
 On Thu, Sep 30, 2010 at 8:10 PM, Bert Huijben b...@qqmail.nl wrote:
 -Original Message-
 From: Johan Corveleyn [mailto:jcor...@gmail.com]
 Sent: vrijdag 1 oktober 2010 1:51
 To: Subversion Development
 Subject: trunk failing tests on Windows XP (32 bit): prop-tests.py 33,
 stat-tests.py 5, upgrade-tests.py 11

 Hi devs,

 The following tests fail on my machine (Windows XP 32-bit, built with
 VCE 2008, ra_local):

 - prop-tests.py 33: test properties of obstructed subdirectories
 svn: Can't open directory
 'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdlin
 e\svn-test-work\working_copies\prop_tests-33\A\C':
 The directory name is invalid.

 - stat-tests.py 5: status on versioned items whose type has changed
 svn: Can't open directory
 'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdlin
 e\svn-test-work\working_copies\stat_tests-5\A':
 The directory name is invalid.

 - upgrade_tests.py 11: missing directories and obstructing files
 svn: Can't open directory
 'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdlin
 e\svn-test-work\working_copies\upgrade_tests-11\A\D':
 The directory name is invalid.

 They all seem to try to open some path as a directory, but it isn't a
 directory (but an empty file or something).

 Am I the only one seeing this? Is this known/normal?

 This seems to be related to the APR version you use. (Paul Burba reported
 this same problem earlier this week).

 Hi Johan,

 Bert is correct, I saw these failures earlier this week while using APR
 1.3.12.  All the failures occur when a file unexpectedly obstructs a
 directory.  The APR macro APR_STATUS_IS_ENOTDIR in 1.3.x does not
 handle the ERROR_DIRECTORY (267, "The directory name is invalid") that
 Windows raises in this case.  This was fixed in APR in
 http://svn.apache.org/viewvc?view=revision&revision=821306 and is in
 APR 1.4.2.

Thanks for the info. I was/am still building against APR 1.3.8, so
that explains it.

So, will 1.4.2 become the minimum supported APR version for SVN? Or
maybe a future 1.3.x release that fixes the same issue?

Or will SVN trunk be changed/fixed to cope with this? Since this
problem only appeared a week ago, it must be that something in svn
code changed to provoke this problem, right? It used to work ...
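
(For reference, the failing code path is essentially the usual APR idiom
sketched below, so whether the file-obstructs-directory case is recognized
hinges entirely on what APR_STATUS_IS_ENOTDIR() matches on Windows;
illustrative only, not the actual libsvn_wc code:)

[[[
#include <apr_errno.h>
#include <apr_pools.h>
#include <apr_file_info.h>

/* Illustration only: returns 1 if PATH exists but is not a directory
   (e.g. a file obstructing a versioned directory), 0 if it opened fine,
   and -1 on any other error. */
static int
check_obstruction(const char *path, apr_pool_t *pool)
{
  apr_dir_t *dir;
  apr_status_t status = apr_dir_open(&dir, path, pool);

  if (status == APR_SUCCESS)
    {
      apr_dir_close(dir);
      return 0;
    }
  /* With APR < 1.4.2 on Windows, ERROR_DIRECTORY (267) falls through
     this check, so the caller reports "Can't open directory" instead
     of treating it as an obstruction. */
  return APR_STATUS_IS_ENOTDIR(status) ? 1 : -1;
}
]]]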

Cheers,
-- 
Johan


Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-02 Thread Johan Corveleyn
Hi,

Here is a second iteration of the patch. It now passes make check.

Differences from the previous version are:
- Support for \r eol-style (\n and \r\n was already ok).
- The number of prefix_lines is now passed to svn_diff__lcs, so it can
use that value to set the position offset of the EOF marker
correctly, in case one of the two files has become empty after skipping
the prefix. This fixes the crashes in blame_tests.py 2 and 7.

The patch is pretty big, so please let me know if I should split it up
to make it more reviewable (I could easily split it up between the
prefix-finding and the suffix-finding, at the cost of losing the overview
of the entire algorithm).

Still to do:
- Think about why results are sometimes different (because of
eliminated suffix, the LCS can sometimes be slightly different), and
what can be done about it.
- Generalize for more than 2 datasources (for diff3 and diff4).
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options.

But I'd like to do those things in follow-up patches, after this one
has been reviewed and digested a little bit. So at this point: review,
feedback, ... very welcome :-).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument datasource_opened, to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument datasource_opened. Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument prefix_lines. Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of common diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
 (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]

Cheers,
-- 
Johan
Index: subversion/include/svn_diff.h
===
--- subversion/include/svn_diff.h   (revision 1003326)
+++ subversion/include/svn_diff.h   (working copy)
@@ -112,6 +112,11 @@ typedef struct svn_diff_fns_t
   svn_error_t *(*datasource_open)(void *diff_baton,
   svn_diff_datasource_e datasource);
 
+  /** Open the datasources of type @a datasources. */
+  svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines,
+   svn_diff_datasource_e datasource0,
+   svn_diff_datasource_e datasource1);
+
   /** Close the datasource of type @a datasource. */
   svn_error_t *(*datasource_close)(void *diff_baton,
svn_diff_datasource_e datasource);
@@ -124,6 +129,9 @@ typedef struct svn_diff_fns_t
 void *diff_baton,
 svn_diff_datasource_e datasource);
 
+  /** Get the number of identical prefix lines from the @a diff_baton. */
+  apr_off_t (*get_prefix_lines)(void *diff_baton);
+
   /** A function for ordering the tokens, resembling 'strcmp' in functionality.
* @a compare should contain the return value of the comparison:
* If @a ltoken and @a rtoken are equal, return 0.  If @a ltoken is
Index: subversion/libsvn_diff/diff_file.c

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-03 Thread Johan Corveleyn
On Sun, Oct 3, 2010 at 1:46 AM, Johan Corveleyn jcor...@gmail.com wrote:
 Hi,

 Here is a second iteration of the patch. It now passes make check.

 Differences from the previous version are:
 - Support for \r eol-style (\n and \r\n was already ok).
 - The number of prefix_lines is now passed to svn_diff__lcs, so it can
 use that value to set the position offset of the EOF marker
 correctly, in case one of the two files has become empty after skipping
 the prefix. This fixes the crashes in blame_tests.py 2 and 7.

 The patch is pretty big, so please let me know if I should split it up
 to make it more reviewable (I could easily split it up between the
 prefix-finding and the suffix-finding, at the cost of having overview
 over the entire algorithm).

 Still to do:
 - Think about why results are sometimes different (because of
 eliminated suffix, the LCS can sometimes be slightly different), and
 what can be done about it.

Hm, this problem has made me reconsider whether my patch is a good
solution (more details below). I'm thinking of other ways of
implementing this kind of optimization, so for now I'm pulling this
patch. Please don't review it yet, as I might go for a radically
different approach (sorry for any time spent that may be lost).

The details:
The problem happens if there are two (or more) non-matching parts with
an identical part in between (so the prefix scanning stops early,
doesn't meet the suffix in one of the files), and the suffix scanning
goes too far because the end of the last change is identical to the
original at that point.

For example (best viewed with fixed-width font):
[[[
original | modified
---
This is line 1   | This is line 1
This is line 2   | This line is *changed*
This is line 3   | This is line 3
... (more identical lines)   | ...
existing_function()  | existing_function()
{| {
  // do stuff|   // do stuff
  return SVN_NO_ERROR;   |   return SVN_NO_ERROR;
}| }
 |
This is the end of the file  | new_function()
-| {
 |   // do other stuff
 |   return SVN_NO_ERROR;
 | }
 |
 | This is the end of the file
 |-
]]]

The following identical suffix will be eliminated:
[[[
  return SVN_NO_ERROR;
}

This is the end of the file
]]]

which means the LCS algorithm cannot do better than to say that the
following lines are new:
[[[
+  return SVN_NO_ERROR;
+}
+
+new_function()
+{
+  // do other stuff
]]]

Not quite what we want/expect. If the suffix is not eliminated, the
LCS algorithm gives the correct/expected result. Also if the change in
the beginning of the file isn't there, the result is correct (because
the prefix scanning goes first, and eliminates the identical stuff
from the start (I'll leave it as an exercise to the reader to see that
the result is better)).

Interestingly, GNU diff also has this problem. It mitigates it a bit
by keeping a number of identical prefix/suffix lines (at least the
number of context lines, or a higher number if supplied by
--horizon-lines=NUM). This cannot completely solve the problem though:
for any given number of horizon-lines one can always come up with an
example that doesn't give the correct result.

So I need to find a way to not eliminate the identical suffix, but
to mark those lines as identical and then include them in the
position_list that's going to be used by the LCS algorithm. I'd like
to do the same for the prefix. This means I need to approach this
problem on a different level.

To be continued ...

Cheers,
-- 
Johan


[RFC] Diff (blame) optimization: how to go forward

2010-10-07 Thread Johan Corveleyn
Hi all,

This is a follow-up to the WIP-patches I posted last week [1], which
are about improving performance of svn_diff (and therefore also blame
on the client-side), especially for large files.

To summarize: the idea was (is) to strip off the identical prefix and
suffix, and then let the existing diff algorithm work on the
remainder. As stated in [2], I ran into some difficulties getting the
exact same result as the existing diff algorithm (because of too eager
suffix stripping). I've spent some time searching for a perfect
solution, but I can't find one. So now I'm stuck and would like to
have some feedback on what to do with it.

First: the thing works, and it can reduce the time needed for svn_diff
by 20% up to 80% for large files (depending on the distribution of the
changes). An extreme example is a reduction from ~150ms to ~30ms for a
6-lines file with a small number of lines changed (say up to 100)
located close together (so lots of identical prefix/suffix).

Blame gets similar benefits, *if the server is fast enough*. I used
svnserve built from Stefan^2's performance branch to test this. A
normal svnserve with FSFS isn't really fast enough (unless maybe with
an SSD, but I don't have one here) to serve up the 2500 revisions of
the large file quickly enough. But the performance-enhanced-svnserve,
with a hot cache filled with the necessary full-texts, is definitely
fast enough to make the time of blame completely dependent on the
client-side performance. Conclusion: the diff optimization we're
talking about is much more useful for blame together with a server
with the full-text membuffer caching of the performance branch.

Now, the reason why I'm stuck: suffix scanning sometimes strips a
little bit too much (see [2] for full explanation). In this case the
resulting diff is still correct, but it's not as nice for the user. As
I noted in [2], GNU diff also has this problem, but the user can help
a little bit by specifying a large enough number of
prefix/suffix-lines to keep (--horizon-lines).

So, what to do?
- I have not found a better solution than to use some form of
horizon-lines to get the nice result, but still have a similar
performance improvement (unless I leave out suffix stripping
entirely, only strip prefix, but that halves the power of the
optimization).

- I've tested with keeping up to 50 suffix-lines. It didn't have the
slightest impact on the performance improvement (we're talking about
stripping away 1000's of lines). This fixed all real-life examples I
could find/test (diff/blame output is precisely identical to original
svn). If I kept only 20 lines, I still ran into some differences (30
lines was enough for my example file, but I took it up to 50 to be
sure).

- I would like to avoid leaving the decision to the user to come up
with an adequate number of suffix-lines-to-keep. So I'd like to just
hardcode some value, say 50, or even 100. I think this would give good
(=identical to original svn) diff/blame results in all but the most
extreme cases. With suffix-lines-to-keep=50, you'd need to insert a
block of text that has its last 50 lines identical to the 50 lines
preceding the insertion point, to mess up the diff result.

- If really necessary, we could say default=50, but it can be
overridden by the user with another -x option.

So the main question is: is such a change in behavior (unlikely to
have an impact in most realistic cases) acceptable for this kind of
performance improvement?
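
(In code terms the workaround is tiny: after the suffix scan, hand the
last N identical lines back to the LCS phase. Roughly, with made-up names:)

[[[
#include <apr.h>

#define SUFFIX_LINES_TO_KEEP 50

/* Illustration only: reduce the identical suffix found by the scan so
   that the last SUFFIX_LINES_TO_KEEP of those lines are still part of
   the token/LCS phase, giving it room to pick the "nicer" match. */
static apr_off_t
adjusted_suffix_lines(apr_off_t found_suffix_lines)
{
  if (found_suffix_lines <= SUFFIX_LINES_TO_KEEP)
    return 0;
  return found_suffix_lines - SUFFIX_LINES_TO_KEEP;
}
]]]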

Other options:
- Come up with a clever solution to fix the optimization properly (but
I don't know if that's possible - suggestions welcome :-)).
- Throw this optimization away, and take a look at something like
Patience diff (supposed to be fast as well, and give nice results).
However, I don't know Patience diff well enough to code up something,
or even to judge whether it would give good results in all (or most)
cases.
- Just throw it away and don't look back :-)
- Something else?

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-10/0032.shtml
[2] http://svn.haxx.se/dev/archive-2010-10/0050.shtml


Re: [RFC] caret notation enhancement

2010-10-08 Thread Johan Corveleyn
On Fri, Oct 8, 2010 at 2:35 PM, Karl Heinz Marbaise khmarba...@gmx.de wrote:
 Hi Tino,

     svn cp ^^/trunk ^^/tags/RELEASE-1.0.0 -m- Tagging

 The usage of the doubled ^ is just as an example, cause i know
 on Windows you already have to type the doubled ^ because of the shell.

 So what should ^^ (or however it turns out) mean, exactly?

 It was just an example how it might look like


 I always wanted something like repository path of current working copy
 - is that what you mean?

 No...the repository path can be seen if you do an svn info in your wc.

 Maybe ^: could be used as the anchor for

 repository path of current working copy, if you are in a copy of
 /some/path/to/project/branch/whatever you might use

   svn cp ^: ^:../../tags/newtag

 to copy that path to /some/path/to/project/tags/newtag.

 Might be difficult to implement, I suppose but IANAD ;).


 Lets assume the following repository structure...to make my explanation more
 clear (Or may be i misunderstand yours?)...


 /
 + px
       +--- py
             +--- pz
                   +--- trunk
                   +--- tags
                   +--- branches


 If you do a checkout from the above trunk you need to do the following to
 create a tag (svn 1.5, 1.6):

    svn cp ^/px/py/pz/trunk ^/px/py/pz/tags/R1 -m- Tag

 So what a like to see is something like:

    svn cp ^!/trunk ^!/tags/R1 -m- Tag

But how is svn supposed to know that you mean ^! to be expanded to
^/px/py/pz, and not to some other subfolder of the repository root,
say ^/px/py/pt?

AFAICS there is no context telling svn in which project you are
working (unless you mean that svn should deduce this from the working
copy you're in when executing that command).

-- 
Johan


Re: [RFC] Diff (blame) optimization: how to go forward

2010-10-08 Thread Johan Corveleyn
On Fri, Oct 8, 2010 at 3:08 PM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 Johan Corveleyn wrote on Fri, 8 Oct 2010 at 01:44 -:
 With suffix-lines-to-keep=50, you'd need to insert a
 block of text that has its last 50 lines identical to the 50 lines
 preceding the insertion point, to mess up the diff result.

 - If really necessary, we could say default=50, but it can be
 overridden by the user with another -x option.

 So the main question is: is such a change in behavior (unlikely to
 have an impact in most realistic cases) acceptable for this kind of
 performance improvement?

 Just to clarify: are you asking whether it's acceptable to have a not
 nice (but still semantically correct) diff output in case the user adds
 a block whose last 50 lines match the 50 lines prior to the first
 difference?

 And 'not nice' just means the preceding, rather than trailing, instance
 of the repeated block would be grabbed by the /^+/ lines in the diff,
 right?

Yep, that's what I meant.

 In this case, I think I'm going to answer your long mail with just...

    Yes.

Ok, great.

 or actually...

    Yes, and let's move on to deeper problems (if any).

What deeper problems? :-)

Seriously: I saw this as the biggest problem in the scope of this
optimization. If the result is not good the entire optimization would
be worthless, so any other problem with my implementation would be
irrelevant.

Good to hear that it's not that big of a problem.

(I realize this optimization is just a small thing compared to more
important/fundamental stuff going on right now in subversion. But it's
important to me personally. And it's the most I can do with my
abilities/knowledge/time right now, so ...)

 (I'm not /particularly/ worried about a different-than-expected, but
 still semantically correct, diff --- though, yes, it's nice to output
 the expected diff if that's possible without too much effort and
 without breaking other use cases.)

 :-),

 Daniel
 (p.s. I don't have as much time to look into this as I'd like to...)

No problem. Your feedback is very much appreciated.

On Fri, Oct 8, 2010 at 8:46 PM, Hyrum K. Wright
hyrum_wri...@mail.utexas.edu wrote:
 My not-having-looked-deeply-at-the-problem reaction is this: if we can
 introduce significant speed-ups in the common case, while still
 maintaining functionality, let's do it.  There may be corner cases
 where somebody gets an ugly diff, but in those corner cases there are
 likely to be *other* issues as play as well.  Let's not penalize
 everybody for the sake of a couple of corner cases.

Ok, perfect, we seem to have a consensus then :-). I'll cook up a
version of the patch which keeps 50 suffix lines.

Thanks for the feedback, Daniel and Hyrum.

Just one more thing: as I mentioned in my rather long mail, blame
would benefit the most from my optimization if the server were fast
enough. So if anyone could spare some review cycles (ok, I know they
are scarce these days), reviewing Stefan^2's integrate-cache-membuffer
branch would really help my cause as well ...

Cheers,
-- 
Johan


Re: [RFC] Diff (blame) optimization: how to go forward

2010-10-08 Thread Johan Corveleyn
On Fri, Oct 8, 2010 at 10:35 PM, Hyrum K. Wright
hyrum_wri...@mail.utexas.edu wrote:
 On Fri, Oct 8, 2010 at 3:24 PM, Johan Corveleyn jcor...@gmail.com wrote:
 ...

 Just one more thing: as I mentioned in my rather long mail, blame
 would benefit the most from my optimization if the server were fast
 enough. So if anyone could spare some review cycles (ok, I know they
 are scarce these days), reviewing Stefan^2's integrate-cache-membuffer
 branch would really help my cause as well ...

 Yeah, just a question of tuits.  If you've reviewed the branch,
 posting your conclusions here would be a great contribution.  Speaking
 for myself, I feel you've got more experience in some areas of that
 branch than I do. :)

 (Apologies if you've already done so, and I've missed it.)

No, sorry, I haven't either. I have compiled and run it (well actually
the performance branch itself, not the integrate-cache-membuffer
branch). But I haven't looked at the source code.

Mmmm, tuits, I wish I had some spare ones lying around :-). Who knows,
maybe I'll find some ...

Cheers,
-- 
Johan


Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-08 Thread Johan Corveleyn
Ok, third iteration of the patch in attachment. It passes make check.

As discussed in [1], this version keeps 50 lines of the identical
suffix around, to give the algorithm a good chance to generate a diff
output of good quality (in all but the most extreme cases, this will
be the same as with the original svn_diff algorithm).

That's about the only difference with the previous iteration. So for
now, I'm submitting this for review. Any feedback is very welcome :-).

I still consider this a WIP, because of the following remaining todo's
(which may have a lot of impact on the current implementation):
- Generalize for more than 2 datasources (for diff3 and diff4).
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
switch the implementation to read out entire lines before comparing
(like datasources_get_next_token does).
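
To make the byte-scanning idea concrete, here is a minimal, self-contained
sketch (illustrative only, not the patch code; the real implementation works
chunk-wise on apr_file_t-backed buffers and also scans backwards for the
suffix):

[[[
/* Sketch: count the identical leading lines of two in-memory buffers.
 * The patch applies the same idea byte-per-byte over 128K file chunks. */
#include <stdio.h>
#include <string.h>

static size_t
identical_prefix_lines(const char *a, size_t a_len,
                       const char *b, size_t b_len)
{
  size_t i, lines = 0;
  size_t min_len = a_len < b_len ? a_len : b_len;

  for (i = 0; i < min_len && a[i] == b[i]; i++)
    if (a[i] == '\n')
      lines++;              /* passed another fully identical line */

  return lines;
}

int main(void)
{
  const char *original = "one\ntwo\nthree\n";
  const char *modified = "one\ntwo\nTHREE\n";

  printf("%d\n", (int)identical_prefix_lines(original, strlen(original),
                                             modified, strlen(modified)));
  return 0;   /* prints 2 */
}
]]]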

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix. From the identical suffix, 50 lines are kept to
   help the diff algorithm find the nicest possible diff representation
   in case of ambiguity.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument datasource_opened, to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument datasource_opened. Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument prefix_lines. Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of common diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]


Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-10/0141.shtml
Index: subversion/include/svn_diff.h
===
--- subversion/include/svn_diff.h   (revision 1006020)
+++ subversion/include/svn_diff.h   (working copy)
@@ -112,6 +112,11 @@ typedef struct svn_diff_fns_t
   svn_error_t *(*datasource_open)(void *diff_baton,
   svn_diff_datasource_e datasource);
 
+  /** Open the datasources of type @a datasources. */
+  svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines,
+   svn_diff_datasource_e datasource0,
+   svn_diff_datasource_e datasource1);
+
   /** Close the datasource of type @a datasource. */
   svn_error_t *(*datasource_close)(void *diff_baton,
svn_diff_datasource_e datasource);
@@ -124,6 +129,9 @@ typedef struct svn_diff_fns_t
 void *diff_baton,
 svn_diff_datasource_e datasource);
 
+  /** Get the number of identical prefix lines from the @a diff_baton. */
+  apr_off_t (*get_prefix_lines)(void *diff_baton);
+
   /** A function for ordering the tokens, resembling 'strcmp' in functionality.
* @a compare should contain the return value of the comparison:
* If @a ltoken and @a rtoken are equal, return 0.  If @a ltoken is
Index: subversion/libsvn_diff/diff_file.c
===
--- subversion/libsvn_diff/diff_file.c  (revision 1006020)
+++ 

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-09 Thread Johan Corveleyn
On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad julian.f...@wandisco.com wrote:
 On Sat, 2010-10-09, Johan Corveleyn wrote:
 Ok, third iteration of the patch in attachment. It passes make check.

 As discussed in [1], this version keeps 50 lines of the identical
 suffix around, to give the algorithm a good chance to generate a diff
 output of good quality (in all but the most extreme cases, this will
 be the same as with the original svn_diff algorithm).

 That's about the only difference with the previous iteration. So for
 now, I'm submitting this for review. Any feedback is very welcome :-).

 Hi Johan.

Hi Julian,

Thanks for taking a look.

 I haven't reviewed it, but after seeing today's discussion I had just
 scrolled quickly through the previous version of this patch.  I noticed
 that the two main functions - find_identical_suffix and
 find_identical_suffix - are both quite similar (but not quite similar
 enough to make them easily share implementation) and both quite long,
 and I noticed you wrote in an earlier email that you were finding it
 hard to make the code readable.  I have a suggestion that may help.

 I think the existing structure of the svn_diff__file_baton_t is
 unhelpful:
 {
  const svn_diff_file_options_t *options;
  const char *path[4];

  apr_file_t *file[4];
  apr_off_t size[4];

  int chunk[4];
  char *buffer[4];
  char *curp[4];
  char *endp[4];

  /* List of free tokens that may be reused. */
  svn_diff__file_token_t *tokens;

  svn_diff__normalize_state_t normalize_state[4];

  apr_pool_t *pool;
 } svn_diff__file_baton_t;

 All those array[4] fields are logically related, but this layout forces
 the programmer to address them individually.

 So I wrote a patch - attached - that refactors this into an array of 4
 sub-structures, and simplifies all the code that uses them.

 I think this will help you to get better code clarity because then your
 increment_pointer_or_chunk() for example will be able to take a single
 pointer to a file_info structure instead of a lot of pointers to
 individual members of the same.

 Would you take a look and let me know if you agree.  If so, I can commit
 the refactoring straight away.

Yes, great idea! That would indeed vastly simplify a lot of the code.
So please go ahead and commit the refactoring.

Also, maybe the last_chunk number could be included in the file_info
struct? Now it's calculated in several places: last_chunk0 =
offset_to_chunk(file_baton->size[idx0]), or I have to pass it on
every time as an extra argument. Seems like the sort of info that
could be part of file_info.
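
For what it's worth, with your refactoring applied, the per-file part would
then look roughly like this (just a sketch, not the committed code;
normalize_state is omitted for brevity, and last_chunk is the proposed extra
member):

[[[
#include <apr_file_io.h>   /* apr_file_t, apr_off_t */

/* Illustrative sketch of the per-file sub-structure ("file_info"). */
struct file_info
{
  const char *path;    /* path of this file */
  apr_file_t *file;    /* file handle */
  apr_off_t size;      /* total raw size in bytes */

  int chunk;           /* index of the chunk currently in buffer */
  char *buffer;        /* content of the current chunk */
  char *curp;          /* current position in the buffer */
  char *endp;          /* next memory address after the buffer */

  int last_chunk;      /* proposed addition: offset_to_chunk(size),
                          computed once instead of at every use */
};
]]]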

One more thing: you might have noticed that for find_identical_suffix
I use other buffers, chunks, curp's, endp's, ... than for the prefix.
For prefix scanning I can just use the stuff from the diff_baton,
because after prefix scanning has finished, everything is buffered and
pointing correctly for the normal algorithm to continue (i.e.
everything points at the first byte of the first non-identical line).
For suffix scanning I need to use other structures (newly alloc'd
buffer etc), so as to not mess with those pointers/buffers from the
diff_baton.

So: I think I'll need the file_info struct to be available out of the
diff_baton_t struct as well, so I can use this in suffix scanning
also.

(side-note: I considered first doing suffix scanning, then prefix
scanning, so I could reuse the buffers/pointers from diff_baton all
the time, and still have everything pointing correctly after
eliminating prefix/suffix. But that could give vastly different
results in some cases, for instance when original file is entirely
identical to both the prefix and the suffix of the modified file. So I
decided it's best to stick with first prefix, then suffix).

 Responding to some of the other points you mentioned in a much earlier
 mail:

 3) It's only implemented for 2 files. I'd like to generalize this for
 an array of datasources, so it can also be used for diff3 and diff4.

 4) As a small hack, I had to add a flag datasource_opened to
 token.c#svn_diff__get_tokens, so I could have different behavior for
 regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
 no longer needed.

 Yes, I'd like to see 3), and so hack 4) will go away.

I'm wondering though how I should represent the datasources to pass
into datasources_open. An array combined with a length parameter?
Something like:

static svn_error_t *
datasources_open(void *baton, apr_off_t *prefix_lines,
                 svn_diff_datasource_e *datasources, int datasources_len)

? And then use for loops everywhere I now do things twice for the two
datasources?

 5) I've struggled with making the code readable/understandable. It's
 likely that there is still a lot of room for improvement. I also
 probably need to document it some more.

 You need to write a full doc string for datasources_open(), at least.
 It needs especially to say how it relates to datasource_open() - why
 should the caller call

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-09 Thread Johan Corveleyn
On Sat, Oct 9, 2010 at 5:19 PM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 Johan Corveleyn wrote on Sat, Oct 09, 2010 at 14:21:09 +0200:
 (side-note: I considered first doing suffix scanning, then prefix
 scanning, so I could reuse the buffers/pointers from diff_baton all
 the time, and still have everything pointing correctly after
 eliminating prefix/suffix. But that could give vastly different
 results in some cases, for instance when original file is entirely
 identical to both the prefix and the suffix of the modified file. So I
 decided it's best to stick with first prefix, then suffix).

 What Hyrum said.  How common /is/ this case?  And, anyway, in that case
 both everything was appended and everything was prepended are
 equally legitimate diffs.

Hm, I'm not sure about this one. I just wanted to try the maximum
reasonably possible to keep the results identical to what they were.
Using another buffer for suffix scanning didn't seem that big of a
deal (only slight increase in memory use (2 chunks of 128K in current
implementation)). I made that decision pretty early, before I knew of
the other problem of suffix scanning, and the keep-50-suffix-lines
compromise we decided upon.

There may be more subtle cases than the one I described, I don't know.
OTOH, now that we have the keep-50-suffix-lines, that may help also in
this case. I'll have to think about that. Maybe I can give it a go,
first suffix then prefix, and see if I can find real-life problems ...

-- 
Johan


Re: trunk failing tests on Windows XP (32 bit): prop-tests.py 33, stat-tests.py 5, upgrade-tests.py 11

2010-10-10 Thread Johan Corveleyn
On Sat, Oct 2, 2010 at 11:24 AM, Bert Huijben b...@qqmail.nl wrote:
 -Original Message-
 From: Paul Burba [mailto:ptbu...@gmail.com]
 Sent: vrijdag 1 oktober 2010 15:46
 To: Bert Huijben
 Cc: Johan Corveleyn; Subversion Development
 Subject: Re: trunk failing tests on Windows XP (32 bit): prop-tests.py
 33, stat-tests.py 5, upgrade-tests.py 11

 On Thu, Sep 30, 2010 at 8:10 PM, Bert Huijben b...@qqmail.nl wrote:
  -Original Message-
  From: Johan Corveleyn [mailto:jcor...@gmail.com]
  Sent: vrijdag 1 oktober 2010 1:51
  To: Subversion Development
  Subject: trunk failing tests on Windows XP (32 bit): prop-tests.py
 33,
  stat-tests.py 5, upgrade-tests.py 11
 
  Hi devs,
 
  The following tests fail on my machine (Windows XP 32-bit, built
 with
  VCE 2008, ra_local):
 
  - prop-tests.py 33: test properties of obstructed subdirectories
  svn: Can't open directory
 
 'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdlin
  e\svn-test-work\working_copies\prop_tests-33\A\C':
  The directory name is invalid.
 
  - stat-tests.py 5: status on versioned items whose type has changed
  svn: Can't open directory
 
 'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdlin
  e\svn-test-work\working_copies\stat_tests-5\A':
  The directory name is invalid.
 
  - upgrade_tests.py 11: missing directories and obstructing files
  svn: Can't open directory
 
 'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdlin
  e\svn-test-work\working_copies\upgrade_tests-11\A\D':
  The directory name is invalid.
 
  They all seem to try to open some path as a directory, but it isn't
 a
  directory (but an empty file or something).
 
  Am I the only one seeing this? Is this known/normal?
 
  This seems to be related to the APR version you use. (Paul Burba
 reported
  this same problem earlier this week).

 Hi Johan,

 Bert is correct, I saw these failure earlier this week while using APR
 1.3.12.  All the failures occur when a file unexpectedly obstructs a
 directory.  The APR macro APR_STATUS_IS_ENOTDIR in 1.3.x does not
 handle the ERROR_DIRECTORY 267 "The directory name is invalid" that
 Windows raises in this case.  This was fixed in APR in
 http://svn.apache.org/viewvc?view=revisionrevision=821306 and is in
 APR 1.4.2.

 I think we should just add this error to the SVN__APR_STATUS_IS_ENOTDIR() 
 macro to fix compatibility with older apr versions, but I would like to know 
 why this didn't fail a few weeks ago. (I haven't been very active for the 
 last week, but I haven't seen any relevant commits on the commit list in the 
 last few weeks)

FWIW:
I don't know if anybody else has tried to pinpoint when/why this
started to fail on trunk, but I've taken a look. It seems to have
started already quite some time ago: it started with r991236 (31
August), which was the enabling of single-DB mode.

I didn't have time yet to investigate further with SVN_WC__SINGLE_DB
and SINGLE_DB
defined to see which commit caused it with single-db already enabled
before r991236. If nobody beats me to it, I'll continue the search
when I have some more time next week.

Gut feeling: maybe it has something to do with the actual removal of
directories-scheduled-for-deletion in single-db, as opposed to the
wc-1 behavior?

Cheers,
-- 
Johan


Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-10 Thread Johan Corveleyn
On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad julian.f...@wandisco.com wrote:
 But this makes me think, it looks to me like this whole
 prefix-suffix-skipping functionality would fit better inside the
 lower-level diff algorithm rather than inside the datasources_open
 function.  Do you agree?

 It works as it is, but you were talking about wanting it to obey the
 standard token-parsing rules such as ignore white space, and so on.
 It seems that things like this would be much better one level down.

 Yes, I've been struggling with this. But I can't easily see it fit in
 the lower levels right now. Problem is that everything in those lower
 levels always acts on 1 datasource at a time (getting all the tokens,
 inserting them into the token tree, ... then the same for the next
 datasource). The datasource_open seemed to me to be the easiest place
 to combine datasources to do things for both of them concurrently
 (with least impact on the rest of the code).

 Maybe those lower-level functions could also be made
 multi-datasource, but I have to think a little bit more about that.

I've thought a little bit more about this, going through the code, and
yeah, it might be a good idea to push the prefix/suffix scanning into
the lower-level parts of the diff-algorithm (the token parsing /
building the token tree).

Something like:
- Make token.c#svn_diff__get_tokens take multiple datasources.
- In diff.c, diff3.c and diff4.c replace the multiple calls to this
function to one call passing multiple datasources.
- token.c#svn_diff__get_tokens (with multiple datasources) will take
care of identical prefix and suffix scanning based on tokens (so can
take advantage of the standard token-parsing rules with ignore-*
options, by simply calling svn_diff_fns_t.datasource_get_next_token).

Some things needed to support this:
- svn_diff_fns_t.datasource_get_next_token: calculation of the hash
should be made conditional (I don't want to waste time for the adler32
hash for lines that are not going to be put in the token tree).
- For suffix scanning, I'll need another variation of
datasource_get_next_token to get tokens from the end going backwards
(say datasource_get_previous_token). No hash needed.

Does that make sense? Or am I missing it completely?
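
To illustrate that second point (datasource_get_previous_token): the
backwards scan would conceptually be something like this, a plain-buffer
sketch only, with made-up names, not the eventual vtable function:

[[[
#include <stddef.h>

/* Return the start offset of the line that ends at END (exclusive) in BUF,
 * i.e. the token a backwards-walking "get previous token" would hand back. */
static size_t
previous_line_start(const char *buf, size_t end)
{
  if (end > 0 && buf[end - 1] == '\n')
    end--;                          /* step over this line's own newline */
  while (end > 0 && buf[end - 1] != '\n')
    end--;                          /* back up past the line's contents */
  return end;
}
]]]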

It would mean I'd have to throw away the current patch and rewrite it
in the token layer (but I don't really mind, as long as a good
solution takes its place :-)). And your file_info refactoring would
also not be used (although it's certainly a worthwhile refactoring by
itself, making the current code clearer).

Cheers,
-- 
Johan


Re: trunk failing tests on Windows XP (32 bit): prop-tests.py 33, stat-tests.py 5, upgrade-tests.py 11

2010-10-15 Thread Johan Corveleyn
On Sun, Oct 10, 2010 at 3:07 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Oct 2, 2010 at 11:24 AM, Bert Huijben b...@qqmail.nl wrote:
 -Original Message-
 From: Paul Burba [mailto:ptbu...@gmail.com]
 Sent: vrijdag 1 oktober 2010 15:46
 To: Bert Huijben
 Cc: Johan Corveleyn; Subversion Development
 Subject: Re: trunk failing tests on Windows XP (32 bit): prop-tests.py
 33, stat-tests.py 5, upgrade-tests.py 11

 On Thu, Sep 30, 2010 at 8:10 PM, Bert Huijben b...@qqmail.nl wrote:
  -Original Message-
  From: Johan Corveleyn [mailto:jcor...@gmail.com]
  Sent: vrijdag 1 oktober 2010 1:51
  To: Subversion Development
  Subject: trunk failing tests on Windows XP (32 bit): prop-tests.py
 33,
  stat-tests.py 5, upgrade-tests.py 11
 
  Hi devs,
 
  The following tests fail on my machine (Windows XP 32-bit, built
 with
  VCE 2008, ra_local):
 
  - prop-tests.py 33: test properties of obstructed subdirectories
  svn: Can't open directory
 
 'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdlin
  e\svn-test-work\working_copies\prop_tests-33\A\C':
  The directory name is invalid.
 
  - stat-tests.py 5: status on versioned items whose type has changed
  svn: Can't open directory
 
 'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdlin
  e\svn-test-work\working_copies\stat_tests-5\A':
  The directory name is invalid.
 
  - upgrade_tests.py 11: missing directories and obstructing files
  svn: Can't open directory
 
 'C:\research\svn\client_build\svn_trunk\Release\subversion\tests\cmdlin
  e\svn-test-work\working_copies\upgrade_tests-11\A\D':
  The directory name is invalid.
 
  They all seem to try to open some path as a directory, but it isn't
 a
  directory (but an empty file or something).
 
  Am I the only one seeing this? Is this known/normal?
 
  This seems to be related to the APR version you use. (Paul Burba
 reported
  this same problem earlier this week).

 Hi Johan,

 Bert is correct, I saw these failure earlier this week while using APR
 1.3.12.  All the failures occur when a file unexpectedly obstructs a
 directory.  The APR macro APR_STATUS_IS_ENOTDIR in 1.3.x does not
 handle the ERROR_DIRECTORY 267 "The directory name is invalid" that
 Windows raises in this case.  This was fixed in APR in
 http://svn.apache.org/viewvc?view=revisionrevision=821306 and is in
 APR 1.4.2.

 I think we should just add this error to the SVN__APR_STATUS_IS_ENOTDIR() 
 macro to fix compatibility with older apr versions, but I would like to know 
 why this didn't fail a few weeks ago. (I haven't been very active for the 
 last week, but I haven't seen any relevant commits on the commit list in the 
 last few weeks)

 FWIW:
 I don't know if anybody else has tried to pinpoint when/why this
 started to fail on trunk, but I've taken a look. It seems to have
 started already quite some time ago: it started with r991236 (31
 August), which was the enabling of single-DB mode.

 I didn't have time yet to investigate further with SVN_WC__SINGLE_DB
 and SINGLE_DB
 defined to see which commit caused it with single-db already enabled
 before r991236. If nobody beats me to it, I'll continue the search
 when I have some more time next week.

 Gut feeling: maybe it has something to do with the actual removal of
 directories-scheduled-for-deletion in single-db, as opposed to the
 wc-1 behavior?

I saw that the problem was fixed in r102170 by Bert, by changing the
SVN__APR_STATUS_IS_ENOTDIR() macro. Thanks, the tests all pass now,
with APR 1.3.8 on my WinXP.
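
(For reference, the change amounts to roughly the following; this is a
sketch from memory, not the literal commit. It makes the private macro also
accept the Windows error code that older APR versions don't map to ENOTDIR;
ERROR_DIRECTORY comes from the Windows headers.)

[[[
#include <apr_errno.h>

#ifdef WIN32
#define SVN__APR_STATUS_IS_ENOTDIR(s)                        \
  (APR_STATUS_IS_ENOTDIR(s)                                  \
   || (s) == (APR_OS_START_SYSERR + ERROR_DIRECTORY))
#else
#define SVN__APR_STATUS_IS_ENOTDIR(s)  APR_STATUS_IS_ENOTDIR(s)
#endif
]]]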

Now, some root-cause analysis :-), in case this is still useful...

Going further back than r991236, but with SVN_WC__SINGLE_DB and
SINGLE_DB explicitly defined, I found the following:

1) prop_tests.py 33
- First fails in r980406 with (Can't open directory ... : The
directory name is invalid.)
- Before r980406, it fails with a more normal failure: EXCEPTION:
SVNTreeUnequal.

2) stat_tests.py 5
- First fails in r990818 with (Can't open directory ... : The
directory name is invalid.)
- Before r990818, the test succeeded.

3) upgrade_tests.py 11
- First fails in r991160 with (Can't open directory ... : The
directory name is invalid.)
- Before r991160, it fails with an assertion: (svn: In file
'..\..\..\subversion\libsvn_wc\wc_db.c'
line 6558: assertion failed ((pdh)->wcroot != NULL &&
(pdh)->wcroot->format == SVN_WC__VERSION)).

Now it's up to someone else to take a closer look at those revisions,
and try to figure out why they introduced the directory name is
invalid failures. That's a bit out of my league :-).

Cheers,
-- 
Johan


Re: trunk failing tests on Windows XP (32 bit): prop-tests.py 33, stat-tests.py 5, upgrade-tests.py 11

2010-10-15 Thread Johan Corveleyn
On Fri, Oct 15, 2010 at 12:20 PM, Johan Corveleyn jcor...@gmail.com wrote:
 I saw that the problem was fixed in r102170 by Bert

s/r102170/r1021760/

I really should install that Undo send google labs extension :)

-- 
Johan


Re: svn commit: r1022931 - in /subversion/trunk/subversion/libsvn_wc: status.c wc-queries.sql wc_db.c wc_db.h

2010-10-15 Thread Johan Corveleyn
On Fri, Oct 15, 2010 at 4:19 PM,  phi...@apache.org wrote:
 Author: philip
 Date: Fri Oct 15 14:19:36 2010
 New Revision: 1022931

 URL: http://svn.apache.org/viewvc?rev=1022931view=rev
 Log:
 Implement status using per-dir queries.  On my machine (Linux, local
 disk) this improves the speed of status on a Subversion trunk working
 copy by a factor of 3, with a hot-cache, and about 30% with a cold cache.
 1.7 is still slower than 1.6 for the hot-cache (but it's less than 100%),
 but for cold-cache 1.7 is now faster than 1.6.

Now, that's more like it!

On my machine (WinXP, 32 bit, local disk 5400 rpm), with a medium size
working copy (944 dirs, 10286 files; same as I used in previous tests
([1])):

- cold cache: 1.7 is almost 50% faster than 1.6
    1.7: 22s
    1.6: 42s

- hot cache: 1.7 is just about on par with 1.6 (only 20% slower)
    1.7: 0.86s
    1.6: 0.72s

Great work, Philip (and all the others who brought us up to this point
of course)!

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-09/0067.shtml


Re: massive memory leak

2010-10-19 Thread Johan Corveleyn
On Tue, Oct 19, 2010 at 8:14 PM, Stefan Küng tortoise...@gmail.com wrote:
 On 19.10.2010 20:13, Lieven Govaerts wrote:

 This has been fixed in serf trunk r1408 for a while, but hasn't shown
 up in a serf patch release yet.

 Sorry, I should have checked the serf commits first.

 Thanks for the update on this.

Also see:
http://subversion.tigris.org/issues/show_bug.cgi?id=3684 - ra_serf has
a memory usage problem

Where Paul Burba describes exactly this problem (memory leak with serf
0.7.0 over SSL, no leak with http), and confirms that s...@1408 fixes
it.

Cheers,
-- 
Johan


diff4: is it actually used?

2010-10-19 Thread Johan Corveleyn
Hi devs,

In the context of the diff optimization patch I'm working on ([1]),
I'm wondering if diff4 is actually used in svn. If I look for usages
of subversion/libsvn_diff/diff4.c#svn_diff_diff4, I only come up with
tools/diff/diff4.c#main.

So: this code isn't used in the svn core itself? What's the use of the
diff4 executable in tools/diff? Is it for experimentation with that
implementation, but it never really got finalized to be taken up in
svn proper? Or am I overlooking something?

Also, tools/diff/diff4.c mentions the following description above the
copyright notice: "diff4-test.c -- test driver for 3-way text merges".
Wrong filename. And "test driver", what does that mean?


Background: I'm working on extending my patch to work for all diff's
in svn: normal diff, diff3 and diff4 (to avoid having to special case
things only for regular diff, and the patch is just as useful for
diffX as for regular diff). I don't want to break anything, so I was
looking around for ways how to exercise that code.

Should I take diff4 into account, and if so, how can I test it?

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-10/0167.shtml


Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-19 Thread Johan Corveleyn
On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad julian.f...@wandisco.com wrote:
 On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
 On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad julian.f...@wandisco.com 
 wrote:
  On Sat, 2010-10-09, Johan Corveleyn wrote:
  On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad julian.f...@wandisco.com 
  wrote:
   So I wrote a patch - attached - that refactors this into an array of 4
   sub-structures, and simplifies all the code that uses them.
  [...]
  Yes, great idea! That would indeed vastly simplify a lot of the code.
  So please go ahead and commit the refactoring.
 
  OK, committed in r1021282.

 Thanks, looks much more manageable now.

 I'd like to see a simplified version of your last patch, taking
 advantage of that, before you go exploring other options.

Ok, here's a new version of the patch, taking advantage of your
file_info refactoring. This vastly simplifies the code, so that it
might actually be understandable now :-).

Other things I've done in this version:
1) Generalized everything to handle an array of datasources/files,
instead of just two. This makes it slightly more complex here and
there (using for loops everywhere), but I think it's ok, and it's also
more consistent/generic. If anyone has better ideas to do those for
loops, suggestions welcome.

This makes the algorithm usable by diff3 as well (and diff4 if needed
(?)). I have not yet enabled it for diff3, because I haven't yet
understood how it handles the generation of its diff output (needs to
take into account the prefix_lines. I tried some quick hacks, but lots
of tests were failing, so I'll have to look more into it - that's for
a follow up patch). When I can enable it for diff3 (and diff4), I can
remove datasource_open (with one datasource).

2) Removed get_prefix_lines from svn_diff_fns_t (and its
implementations in diff_file.c and diff_memory.c). Instead I pass
prefix_lines directly to token.c#svn_diff__get_tokens.

3) If prefix scanning ended in the last chunk, the suffix scanning now
reuses that buffer which already contains the last chunk. As a special
case, this also avoids reading the file twice if it's smaller than 128
Kb.

4) Added doc strings everywhere. Feel free to edit those, I'm new at
documenting things in svn.


Still TODO:
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- See if implementing the critical parts of increment_pointers and
decrement_pointers in a macro improves performance (see the sketch below).
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. For
this (and for other reasons), I'd still like to investigate pushing
this optimization into the token parsing/handling layer, to extract
entire tokens etc., even if this means the current patch has to be
thrown away. I'll take this up in a separate thread.
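
(To give an idea of what that macro TODO means: something along these lines,
purely illustrative; read_next_chunk is a made-up placeholder for the
chunk-refill logic, not a real function in the patch.)

[[[
/* Sketch only: inline the hot path of incrementing a file pointer,
 * falling back to a helper only when the current chunk is exhausted. */
#define INCREMENT_POINTER(fi, err)                                    \
  do {                                                                \
    if (++(fi)->curp == (fi)->endp)                                   \
      (err) = read_next_chunk(fi);   /* hypothetical refill helper */ \
  } while (0)
]]]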

Log message:
[[[
Make svn_diff skip identical prefix and suffix to make diff and blame faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function type datasources_open to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (svn_diff__mem_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added member prefix_lines, and inside the
   struct file_info the members suffix_start_chunk and suffix_offset_in_chunk.
  (increment_pointers, decrement_pointers): New functions.
  (is_one_at_bof, is_one_at_eof): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open multiple datasources and find
   their identical prefix and suffix, so these can be excluded from the rest
   of the diff algorithm, as a performance optimization. From the identical
   suffix, 50 lines are kept to help the diff algorithm find the nicest
   possible diff representation in case of ambiguity.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument datasource_opened, to indicate that
   the datasource was already opened, and argument prefix_lines, the number
   of identical prefix lines, and use
   this as the starting offset for the token we're getting.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added arguments datasource_opened and
   prefix_lines. Only open the datasource if datasource_opened is FALSE.
   Set the starting offset of the position list to the number of prefix_lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument prefix_lines. Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of common diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open to open original and
   modified at once and find

Diff optimization: implement prefix/suffix-skipping in token-handling code (was: Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster)

2010-10-19 Thread Johan Corveleyn
On Tue, Oct 12, 2010 at 12:35 PM, Julian Foad julian.f...@wandisco.com wrote:
 On Sun, 2010-10-10 at 23:43 +0200, Johan Corveleyn wrote:
 On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn jcor...@gmail.com wrote:
  On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad julian.f...@wandisco.com 
  wrote:
  But this makes me think, it looks to me like this whole
  prefix-suffix-skipping functionality would fit better inside the
  lower-level diff algorithm rather than inside the datasources_open
  function.  Do you agree?
 
  It works as it is, but you were talking about wanting it to obey the
  standard token-parsing rules such as ignore white space, and so on.
  It seems that things like this would be much better one level down.
 
  Yes, I've been struggling with this. But I can't easily see it fit in
  the lower levels right now. Problem is that everything in those lower
  levels always acts on 1 datasource at a time (getting all the tokens,
  inserting them into the token tree, ... then the same for the next
  datasource). The datasource_open seemed to me to be the easiest place
  to combine datasources to do things for both of them concurrently
  (with least impact on the rest of the code).
 
  Maybe those lower-level functions could also be made
  multi-datasource, but I have to think a little bit more about that.

 I've thought a little bit more about this, going through the code, and
 yeah, it might be a good idea to push the prefix/suffix scanning into
 the lower-level parts of the diff-algorithm (the token parsing /
 building the token tree).

 Something like:
 - Make token.c#svn_diff__get_tokens take multiple datasources.
 - In diff.c, diff3.c and diff4.c replace the multiple calls to this
 function to one call passing multiple datasources.
 - token.c#svn_diff__get_tokens (with multiple datasources) will take
 care of identical prefix and suffix scanning based on tokens (so can
 take advantage of the standard token-parsing rules with ignore-*
 options, by simply calling svn_diff_fns_t.datasource_get_next_token).

 One of the improvements we're looking for is to make use of the already
 existing diff options - ignore-white-space etc.

 We can get that improvement with a much smaller change: simply by
 calling the 'get_next_token' routine, or rather a part of it, from
 within your current 'find_identical_prefix' function, without touching
 any of the generic diffN.c/token.c code and APIs.

 Some things needed to support this:
 - svn_diff_fns_t.datasource_get_next_token: calculation of the hash
 should be made conditional (I don't want to waste time for the adler32
 hash for lines that are not going to be put in the token tree).

 Yes.  If you take this smaller change approach, then the way to do
 this would be to factor out the non-adler part of
 'datasource_get_next_token' into a separate private function, and call
 that.

 - For suffix scanning, I'll need another variation of
 datasource_get_next_token to get tokens from the end going backwards
 (say datasource_get_previous_token). No hash needed.

 Yes.

 Does that make sense? Or am I missing it completely?

 I can't really comment on the question of moving this right down into
 the diffN.c/token.c code, as I don't have time to study that right now.
 The possible benefits I can see are:

 * Sharing this feature across all kinds of diff implementations
 (currently: file and in-memory-string).  Good from a practical POV if
 and only if really long strings are being compared in memory.  Good from
 a code-structural POV.

Yes, that seems like a nice improvement, code-structurally. I don't
know if it will have noticeable performance impact. If I understand
the code correctly, the in-memory-diff code is used for diffing
properties. Some properties can be quite large (e.g. svn:mergeinfo),
but I don't know if they are large enough to have an impact (unless
some high-level operations perform a million property diffs or
something).

 * I'm curious about whether it would be possible to integrate prefix
 skipping into the main algorithm in such a way that the algorithm would
 see the skipped prefix as a possible match, just like any other chunk
 (including its adler32), but in a much quicker way than the algorithm
 currently finds such a prefix.  If so, it would be able to handle better
 the cases where one file has added a prefix that is a duplicate of
 existing text.  Same for the suffix.

Maybe that's possible, but I don't think it will help much. For one
thing, it introduces some overhead (adler32 calculation of the
prefix). And it will only help if the added piece of text is *exactly*
the same as the prefix (every line of it). If the added piece of text
misses the last line of the prefix, it will not match (different
adler32 hash). If you need to be able to match at different spots
inside the prefix, you'll have to hash (and compare) every line
separately, which voids the benefit of this optimization.

However, I've had another idea for optimization, which I

Issue triage: issues 3429 and 3474

2010-11-08 Thread Johan Corveleyn
Hi,

I recently verified that the following two issues are fixed on trunk:

- #3429: svn mv A B; svn mv B A generates replace without history

- #3474: making a new subdir, moving files into it and then renaming
the subdir, breaks history of the moved files

Should I mark these as fixed in the issue tracker, or will someone
else do that, after double-checking my findings?

Or, maybe the best approach: I could add a regression test for these
issues, so we can all be sure that they are fixed (and remain fixed),
after which they can be marked as fixed.

I have not written a test for Subversion before, but I guess I should
be able to manage after reading the relevant section from the
community guide, and the subversion/tests/README and
subversion/tests/cmdline/README it references. Any other pointers or
tips are welcome of course :-).

Cheers,
-- 
Johan


Re: Interrupting an update after change of externals causes corrupt working copy

2010-11-15 Thread Johan Corveleyn
Almost forgot about this. It's now filed in the issue tracker:
http://subversion.tigris.org/issues/show_bug.cgi?id=3751

Cheers,
Johan

On Thu, Sep 30, 2010 at 10:03 PM, Johan Corveleyn jcor...@gmail.com wrote:
 Hi devs,

 As per Daniel Shahaf's suggestion, I'm continuing this discussion on
 the dev list. See the previous mails in this thread on the users list
 for some context (or
 http://svn.haxx.se/users/archive-2010-09/0406.shtml). I'll summarize
 below.

 This issue reproduces with 1.6.12 as well as with trunk. It would be
 interesting to know the WC-NG folks take on this.

 Note: the subject line may be slightly exaggerated, as there seems to
 be a good workaround to repair things (at least in case of
 intra-repository externals).

 Summary
 ---

 Steps to reproduce:
 1) Change a directory external definition.

 2) svn update

 3) Interrupt (Ctrl-C) while it is processing the switch of the
 external. I did this after some files of the external were already
 Updated.

 Result:
 svn status now shows a lot of files and directories as S (switched),
 and some directories missing. Running svn info on the switched
 files/dirs shows that they still point to the old external location.

 Running svn update again fixes the missing dirs, but not the
 switched nodes. svn revert does nothing (which is logical: there are
 no local changes). svn cleanup does nothing.

 Workaround:
 Running svn switch $new_url $external_dir brings everything back to normal.


 So:
 - Wouldn't it be better if svn update could repair this (as in
 resuming the interrupted update from before)?
 - Should I file an issue for this?
 - I'm not sure how all this plays out with externals from remote
 repositories (haven't tested this). With or without a change of the
 remote location. Would an svn switch --relocate fix things in the
 same way then?

 Note: this can be easily reproduced with any large (public) repository
 (I think). Just define some externals in working copy only (no need to
 commit), update, change external definition (no need to commit),
 update, interrupt.

 Cheers,
 --
 Johan



Re: Issue triage: issues 3429 and 3474

2010-11-16 Thread Johan Corveleyn
On Mon, Nov 8, 2010 at 1:02 PM, Philip Martin
philip.mar...@wandisco.com wrote:
 Johan Corveleyn jcor...@gmail.com writes:

 Or, maybe the best approach: I could add a regression test for these
 issues, so we can all be sure that they are fixed (and remain fixed),
 after which they can be marked as fixed.

 Yes, please.  Are there any existing XFAIL tests that apply?  They
 sometimes don't XPASS automatically when the bug is fixed because the
 test expectation is wrong.

Doh, it seems that issue #3474 already has a test, which PASSes (added
by Bert, which he mentioned in a comment in the issue):

PASS:  copy_tests.py 81: copy of new dir with copied file keeps history

This is exactly what issue #3474 is about. Bert added the test as
XFAIL in r938071, and it was marked PASS by you, Philip, in r955334.

So, I guess this wraps up that issue: can someone mark it as resolved?

There was a slight confusion when I read the test code, because a
comment still talks about the tests as Currently this fails because
 The following patch removes that obsolete comment:

[[[
Index: subversion/tests/cmdline/copy_tests.py
===
--- subversion/tests/cmdline/copy_tests.py  (revision 1035851)
+++ subversion/tests/cmdline/copy_tests.py  (working copy)
@@ -4358,8 +4358,6 @@ def copy_added_dir_with_copy(sbox):
   'NewDir2/mu': Item(status='A ', copied='+', wc_rev='-'),
 })

-  # Currently this fails because NewDir2/mu loses its history in the copy
-  # from NewDir to NewDir2
   svntest.actions.run_and_verify_status(wc_dir, expected_status)



]]]


For issue #3429, I'll try to write a regression test myself, and post
a patch for it in a new thread.

Cheers,
-- 
Johan


[PATCH] copy_tests.py - expand move_file_back_and_forth, to verify issue #3429

2010-11-16 Thread Johan Corveleyn
The attached patch expands the test move_file_back_and_forth
(copy_tests.py 45) as a regression test for issue#3429.

[[[
Expand move_file_back_and_forth test to verify issue #3429
(svn mv A B; svn mv B A generates replace without history).

* subversion/tests/cmdline/copy_tests.py
  (move_file_back_and_forth): Check expected status before commit.
]]]

Cheers,
-- 
Johan
Index: subversion/tests/cmdline/copy_tests.py
===
--- subversion/tests/cmdline/copy_tests.py  (revision 1035851)
+++ subversion/tests/cmdline/copy_tests.py  (working copy)
@@ -2428,17 +2428,27 @@ def move_file_back_and_forth(sbox):
   svntest.actions.run_and_verify_svn(None, None, [], 'mv',
  rho_move_path, rho_path)
 
+  # Create expected status tree before commit
+  expected_status = svntest.actions.get_virginal_state(wc_dir, 1)
+  expected_status.add({
+    'A/D/G/rho' : Item(status='R ', copied='+', wc_rev='-'),
+    })
+
+  # Test status before commit
+  svntest.actions.run_and_verify_status(wc_dir, expected_status)
+
   # Created expected output tree for 'svn ci':
   expected_output = svntest.wc.State(wc_dir, {
 'A/D/G/rho' : Item(verb='Replacing'),
 })
 
-  # Create expected status tree
+  # Create expected status tree after commit
   expected_status = svntest.actions.get_virginal_state(wc_dir, 1)
   expected_status.add({
 'A/D/G/rho' : Item(status='  ', wc_rev=2),
 })
 
+  # Test commit output and status
   svntest.actions.run_and_verify_commit(wc_dir,
 expected_output,
 expected_status,


Re: [PATCH] copy_tests.py - expand move_file_back_and_forth, to verify issue #3429

2010-11-16 Thread Johan Corveleyn
On Wed, Nov 17, 2010 at 2:14 AM, Johan Corveleyn jcor...@gmail.com wrote:
 The attached patch expands the test move_file_back_and_forth
 (copy_tests.py 45) as a regression test for issue#3429.

 [[[
 Expand move_file_back_and_forth test to verify issue #3429
 (svn mv A B; svn mv B A generates replace without history).

 * subversion/tests/cmdline/copy_tests.py
  (move_file_back_and_forth): Check expected status before commit.
 ]]]

I just realized that I can add a similar check to
move_dir_back_and_forth. Here is a second patch that expands both
tests.

(note: can the if svntest.main.wc_is_singledb(wc_dir) be dropped
from move_dir_back_and_forth?)

Log message:
[[[
Expand move_file_back_and_forth and move_dir_back_and_forth tests to verify
issue #3429 (svn mv A B; svn mv B A generates replace without history).

* subversion/tests/cmdline/copy_tests.py
  (move_file_back_and_forth): Check expected status before commit.
  (move_dir_back_and_forth): Check expected status after executing the moves.
]]]

Cheers,
-- 
Johan
Index: subversion/tests/cmdline/copy_tests.py
===
--- subversion/tests/cmdline/copy_tests.py  (revision 1035851)
+++ subversion/tests/cmdline/copy_tests.py  (working copy)
@@ -2428,17 +2428,27 @@ def move_file_back_and_forth(sbox):
   svntest.actions.run_and_verify_svn(None, None, [], 'mv',
  rho_move_path, rho_path)
 
+  # Create expected status tree before commit
+  expected_status = svntest.actions.get_virginal_state(wc_dir, 1)
+  expected_status.add({
+    'A/D/G/rho' : Item(status='R ', copied='+', wc_rev='-'),
+    })
+
+  # Test status before commit
+  svntest.actions.run_and_verify_status(wc_dir, expected_status)
+
   # Created expected output tree for 'svn ci':
   expected_output = svntest.wc.State(wc_dir, {
 'A/D/G/rho' : Item(verb='Replacing'),
 })
 
-  # Create expected status tree
+  # Create expected status tree after commit
   expected_status = svntest.actions.get_virginal_state(wc_dir, 1)
   expected_status.add({
 'A/D/G/rho' : Item(status='  ', wc_rev=2),
 })
 
+  # Test commit output and status
   svntest.actions.run_and_verify_commit(wc_dir,
 expected_output,
 expected_status,
@@ -2474,6 +2484,23 @@ def move_dir_back_and_forth(sbox):
   svntest.actions.run_and_verify_svn(None, None, expected_err,
  'mv', D_move_path, D_path)
 
+  if svntest.main.wc_is_singledb(wc_dir):
+    # Verify if status indicates a replace with history
+    expected_status = svntest.actions.get_virginal_state(wc_dir, 1)
+    expected_status.add({
+      'A/D'           : Item(status='R ', copied='+', wc_rev='-'),
+      'A/D/G'         : Item(status='  ', copied='+', wc_rev='-'),
+      'A/D/G/pi'      : Item(status='  ', copied='+', wc_rev='-'),
+      'A/D/G/rho'     : Item(status='  ', copied='+', wc_rev='-'),
+      'A/D/G/tau'     : Item(status='  ', copied='+', wc_rev='-'),
+      'A/D/gamma'     : Item(status='  ', copied='+', wc_rev='-'),
+      'A/D/H'         : Item(status='  ', copied='+', wc_rev='-'),
+      'A/D/H/chi'     : Item(status='  ', copied='+', wc_rev='-'),
+      'A/D/H/omega'   : Item(status='  ', copied='+', wc_rev='-'),
+      'A/D/H/psi'     : Item(status='  ', copied='+', wc_rev='-'),
+      })
+    svntest.actions.run_and_verify_status(wc_dir, expected_status)
+
 def copy_move_added_paths(sbox):
   copy and move added paths without commits
 


Re: Issue triage: issues 3429 and 3474

2010-11-17 Thread Johan Corveleyn
On Wed, Nov 17, 2010 at 7:14 PM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 Johan Corveleyn wrote on Wed, Nov 17, 2010 at 01:25:24 +0100:
 On Mon, Nov 8, 2010 at 1:02 PM, Philip Martin
 philip.mar...@wandisco.com wrote:
  Johan Corveleyn jcor...@gmail.com writes:
 
  Or, maybe the best approach: I could add a regression test for these
  issues, so we can all be sure that they are fixed (and remain fixed),
  after which they can be marked as fixed.
 
  Yes, please.  Are there any existing XFAIL tests that apply?  They
  sometimes don't XPASS automatically when the bug is fixed because the
  test expectation is wrong.

 Doh, it seems that issue #3474 already has a test, which PASSes (added
 by Bert, which he mentioned in a comment in the issue):

 PASS:  copy_tests.py 81: copy of new dir with copied file keeps history

 This is exactly what issue #3474 is about. Bert added the test as
 XFAIL in r938071, and it was marked PASS by you, Philip, in r955334.

 So, I guess this wraps up that issue: can someone mark it as resolved?


 Could you mark it as resolved?  Or don't you have the necessary
 permissions on tigris?

Ok, done (didn't know I could do that myself :-)).

 There was a slight confusion when I read the test code, because a
 comment still talks about the tests as "Currently this fails because
 ...". The following patch removes that obsolete comment:

 [[[
 Index: subversion/tests/cmdline/copy_tests.py
 ===
 --- subversion/tests/cmdline/copy_tests.py    (revision 1035851)
 +++ subversion/tests/cmdline/copy_tests.py    (working copy)
 @@ -4358,8 +4358,6 @@ def copy_added_dir_with_copy(sbox):
        'NewDir2/mu'        : Item(status='A ', copied='+', wc_rev='-'),
      })

 -  # Currently this fails because NewDir2/mu loses its history in the copy
 -  # from NewDir to NewDir2
    svntest.actions.run_and_verify_status(wc_dir, expected_status)



 +1, but could you add a link to the issue (in a comment) while you're
 there?

Committed in r1036306.

Cheers,
-- 
Johan


Diff optimizations progress (or not)

2010-11-20 Thread Johan Corveleyn
[ adding dev@ to discussion between danielsh and me about
(non-)progress of the diff-optimizations work. ]

On Wed, Nov 17, 2010 at 3:21 AM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 [ if you want to, you can add dev@ to the CC on replies. ]

 Johan Corveleyn wrote on Wed, Nov 17, 2010 at 00:57:53 +0100:
 I have made some progress locally, but not enough to commit it yet:
 - The implementation I've sent in my latest patch-mail to the list
 works well, but needs some final wrap up (to make it work with diff3
 and diff4). That is: if we decide this implementation is the way to
 go. (I also made a slight variation by implementing the critical
 functions (incrementing pointers) with a macro, which made it another
 10-15% faster).


 As I said on IRC: don't wait for everything to be gift-wrapped before
 you commit.  That is not our expectation from you, and it isn't a good
 approach to collaborative development of moderately large changes.

 If you have an implementation that works for diff2 and doesn't break
 diff3+diff4, commit it, and extend it from 2-sources to N-sources in
 subsequent commits.  If you have something that works and you think can
 be made faster, why not commit it now and then make another commit that
 implements the optimization?

Actually, this approach (the one that eliminates identical
prefix/suffix byte-per-byte, when opening the datasources) was already
almost finished with the latest patch I sent to the list. It was
already extended to handle N datasources. The only thing left to do
was to make diff3+diff4 handle the results in a meaningful way (I
punted on this for now, because I don't understand those algorithms
very well).

But ultimately, I was really doubtful about the low-level approach
that I took, and would really prefer the token-approach if that would
work out equally well (I think it will be better, more readable,
faster, more flexible, ...).

 - However, as I already hinted in that mail thread: I'm not satisfied
 with that implementation, and I'm attempting another way in the
 token-parsing layer (working with entire tokens/lines, instead of
 byte-per-byte). It's this new way that I'd like to get working, and
 commit to the branch.


 Committing an approach to a branch doesn't commit you to that approach.

 If you have working code (as in: doesn't break anything), which is
 a step on the way to implementing what --- at that point in time ---
 you'd like the branch to implement, and the code is commit-worthy (i.e.,
 reviewable, documented, and consists of a single logical change) ---
 then what reason do you have not to commit it?

 If you think the code could be improved later --- or even that a totally
 different approach would be better --- then do investigate those.  But
 --- in order to keep others in the loop (which calls for *small* --- and
 thus frequent --- updates), or to checkpoint your progress, or just
 because the investigation is going to take some time --- you'll first
 commit the piece of code that *is* working and *is* an improvement, and
 later see how it can be improved further, than not commit it at all and
 later bomb dev@ with a patch plus a too-big-to-be-reviewable "Here are
 N other approaches that I ruled out".

 In short, unlike on trunk, on aptly-named experimental branches (from
 "to experiment") you don't have to be certain a patch is 1.8.0-worthy in
 order to commit it.

 Iterating in public on dev@ enables the whole process (including the
 decision to scrub, if necessary) to benefit from everyone's inputs, and
 to get those inputs as early as possible.  It allows people to chime in
 if they see you take an approach they know won't help, and it allows you
 to turn to the list for advice then you get stuck.


 [1] You may create as many branches as you need.  (You don't need to
 ask permission.)  If organizing your code on two branches makes your
 life easier, go for it!

So I have the byte-approach nearly finalized, but I don't like it as
much as the token-approach, which still needs a lot of work :-(. And
they are mutually exclusive (doesn't make sense to implement both), so
I wouldn't like to commit both approaches to the same branch (I would
have to undo the first approach before committing the second one). So
maybe the way to go is to have two branches for the two
implementations ...

I could then commit the byte-approach (which is nearly finished) in
small steps (making it easier to review) in one branch, and the new
token-approach to the other (that is, as soon as I have something
commitable, see below). We can then discuss which of the two
approaches would be preferred...

 - With the token-way, I almost have the prefix part working (ran
 into some problem when rolling back the token at which the first
 difference is encountered (making svn crash); I know how to solve it,
 but I just need to do it).


 You could commit the token-way without the prefix part --- with just
 suffix-scan enabled --- and implement prefix scanning

Re: AW: How to find out the rev number where a file was deleted?

2010-11-29 Thread Johan Corveleyn
[ moving to dev@ ]

Following up on a discussion on the users list about the lack of a way
to easily find the rev number in which a file was deleted...

Already referred to issue #3627 (FS API support for oldest-to-youngest
history traversal) and FS-NG, as mentioned on the roadmap. But the
discussion continued about why this is so hard right now, and if there
are alternative approaches. See below...

On Mon, Nov 29, 2010 at 3:51 AM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 Johan Corveleyn wrote on Sun, Nov 28, 2010 at 21:20:28 +0100:
 On Sun, Nov 28, 2010 at 6:35 PM, Daniel Shahaf d...@daniel.shahaf.name 
 wrote:
  Stefan Sperling wrote on Sun, Nov 28, 2010 at 16:48:30 +0100:
  The real problem is that we want to be able to answer these questions
  very fast, and some design aspects work against this. For instance,
  FSFS by design does not allow modifying old revisions. So where do
  we store the copy-to information for a given p...@n?
 
  copy-to information is immutable (never changes once created), so we
  could add another hierarchy (parallel to revs/ and revprops/) in which
  to store that information.  Any 'cp f...@n bar' operation would need to
  create/append a file in that hierarchy.
 
  Open question: how to organize $new_hierarchy/16/16384/** to make it
  efficiently appendable and queryable (and for what queries? Iterate
  all copied-to places is one).
 
  Makes sense?

 I'm not sure. But there is another alternative: while we wait for
 FS-NG (or another solution like you propose), one could implement the
 slow algorithm within the current design.

 Are you advocating to implement it in the core (as an svn_fs_* API) or
 as a third-party script?  The latter is certainly fine, but regarding
 the former I don't see the point of adding an API that cannot be
 implemented efficiently at this time.

Why not in the core? "We can't do this quickly, so we don't do it" is
not a very strong argument against having this very useful
functionality IMHO.

Having it in the core is vastly more useful for people like me (and my
colleagues): works on Windows, regardless of whether or not one has
perl/python installed, no need to distribute an additional script,
guaranteed to be available everywhere an svn client is installed, ...

It's actually quite similar to the way blame is implemented
currently: we don't really have the design (line-based information) to
do this quickly, but we calculate it from the other information that
we have available (in a way that could also be done by a script on the
client: diffing every interesting revision against the next,
remembering the lines that were added/removed in every step). Can you
imagine not having blame in svn core just because we can't do it
quickly? Ok, blame may be a more important use case than finding the
rev number where a file was deleted, but still ...

So I still think it's definitely worth it to have this in the core and
offer an API, and implement it slowly now because that's the only way
we can do it (besides, I don't think it will be *that* slow). And
optimize it later when we have FS-NG, or another way to retrieve
this info quickly...
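
Just to make it concrete: the slow approach really is just a binary
search over "does the path still exist in this revision". A minimal
sketch (path_exists() is a made-up helper here, not a real API; it
could be built on top of e.g. svn_ra_check_path(). It also ignores the
complication that a path can be deleted and later resurrected):

[[[
#include "svn_types.h"

/* Assumed helper, not a real API: return TRUE iff PATH exists in REV. */
static svn_boolean_t path_exists(const char *path, svn_revnum_t rev);

/* Find the revision in which PATH was deleted, given that it exists in
 * LO and no longer exists in HI. */
static svn_revnum_t
find_deletion_rev(const char *path, svn_revnum_t lo, svn_revnum_t hi)
{
  while (hi - lo > 1)
    {
      svn_revnum_t mid = lo + (hi - lo) / 2;

      if (path_exists(path, mid))
        lo = mid;       /* still present: the deletion happened later */
      else
        hi = mid;       /* already gone: the deletion is at or before MID */
    }

  return hi;            /* first revision in which PATH is gone */
}
]]]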

However, having said all that doesn't change the fact that someone
still needs to implement it, and I must admit I don't have the cycles
for that currently :-(.

Cheers,
Johan

 Just automating what a
 user (or script) currently does when looking for this information,
 i.e. a binary search.

 Of course it would be slow, but it would certainly already provide
 value. At the very least, it saves users a lot of time searching FAQ's
 and list archives, wondering why this doesn't work, understanding the
 design limitations, and then finally implementing their own script or
 doing a one-time manual search.

 Then, when FS-NG arrives, or someone comes up with a way to index this
 information, it can be optimized.

 I don't know if there would be fundamental problems with that, apart
 from the fact that someone still needs to implement it of course ...

 Cheers,
 --
 Johan



diff-optimizations-tokens branch: I think I'm going to abandon it

2010-11-30 Thread Johan Corveleyn
Hi devs,

As mentioned in [1], I've created two branches to try out two
different approaches for the diff optimizations of prefix/suffix
scanning.

The first one, diff-optimizations-bytes, has a working implementation
of the optimization. It still has some open todo items, but it
basically works.

The second one, diff-optimizations-tokens, takes a more high-level
approach by working in the token handling layer. It takes the
extracted lines as a whole, and compares them, to scan for identical
prefix and suffix. I preferred this new approach, because it seemed
more elegant (and works both for diff_file and diff_memory (property
diffs)). However, although the token-based prefix scanning works
adequately, I'm now stuck with the suffix scanning.

I am now considering to abandon the tokens-approach, for the following reasons:

1) There is still a lot of work. Scanning for identical suffix is
quite difficult, because we now have to extract tokens (lines) in
reverse. I've put in place a stub for that function
(datasource_get_previous_token), but that still needs to be
implemented. And that's the hardest part, IMHO.

Not only that, but I just realized that I'll also have to implement a
reverse variant of util.c#svn_diff__normalize_buffer (which contains
the encouraging comment "It only took me forever to get this routine
right, ..." (added by ehu in r866123)), and maybe also of token_compare
(not sure).

2) I'm beginning to see that token-based suffix scanning will not be
as fast as byte-based suffix scanning. Simply because, in the case of
byte-based suffix scanning, we can completely ignore line structure.
We never have to compare characters with \n or \r, we just keep
reading bytes and comparing them. So there is an extra overhead for
token-based suffix scanning.
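
To illustrate 2) a bit more concretely, the byte-based suffix scan is in
essence nothing more than this (a stripped-down sketch with made-up
names, ignoring the chunked reading, normalization and all the other
details of the real code on the branch):

[[[
#include <apr.h>

/* Compare backwards from the last bytes of two fully-buffered files.
 * PA/PB point at the last byte, PA_MIN/PB_MIN at the first byte that
 * may still be eaten as suffix.  Note: no tests against '\r' or '\n',
 * we just compare raw bytes. */
static apr_size_t
count_identical_suffix(const char *pa, const char *pa_min,
                       const char *pb, const char *pb_min)
{
  apr_size_t suffix = 0;

  while (pa >= pa_min && pb >= pb_min && *pa == *pb)
    {
      pa--;
      pb--;
      suffix++;
    }

  return suffix;
}
]]]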


So, unless someone can convince me otherwise, I'm probably going to
stop with the token approach. Because of 2), I don't think it's worth
it to spend the effort needed for 1), especially because the
byte-based approach already works.

Any thoughts?

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-11/0416.shtml


Re: diff-optimizations-tokens branch: I think I'm going to abandon it

2010-12-01 Thread Johan Corveleyn
On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
 I am now considering to abandon the tokens-approach, for the following 
 reasons:
 ...
 So, unless someone can convince me otherwise, I'm probably going to
 stop with the token approach. Because of 2), I don't think it's worth
 it to spend the effort needed for 1), especially because the
 byte-based approach already works.


 In other words, you're saying that the token-based approach: (b) won't be
 as fast as the bytes-based approach can be, and (a) requires much effort
 to be spent on implementing the reverse reading of tokens?  (i.e.,
 a backwards gets())

Yes.

The reverse reading is quite hard (in the diff_file.c case) because of
the chunked reading of files. A line may be split by a chunk
boundary (it may even be split in the middle of an eol sequence
(between \r and \n), and it all still needs to be
canonicalized/normalized correctly (that's why we'll also need a
reverse normalization function). The current forward get_next_token
does this very well, and the reverse should be analogous, but I just
find it hard to reason about, and to get it implemented correctly. It
will take me a lot of time, and with a high probability of introducing
subtle bugs for special edge cases.
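
To give an idea: even the easy part of reading a line backwards, finding
the start of the previous line inside a chunk that is already in memory,
needs some care; the hard part (a line or even an eol sequence straddling
a chunk boundary) comes on top of that. A rough sketch of just that easy
part (made-up names, single chunk only):

[[[
/* Return the first byte of the last line in [CHUNK_START, CHUNK_END).
 * Does not handle lines that start in a previous chunk. */
static const char *
start_of_last_line(const char *chunk_start, const char *chunk_end)
{
  const char *p = chunk_end - 1;

  if (p >= chunk_start && *p == '\n')   /* skip a trailing LF ... */
    p--;
  if (p >= chunk_start && *p == '\r')   /* ... and a CR just before it */
    p--;
  while (p >= chunk_start && *p != '\n' && *p != '\r')
    p--;

  return p + 1;
}
]]]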

 Any thoughts?


 -tokens/BRANCH-README mentions one of the advantages of the tokens
 approach being that the comparison is done only after whitespace and
 newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
 effect).  IIRC, the -bytes approach doesn't currently take advantage of
 these -x flags?

 What is the practical effect of the fact the -bytes approach doesn't
 take advantage of these flags?  If a file (with a moderately long
 history) has had all its newlines changed in rN, then I assume that your
 -bytes optimizations will speed up all the diffs that 'blame' performs
 on that file, except for the single diff between rN and its predecessor?

Yes, I thought that would be an important advantage of the tokens
approach, but as you suggest: the practical effect will most likely be
limited. Indeed, only this single diff between rN and its predecessor
(which may be 1 in 1000 diffs) will not benefit from
prefix/suffix-optimization. Even if the rate of such changes is like
1/100th (let's say you change leading tabs to spaces, and vice versa,
every 100 revisions), it will hardly be noticeable.

The perfectionist in me still thinks: hey, you can also implement this
normalization with the byte-based approach (in a streamy way). But
that will probably be quite hard, because I'd have to take into
account that I might have to read more bytes from one file than from
the other file (right now I can always simply read a single byte from
both files). And other than that, it will at least be algorithm
duplication: it will be a (streamy) re-implementation of the
normalization algorithm that's already used in the token layer. So all
in all it's probably not worth it ...

Cheers,
-- 
Johan


Re: diff-optimizations-tokens branch: I think I'm going to abandon it

2010-12-01 Thread Johan Corveleyn
On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
 On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf d...@daniel.shahaf.name 
 wrote:
  Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
  I am now considering to abandon the tokens-approach, for the following 
  reasons:
  ...
  So, unless someone can convince me otherwise, I'm probably going to
  stop with the token approach. Because of 2), I don't think it's worth
  it to spend the effort needed for 1), especially because the
  byte-based approach already works.
 
 
  In other words, you're saying that the token-based approach: (b) won't be
  as fast as the bytes-based approach can be, and (a) requires much effort
  to be spent on implementing the reverse reading of tokens?  (i.e.,
  a backwards gets())

 Yes.

 The reverse reading is quite hard (in the diff_file.c case) because of
 the chunked reading of files. A line may be split by a chunk
 boundary (it may even be split in the middle of an eol sequence
 (between \r and \n), and it all still needs to be
 canonicalized/normalized correctly (that's why we'll also need a
 reverse normalization function). The current forward get_next_token
 does this very well, and the reverse should be analogous, but I just
 find it hard to reason about, and to get it implemented correctly. It
 will take me a lot of time, and with a high probability of introducing
 subtle bugs for special edge cases.


 OK, so a reverse get_next_token() could be tricky to implement.  But,
 worse, won't having it mean that implementors of svn_diff_fns_t can't
 have streamy access to their source?  Since they would be required to
 provide sometimes a token from the start and sometimes a token from the
 end...

 Well, it's not streamy in our usual sense of the word, but it's
 double-streamy (no one requests the middle byte until either all
 bytes before or all bytes after it were transmitted already)

Oooh, I hadn't considered other implementors besides the internal
diff_memory.c and diff_file.c. You're right, that would impose
additional constraints on other implementors. I don't know if being
non-streamy (or less streamy anyway) would be a problem ...

In my last commit on the -tokens branch, I added a flag open_at_end
to the datasource_open function (function of svn_diff_fns_t), so the
datasource could be opened at the end. Also, other supporting
functions were needed: token_pushback_prefix, token_pushback_suffix
(to push back tokens that were read too far when scanning for
prefix/suffix) and the dreaded datasource_get_previous_token. Anyway,
the high-level idea was:

1) Open datasources at the end.

2) Scan for identical suffix (getting tokens in reverse).

3) At first non-match: pushback suffix token, and note where suffix starts.

4) Open datasources at the beginning.

5) Scan for identical prefix (getting tokens normally, but without hash).

6) At first non-match: pushback prefix token.

7) Run the old algorithm: getting tokens forward, but with hash. The
get_next_token function should stop (return NULL for token) when it
arrives at the suffix.


Sidestep: I just now realized that I probably don't need to have the
reverse normalization algorithm for implementing get_previous_token.
The call to the normalization function in get_next_token is (I think)
only needed to be able to calculate the hash. But since
get_previous_token doesn't need to calculate hashes, I may be able to
get by without normalization there. I'd only need to normalize inside
token_compare, and I *think* I can just do that forwardly, instead
of backwards. Just thinking out loud here ...

So: that makes the token-approach again a little bit more possible.
But do we want it? It requires a lot more from implementors of
svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix
optimization to all implementors of svn_diff_fns_t ...

  Any thoughts?
 
 
  -tokens/BRANCH-README mentions one of the advantages of the tokens
  approach being that the comparison is done only after whitespace and
  newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in
  effect).  IIRC, the -bytes approach doesn't currently take advantage of
  these -x flags?
 
  What is the practical effect of the fact the -bytes approach doesn't
  take advantage of these flags?  If a file (with a moderately long
  history) has had all its newlines changed in rN, then I assume that your
  -bytes optimizations will speed up all the diffs that 'blame' performs
  on that file, except for the single diff between rN and its predecessor?

 Yes, I thought that would be an important advantage of the tokens
 approach, but as you suggest: the practical effect will most likely be
 limited. Indeed, only this single diff between rN and its predecessor
 (which may be 1 in 1000 diffs) will not benefit from
 prefix/suffix-optimization. Even if the rate of such changes is like
 1/100th (let's say you change

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

2010-12-02 Thread Johan Corveleyn
On Thu, Dec 2, 2010 at 6:43 AM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 [ finally getting back to this mail; having slept on it, etc. ]

 Johan Corveleyn wrote on Wed, Dec 01, 2010 at 13:34:48 +0100:
 On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf d...@daniel.shahaf.name 
 wrote:
  Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
  On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf d...@daniel.shahaf.name 
  wrote:
   Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
   I am now considering to abandon the tokens-approach, for the following 
   reasons:
   ...
   So, unless someone can convince me otherwise, I'm probably going to
   stop with the token approach. Because of 2), I don't think it's worth
   it to spend the effort needed for 1), especially because the
   byte-based approach already works.
  
  
   In other words, you're saying that the token-based approach: (b) won't 
   be
   as fast as the bytes-based approach can be, and (a) requires much effort
   to be spent on implementing the reverse reading of tokens?  (i.e.,
   a backwards gets())
 
  Yes.
 
  The reverse reading is quite hard (in the diff_file.c case) because of
  the chunked reading of files. A line may be split by a chunk
  boundary (it may even be split in the middle of an eol sequence
  (between \r and \n), and it all still needs to be
  canonicalized/normalized correctly (that's why we'll also need a
  reverse normalization function). The current forward get_next_token
  does this very well, and the reverse should be analogous, but I just
  find it hard to reason about, and to get it implemented correctly. It
  will take me a lot of time, and with a high probability of introducing
  subtle bugs for special edge cases.
 
 
  OK, so a reverse get_next_token() could be tricky to implement.  But,
  worse, won't having it mean that implementors of svn_diff_fns_t can't
  have streamy access to their source?  Since they would be required to
  provide sometimes a token from the start and sometimes a token from the
  end...
 
  Well, it's not streamy in our usual sense of the word, but it's
  double-streamy (no one requests the middle byte until either all
  bytes before or all bytes after it were transmitted already)

 Oooh, I hadn't considered other implementors besides the internal
 diff_memory.c and diff_file.c.

 Just to clarify: diff_file.c and diff_memory.c are the only implementors
 of svn_diff_fns_t in the core code, correct?

Yes.

 You're right, that would impose
 additional constraints on other implementors. I don't know if being
 non-streamy (or less streamy anyway) would be a problem ...


 We should have asked this before, but:

 Do we know who are the other implementors / typical use-cases of
 svn_diff_fns_t?

Yeah, was wondering about this too. Are there in fact other
implementors? Maybe plugins for IDE's or something? How could we find
out? How public is this API in fact?

For blame, I know all revisions are first converted to full-texts,
which are written out to temp files. Then diff_file.c works on those
two files.

 In my last commit on the -tokens branch, I added a flag open_at_end
 to the datasource_open function (function of svn_diff_fns_t), so the
 datasource could be opened at the end. Also, other supporting
 functions were needed: token_pushback_prefix, token_pushback_suffix
 (to push back tokens that were read too far when scanning for
 prefix/suffix) and the dreaded datasource_get_previous_token. Anyway,
 the high-level idea was:

 1) Open datasources at the end.

 2) Scan for identical suffix (getting tokens in reverse).

 3) At first non-match: pushback suffix token, and note where suffix starts.

 4) Open datasources at the beginning.

 5) Scan for identical prefix (getting tokens normally, but without hash).

 6) At first non-match: pushback prefix token.

 7) Run the old algorithm: getting tokens forward, but with hash. The
 get_next_token function should stop (return NULL for token) when it
 arrives at the suffix.


 Sidestep: I just now realized that I probably don't need to have the
 reverse normalization algorithm for implementing get_previous_token.
 The call to the normalization function in get_next_token is (I think)
 only needed to be able to calculate the hash. But since
 get_previous_token doesn't need to calculate hashes, I may be able to
 get by without normalization there. I'd only need to normalize inside
 token_compare, and I *think* I can just do that forwardly, instead
 of backwards. Just thinking out loud here ...


 Is normalization function the thing that collapses newlines and
 whitespaces if -x-w or -x--ignore-eol-style is in effect?  If so,
 another purpose of calling that might be to make suffixes that are
 not bytewise-identical, but only modulo-whitespace-identical, also
 be lumped into the identical suffix.

 (Haven't dived into the code yet; the above is based on my understanding
 of your high-level descriptions)

Yes, it's svn_diff__normalize_buffer

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

2010-12-02 Thread Johan Corveleyn
On Thu, Dec 2, 2010 at 4:21 PM, Julian Foad julian.f...@wandisco.com wrote:
 Hi Johan.

 I've just read the whole of this thread.

 I didn't quite understand your original point (2) that token-based
 suffix scanning will not be as fast as byte-based suffix scanning.
 Sure it won't, but is there any reason you mentioned suffix scanning
 there specifically?  The same is true of prefix scanning, of course.

I mentioned suffix specifically, because for prefix scanning the bytes
approach doesn't have that same advantage. It has to count the number
of lines, to have a correct line offset for the rest of the normal
algorithm. So the byte-based prefix scanning needs to check for
eol-characters just like the token-based version. That means that
theoretically (implementation details aside) both approaches should be
equally fast for prefix scanning.

But for suffix scanning, it seems I don't need to know the number of
lines, so the bytes approach can just ignore line structure for the
suffix.
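
To make the contrast concrete, this is roughly what the byte-based prefix
scan has to do (again just a sketch with made-up names; the real code
works chunk-wise and also has to take care not to eat a partially
matching last line):

[[[
#include <apr.h>

/* Count the number of complete identical lines at the start of two
 * fully-buffered files.  Unlike the suffix scan, every byte has to be
 * checked against '\n' so the rest of the algorithm gets a correct
 * line offset.  (Handling of '\r' / '\r\n' omitted here.) */
static apr_off_t
count_identical_prefix_lines(const char *pa, const char *pa_end,
                             const char *pb, const char *pb_end)
{
  apr_off_t lines = 0;

  while (pa < pa_end && pb < pb_end && *pa == *pb)
    {
      if (*pa == '\n')
        lines++;
      pa++;
      pb++;
    }

  return lines;
}
]]]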

 And both of them could be fast enough, I assume, if you disable the hash
 calculation in the get token callbacks like you were talking about.

Granted, the difference could be minimal (I haven't actually tested).

However, an additional two comparisons (against \r and \n), for every
byte in the suffix of say 500Kb, for say 2000 diffs of a blame
operation, it might cost another couple of seconds :-).
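
(Very rough arithmetic behind that guess: ~500 Kb of suffix x ~2000
diffs x 2 files is on the order of 2 * 10^9 bytes, so one or two extra
comparisons per byte quickly add up to a few seconds, even at only a
nanosecond or so per comparison.)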

 But I don't think that necessarily affects the main point.  It looks
 like you've thoroughly investigated using a token based approach.  Thank
 you for doing so.  My initial feeling that it was worth investigating
 was in the hope that you might find some fairly straightforward and
 self-contained modification to the existing token-handling layer.  I
 think the result of this investigation, in which you needed to add
 token-fetch-backwards callbacks and so on, shows that this approach is
 too complex.  I don't want to see a complex implementation.  Therefore I
 support your inclination to abandon that approach and use the byte-wise
 approach instead.

Yep, it just seems too complex (as also confirmed by Bill Tutt in the
other mail).

The pushback stuff (which is inevitable, because we will always read
one token too far to discover the non-match, and that line needs to be
reread to calculate the hash) is also rather dirty/annoying (adds
additional svn_diff_fns_t-ness).

It was definitely worth investigating, if only to give me some peace
of mind that we have considered the options (and it was a great
learning experience).

Cheers,
-- 
Johan


Re: diff-optimizations-tokens branch: I think I'm going to abandon it

2010-12-02 Thread Johan Corveleyn
On Thu, Dec 2, 2010 at 6:18 PM, Bill Tutt b...@tutts.org wrote:
 Note: This email only tangentially relates to svn diff and more about
 reverse token scanning in general:

 As someone who has implemented suffix reverse token scanning before:

Thanks for the input. It's nice to see other people have also
struggled with this :-).

 * It simply isn't possible in DBCS code pages. Stick to byte only here.
    SBCS and UTF-16 make reverse token stuff relatively
 straightforward. UTF-8 is a little trickier but still tractable.
   At least UTF-8 is tractable in a way that DBCS isn't. You always
 know which part of a Unicode code point you are in. (i.e. byte 4 vs.
 byte 3 vs. etc...)

Ok, this further supports the decision to focus on the byte-based
approach. We'll only consider stuff identical if all bytes are
identical. That's the simplest route, and since it's only an
optimization anyway ...

 * I would recommend only supporting a subset of the diff options for
 reverse token scanning. i.e. ignore whitespace/ignore eol but skip
 ignore case (if svn has that, I forget...)

svn diff doesn't have an ignore-case option, so that's ok :-).

    If tokens include keyword expansion operations then stop once you
 hit one. The possible source of bugs outweighs the perf gain in my mind
 here.

Haven't thought about keyword expansion yet. But as you suggest: I'm
not going to bother doing special stuff for (expanded) keywords. If we
find a mismatch, we'll stop with the optimized scanning, and fall back
to the default algorithm.

 * Suffix scanning does really require a seekable stream, if it isn't
 seekable then don't perform the reverse scanning.  It is only an
 optimization after all.

Hm, yes, we'll need to be careful about that. I'll start another mail
thread asking for known implementors of the svn_diff_fns_t functions,
to find out whether seeking around like that for suffix would be
supported.

 Additional ignore whitespace related comment:
 * IIRC, Perforce had an interesting twist on ignoring whitespace. You
 could ignore just line leading/ending whitespace instead of all
 whitespace differences but pay attention to any whitespace change
 after the trim operation had completed.

 e.g.:
 *  "  aaa bbb "  vs "aaa bbb"  would compare as equal
 *  "  aaa  bbb " vs "aaa  bbb" would compare as equal
 *  "  aaa  bbb " vs "aaa bbb"  would compare as non-equal due to the
 white space change in the middle of the line

Cool (svn doesn't have that option). But I'm not sure what that would
be useful for (as a user, I can't immediately imagine an important use
case). Anyway, could still be a nice option...

Cheers,
-- 
Johan


Implementations of svn_diff_fns_t?

2010-12-02 Thread Johan Corveleyn
Hi,

This question came up during recent discussion about the
diff-optimizations-tokens branch [1]:

What are the known implementors of svn_diff_fns_t, the vtable of
svn_diff callback functions in subversion/include/svn_diff.h? Besides
the internal diff_memory.c and diff_file.c that is.

Are there other implementations out there of these callbacks? IDE
plugins, third-party tools, ...?

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-12/0057.shtml


Re: [PATCH] copy_tests.py - expand move_file_back_and_forth, to verify issue #3429

2010-12-08 Thread Johan Corveleyn
On Wed, Dec 8, 2010 at 3:17 PM, Julian Foad julian.f...@wandisco.com wrote:
 On Wed, 2010-11-17, Johan Corveleyn wrote:
 On Wed, Nov 17, 2010 at 2:14 AM, Johan Corveleyn jcor...@gmail.com wrote:
  The attached patch expands the test move_file_back_and_forth
  (copy_tests.py 45) as a regression test for issue#3429.
 [...]
 I just realized that I can add a similar check to
 move_dir_back_and_forth. Here is a second patch that expands both
 tests.

 Thank you, Johan.  Committed revision 1043427.  (I tweaked the comments
 a bit.)

 I have closed issue #3429.

Thanks!

 (note: can the "if svntest.main.wc_is_singledb(wc_dir)" check be dropped
 from move_dir_back_and_forth?)

 Because single-db is now always enabled?  Yes.  But it's sometimes nice
 to be able to go back and test older code with the current tests, so
 let's not hurry to remove it.

Ok, got it.

Cheers,
-- 
Johan


Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

2010-12-15 Thread Johan Corveleyn
On Wed, Dec 15, 2010 at 10:30 AM, Branko Čibej br...@xbc.nu wrote:
 On 15.12.2010 02:30, Stefan Fuhrmann wrote:
 On 14.12.2010 23:35, Johan Corveleyn wrote:
 Some considerations:
 - Maybe I can use APR_INLINE, with similar results?

 - Maybe I can put just the critical section into a macro (the curp++
 / curp-- part), and call a function when a chunk boundary is
 encountered (~ once every 131072 iterations (chunks are 128 Kb
 large)), to read in the new chunk, advancing variables, ...
 Prefer inlining because the compiler is free to ignore it.
 Depending on the target architecture and the compiler,
 it may be beneficial to narrow the scope of optimization:
 In large functions, the optimizer may guess wrongly about
 the hot spots.

 Certainly. These days it makes no sense to use macros instead of
 inlining, except in really, really extreme cases (of which there should
 be none in Subversion code), or where macros are used for tricks that
 the language proper doesn't support (debug-mode error location reporting
 being such an example).

 Which raises the question -- did you run your test case with a
 fully-optimized build, or with some debug-friendly build? Compilers
 these days /should/ be able to automatically inline static functions,
 regardless of explicit inline declarations

It was a Windows build (x86), with Release configuration, compiled
with Visual Studio Express 2008 (when measuring performance of this
branch, I always use a Release build, a Debug build is at least twice
as slow; I only change to a Debug build if I'm having problems,
needing to debug etc.).

I have not used any special compilation options, so maybe that could
also make a difference (but, to be honest, I'm lacking the expert
knowledge about C compilers, options, platform specific things etc.
for this).

Maybe the function is a little bit too large/complex for the compiler
to automatically inline it without any hints? Or maybe writing it
differently, like Stefan suggests, could help the compiler to inline
it without needing explicit inline declarations... I'll experiment a
little bit (in a couple of days, having a huge shortage of time now
:-)).

-- 
Johan


Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

2010-12-15 Thread Johan Corveleyn
On Wed, Dec 15, 2010 at 2:30 AM, Stefan Fuhrmann eq...@web.de wrote:
 On 14.12.2010 23:35, Johan Corveleyn wrote:

 Hi all,

 Hi Johan ;)

Hi Stefan, thanks for the input :). I suspected that you might have
some ideas about this ...

 On the diff-optimizations-bytes branch, in diff_file.c, there are two
 functions which are called for every byte of the identical prefix and
 suffix: increment_pointers and decrement_pointers. These functions are
 actually equivalents of curp++ or curp--, reading the next/previous
 byte, but taking into account the chunked-ness of the data (file data
 is read in memory in chunks).

 As an experiment I changed these functions into macro's, eliminating
 the function calls. This makes the diff algorithm another 10% - 15%
 faster (granted, this was measured with my extreme testcase of a 1,5
 Mb file (6 lines), of which most lines are identical
 prefix/suffix). However, having an entire function like that
 implemented as a macro isn't very pretty (see below for an example).

 I'm kind of surprised that the calling overhead is
 actually noticeable for larger files, i.e. larger values
 of file_len it should loop many times.

file_len is the size of the array of files, not the length of the
files themselves. In the typical case (the only case that's currently
supported) file_len is 2. That's because we're diffing just 2 files:
the original one and the modified one. The implementation is made
with an array and a file_len (and for loops), because I wanted it to
be generically useful also for diff3 (with 3 files: original, modified
and latest) and diff4 (4 files: original, modified, latest and
ancestor).

Also, I did my measurements with a blame operation on this very large
file, which has ~2500 revisions. So that's 2500 diffs of a ~1,5 Mb
file, with say on average 1 Mb of identical prefix/suffix every time.
That's some 2,500,000,000 function calls.

For measurement, I counted the total time of the blame operation, as
well as the cumulative time for the svn_diff_diff calls (with
apr_time_now() before and after).
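
For reference, that instrumentation is nothing fancier than something
like the following (a sketch: the wrapper and where exactly it hooks
into blame.c are made up, but svn_diff_file_diff_2 and apr_time_now are
the real APIs):

[[[
#include <apr_time.h>
#include "svn_diff.h"
#include "svn_error.h"

static apr_time_t total_diff_time = 0;  /* summed over the whole blame run */

/* Run one file diff and add its wall-clock time to TOTAL_DIFF_TIME
 * (apr_time_now() returns microseconds). */
static svn_error_t *
timed_file_diff(svn_diff_t **diff,
                const char *original, const char *modified,
                const svn_diff_file_options_t *options,
                apr_pool_t *pool)
{
  apr_time_t start = apr_time_now();

  SVN_ERR(svn_diff_file_diff_2(diff, original, modified, options, pool));

  total_diff_time += apr_time_now() - start;
  return SVN_NO_ERROR;
}
]]]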

 Some considerations:
 - Maybe I can use APR_INLINE, with similar results?

 - Maybe I can put just the critical section into a macro (the curp++
 / curp-- part), and call a function when a chunk boundary is
 encountered (~ once every 131072 iterations (chunks are 128 Kb
 large)), to read in the new chunk, advancing variables, ...

 Prefer inlining because the compiler is free to ignore it.
 Depending on the target architecture and the compiler,
 it may be beneficial to narrow the scope of optimization:
 In large functions, the optimizer may guess wrongly about
 the hot spots.

 - Maybe it's not worth it?

 Since inlining is for free from the maintenance POV,
 any gain should justify it.

 Thoughts?

 Two things you might try:
 * introduce a local variable for afile[i]
 * replace the for() loop with two nested ones, keeping
  calls to other functions out of the hot spot:

 for (i=0; i < file_len;)
 {
  /* hot spot: */
  for(; i < file_len; i++)
  {
    curFile = afile[i];
    if (curFile->chunk==-1)
      curFile->chunk = 0;
    else if (curFile->curp != curFile->endp -1)
      curFile->curp++;
    else
      break;
  }

  if (i < file_len)
  {
    /* the complex, rarely used stuff goes here */
  }
 }

Ok, when I have some time, I will experiment a little bit with
inline, and see if the compiler gets it (I'll try your nested for
loop example (the corrected one, which you just sent in the other
mail) to help the compiler a little bit). I should be able to easily
compare that with the macro version.

-- 
Johan


Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

2010-12-15 Thread Johan Corveleyn
On Wed, Dec 15, 2010 at 11:53 AM, Johan Corveleyn jcor...@gmail.com wrote:
 On Wed, Dec 15, 2010 at 11:50 AM, Johan Corveleyn jcor...@gmail.com wrote:
 Also, I did my measurements with a blame operation on this very large
 file, which has ~2500 revisions. So that's 2500 diffs of a ~1,5 Mb
 file, with say on average 1 Mb of identical prefix/suffix every time.
 That's some 2,500,000,000 function calls.

 That should be: 5,000,000,000 function calls, because for every diff
 we advance prefix and suffix pointers in two files: the original one
 and the modified one.

Damn. Scratch that last mail. I'm confused. It's 2,500,000,000
function calls (each of which advances the two pointers in a for
loop).

(that's sleep-deprivation in action (sick child etc ...))
-- 
Johan


Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

2010-12-20 Thread Johan Corveleyn
On Mon, Dec 20, 2010 at 11:03 AM, Julian Foad julian.f...@wandisco.com wrote:
 On Mon, 2010-12-20, Johan Corveleyn wrote:
 New macro version (increment only, decrement is similar):
 [[[
 /* For all files in the FILE array, increment the curp pointer.  If a file
  * points before the beginning of file, let it point at the first byte again.
  * If the end of the current chunk is reached, read the next chunk in the
  * buffer and point curp to the start of the chunk.  If EOF is reached, set
  * curp equal to endp to indicate EOF. */
 #define increment_pointers(all_files, files_len, pool)                       \
   do {                                                                        \
     int i;                                                                    \
                                                                               \
     for (i = 0; i < files_len; i++)                                           \
     {                                                                         \
       if (all_files[i]->chunk == -1) /* indicates before beginning of file */ \
         all_files[i]->chunk = 0;     /* point to beginning of file again */   \
       else if (all_files[i]->curp != all_files[i]->endp - 1)                  \
         all_files[i]->curp++;                                                 \

 Hi Johan.

 Here you are having to test for two special cases every time: chunk==-1
 and curp==endp-1.  I would suggest changing the specification of before
 beginning of file to include the promise that curp==endp-1, so that you
 don't have to use a separate test here and can instead test for this
 special case within increment_chunk().

Good idea. I'll try this tonight ...

Cheers,
-- 
Johan


Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

2010-12-20 Thread Johan Corveleyn
On Mon, Dec 20, 2010 at 2:16 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Mon, Dec 20, 2010 at 11:03 AM, Julian Foad julian.f...@wandisco.com 
 wrote:
 On Mon, 2010-12-20, Johan Corveleyn wrote:
 New macro version (increment only, decrement is similar):
 [[[
 /* For all files in the FILE array, increment the curp pointer.  If a file
  * points before the beginning of file, let it point at the first byte 
 again.
  * If the end of the current chunk is reached, read the next chunk in the
  * buffer and point curp to the start of the chunk.  If EOF is reached, set
  * curp equal to endp to indicate EOF. */
 #define increment_pointers(all_files, files_len, pool)                       \
   do {                                                                        \
     int i;                                                                    \
                                                                               \
     for (i = 0; i < files_len; i++)                                           \
     {                                                                         \
       if (all_files[i]->chunk == -1) /* indicates before beginning of file */ \
         all_files[i]->chunk = 0;     /* point to beginning of file again */   \
       else if (all_files[i]->curp != all_files[i]->endp - 1)                  \
         all_files[i]->curp++;                                                 \

 Hi Johan.

 Here you are having to test for two special cases every time: chunk==-1
 and curp==endp-1.  I would suggest changing the specification of before
 beginning of file to include the promise that curp==endp-1, so that you
 don't have to use a separate test here and can instead test for this
 special case within increment_chunk().

 Good idea. I'll try this tonight ...

Ok, this worked pretty well. It's a little bit faster (about 1 second,
which seems right intuitively).

One style question though: should these macros be all upper case
(INCREMENT_POINTERS and DECREMENT_POINTERS)? I guess so.

The code now looks as follows (in attachment a patch relative to
diff-optimizations-bytes branch):

[[[
/* For all files in the FILE array, increment the curp pointer.  If a file
 * points before the beginning of file, let it point at the first byte again.
 * If the end of the current chunk is reached, read the next chunk in the
 * buffer and point curp to the start of the chunk.  If EOF is reached, set
 * curp equal to endp to indicate EOF. */
#define increment_pointers(all_files, files_len, pool)               \
  do {                                                               \
    int i;                                                           \
                                                                     \
    for (i = 0; i < files_len; i++)                                  \
    {                                                                \
      if (all_files[i]->curp < all_files[i]->endp - 1)               \
        all_files[i]->curp++;                                        \
      else                                                           \
        SVN_ERR(increment_chunk(all_files[i], pool));                \
    }                                                                \
  } while (0)


/* For all files in the FILE array, decrement the curp pointer.  If the
 * start of a chunk is reached, read the previous chunk in the buffer and
 * point curp to the last byte of the chunk.  If the beginning of a FILE is
 * reached, set chunk to -1 to indicate BOF. */
#define decrement_pointers(all_files, files_len, pool)               \
  do {                                                               \
    int i;                                                           \
                                                                     \
    for (i = 0; i < files_len; i++)                                  \
    {                                                                \
      if (all_files[i]->curp > all_files[i]->buffer)                 \
        all_files[i]->curp--;                                        \
      else                                                           \
        SVN_ERR(decrement_chunk(all_files[i], pool));                \
    }                                                                \
  } while (0)


static svn_error_t *
increment_chunk(struct file_info *file, apr_pool_t *pool)
{
  apr_off_t length;
  apr_off_t last_chunk = offset_to_chunk(file->size);

  if (file->chunk == -1)
    {
      /* We are at BOF (Beginning Of File). Point to first chunk/byte again. */
      file->chunk = 0;
      file->curp = file->buffer;
    }
  else if (file->chunk == last_chunk)
    {
      /* We are at the last chunk. Indicate EOF by setting curp == endp. */
      file->curp = file->endp

Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

2010-12-28 Thread Johan Corveleyn
On Fri, Dec 24, 2010 at 3:40 PM, Stefan Fuhrmann eq...@web.de wrote:
 On 20.12.2010 02:43, Johan Corveleyn wrote:

 On Wed, Dec 15, 2010 at 10:58 AM, Stefan Fuhrmanneq...@web.de  wrote:

 On 15.12.2010 02:30, Stefan Fuhrmann wrote:

 On 14.12.2010 23:35, Johan Corveleyn wrote:

 Thoughts?

 Two things you might try:
 * introduce a local variable for afile[i]
 * replace the for() loop with two nested ones, keeping
  calls to other functions out of the hot spot:

 for (i=0; i < file_len;)

 That should read:
 for (i=0; i < file_len; i++)

 {
  /* hot spot: */
  for(; i < file_len; i++)
  {
    curFile = afile[i];
    if (curFile->chunk==-1)
      curFile->chunk = 0;
    else if (curFile->curp != curFile->endp -1)
      curFile->curp++;
    else
      break;
  }

  if (i < file_len)
  {
    /* the complex, rarely used stuff goes here */
  }
 }

 Ok, I tried this, but it didn't really help. It's still only about as
 fast as before. While the macro version is about 10% faster (I made a
 cleaner macro version, only macro'ing the hotspot stuff, with a
 function call to the complex, rarely used stuff).

 For the inline version, I tried several variations (always with
 APR_INLINE in the function signature, and with local variable for
 file[i]): with or without the nested for loop, with or without the
 complex stuff factored out to a separate function, ... it made no
 difference.

 Maybe I'm still doing something wrong, keeping the compiler from
 inlining it (but I'm not really a compiler expert, maybe I have to add
 some compiler options or something, ...). OTOH, I now have a macro
 implementation which is quite readable (and which uses the do ...
 while(0) construct - thanks Peter), and which does the trick. So maybe
 I should simply stick with that option ...

 The key factor here is that file_len is only 2.
 Basically, that will make my optimization a
 pessimization.

 Assuming that for most calls, curp of *both*
 files will be somewhere *between* BOF and
 EOF, unrolling the loop may help:

 #define INCREMENT_POINTERS

 /* special code for the common case. 10 .. 12 ticks per execution */

 static APR_INLINE svn_boolean_t
 quickly_increment_pointers(struct file_info *afile[])
 {
 struct file_info *file0 = afile[0];
 struct file_info *file1 = afile[1];
 if ((file0->chunk != -1) && (file1->chunk != -1))
  {
    /* suitable_type */ nextp0 = file0->curp + 1;
    /* suitable_type */ nextp1 = file1->curp + 1;
    if ((nextp0 != file0->endp) && (nextp1 != file1->endp))
      {
        file0->curp = nextp0;
        file1->curp = nextp1;
        return TRUE;
      }
  }
 return FALSE;
 }

 /* slow code covering all special cases */

 static svn_error_t*
 slowly_increment_pointers(struct file_info *file[], apr_size_t file_len,
 apr_pool_t *pool)
 {
  for (i = 0; i < file_len; i++)
 ...
 }

 /* maybe as macro: */
 return ((file_len == 2)  quickly_increment_pointers (file))
  ? SVN_NO_ERROR
  : slowly_increment_pointers(file, file_len, pool);

Nice! I will try this out sometime soon, but I'm not so sure it will
help more than the current macro version (only macro for the critical
section, calling a function for increment/decrement chunk - see
diff_file.c in HEAD of diff-optimizations-bytes branch). I guess I'll
just have to test it.

With the macro implementation in place my gut feeling is that the
prefix/suffix scanning is reasonably optimal, and that the remaining
time spent in the diff code (70 seconds for my example blame of big
xml file with 2200 changes) is almost all due to the original diff
algorithm (working on the remainder between prefix and suffix). But to
be sure I should probably measure that first.

(I should note here that, with my example file, the prefix/suffix
scanning doesn't always work optimally, because in about half of the
revisions there is a change both in one of the first lines (where an
author name is inserted automatically by XML Spy upon saving it), and
then somewhere in the middle where the actual change is done. It's
really a pity, because that means about 2-3 lines cannot be
skipped as prefix.)

 Minor remark on C style:

 static APR_INLINE svn_error_t *
 increment_pointers(struct file_info *file[], int file_len, apr_pool_t
 *pool)

 file_len should be apr_size_t unless it can assume negative
 values in which case it should be apr_ssize_t. That way, you
 know for sure it will never overflow. Int is potentially too short
 (although extremely unlikely in this case) and it is signed
 (often not appropriate for an index value).

Thanks. I didn't know that. I'll change it.

 One more thing that might help speed up your code. The fact
 that the calling overhead of a single  simple function turns
 out to be a significant contribution to the overall runtime tells
 us that the execution is tightly CPU-bound and squeezing
 some extra cycles out of it could help. Try

 struct file_info file[] instead of struct file_info *file[]

 i.e. array of structs instead of array of pointers. The boring
 background

Re: FSFS format 6 (was: Re: FSv2)

2010-12-28 Thread Johan Corveleyn
On Sun, Dec 12, 2010 at 4:23 PM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 On 19.10.2010 15:10, Daniel Shahaf wrote:

 Greg Stein wrote on Tue, Oct 19, 2010 at 04:31:42 -0400:

 Personally, I see [FSv2] as a broad swath of API changes to align our
 needs with the underlying storage. Trowbridge noted that our current
 API makes it *really* difficult to implement an effective backend. I'd
 also like to see a backend that allows for parallel PUTs during the
 commit process. Hyrum sees FSv2 as some kind of super-key-value
 storage with layers on top, allowing for various types of high-scaling
 mechanisms.

 At the retreat, stefan2 also had some thoughts about this...

 [This is just a brain-dump for 1.8+]

 While working on the performance branch I made some
 observations concerning the way FSFS organizes data
 and how that could be changed for reduced I/O overhead.

 notes/fsfs-improvements.txt contains a summary of what
 could be done to improve FSFS before FS-NG. A later
 FS-NG implementation should then still benefit from the
 improvements.

+(number of fopen calls during a log operation)

I like this proposal a lot. As I already told before, we are running
our FSFS back-end on a SAN over NFS (and I suspect we're not the only
company doing this). In this environment, the server-side I/O of SVN
(especially the amount of random reads and fopen calls during e.g.
log) is often the major bottleneck.

There is one question going around in my head though: won't you have
to change/rearrange a lot of the FS layer code (and maybe repos
layer?) to benefit from this new format?

The current code is written in a certain way, not particularly
optimized for this new format (I seem to remember log does around 10
fopen calls for every interesting rev file, each time reading a
different part of it). Also, if an operation currently needs to access
many revisions (like log or blame), it doesn't take advantage at all
of the fact that they might be in a single packed rev file. The pack
file is opened and seeked in just as much as the sum of the individual
rev files.

So: how will the current code be able to take advantage of this new
format? Won't this require a major effort to restructure that code?

(This reminds me of the current difficulty (as I can see it, as an
innocent bystander) with the WC-NG rewrite: theoretically it should be
very fast, but the higher level code is still largely based upon the
old principles. So to take advantage of it, certain things have to be
changed at the higher level, making operations work dir-based or
tree-based, instead of file-based etc).

Cheers,
-- 
Johan


Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

2010-12-30 Thread Johan Corveleyn
On Tue, Dec 28, 2010 at 7:31 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Fri, Dec 24, 2010 at 3:40 PM, Stefan Fuhrmann eq...@web.de wrote:
 On 20.12.2010 02:43, Johan Corveleyn wrote:

 On Wed, Dec 15, 2010 at 10:58 AM, Stefan Fuhrmanneq...@web.de  wrote:

 On 15.12.2010 02:30, Stefan Fuhrmann wrote:

 On 14.12.2010 23:35, Johan Corveleyn wrote:

 Thoughts?

 Two things you might try:
 * introduce a local variable for afile[i]
 * replace the for() loop with two nested ones, keeping
  calls to other functions out of the hot spot:

 for (i=0; i < file_len;)

 That should read:
 for (i=0; i < file_len; i++)

 {
  /* hot spot: */
  for(; i < file_len; i++)
  {
    curFile = afile[i];
    if (curFile->chunk==-1)
      curFile->chunk = 0;
    else if (curFile->curp != curFile->endp -1)
      curFile->curp++;
    else
      break;
  }

  if (i < file_len)
  {
    /* the complex, rarely used stuff goes here */
  }
 }

 Ok, I tried this, but it didn't really help. It's still only about as
 fast as before. While the macro version is about 10% faster (I made a
 cleaner macro version, only macro'ing the hotspot stuff, with a
 function call to the complex, rarely used stuff).

 For the inline version, I tried several variations (always with
 APR_INLINE in the function signature, and with local variable for
 file[i]): with or without the nested for loop, with or without the
 complex stuff factored out to a separate function, ... it made no
 difference.

 Maybe I'm still doing something wrong, keeping the compiler from
 inlining it (but I'm not really a compiler expert, maybe I have to add
 some compiler options or something, ...). OTOH, I now have a macro
 implementation which is quite readable (and which uses the do ...
 while(0) construct - thanks Peter), and which does the trick. So maybe
 I should simply stick with that option ...

 The key factor here is that file_len is only 2.
 Basically, that will make my optimization a
 pessimization.

 Assuming that for most calls, curp of *both*
 files will be somewhere *between* BOF and
 EOF, unrolling the loop may help:

 #define INCREMENT_POINTERS

 /* special code for the common case. 10 .. 12 ticks per execution */

 static APR_INLINE svn_boolean_t
 quickly_increment_pointers(struct file_info *afile[])
 {
 struct file_info *file0 = afile[0];
 struct file_info *file1 = afile[1];
 if ((file0->chunk != -1) && (file1->chunk != -1))
  {
    /* suitable_type */ nextp0 = file0->curp + 1;
    /* suitable_type */ nextp1 = file1->curp + 1;
    if ((nextp0 != file0->endp) && (nextp1 != file1->endp))
      {
        file0->curp = nextp0;
        file1->curp = nextp1;
        return TRUE;
      }
  }
 return FALSE;
 }

 /* slow code covering all special cases */

 static svn_error_t*
 slowly_increment_pointers(struct file_info *file[], apr_size_t file_len,
 apr_pool_t *pool)
 {
  for (i = 0; i < file_len; i++)
 ...
 }

 /* maybe as macro: */
 return ((file_len == 2)  quickly_increment_pointers (file))
  ? SVN_NO_ERROR
  : slowly_increment_pointers(file, file_len, pool);

 Nice! I will try this out sometime soon, but I'm not so sure it will
 help more than the current macro version (only macro for the critical
 section, calling a function for increment/decrement chunk - see
 diff_file.c in HEAD of diff-optimizations-bytes branch). I guess I'll
 just have to test it.

 With the macro implementation in place my gut feeling is that the
 prefix/suffix scanning is reasonably optimal, and that the remaining
 time spent in the diff code (70 seconds for my example blame of big
 xml file with 2200 changes) is almost all due to the original diff
 algorithm (working on the remainder between prefix and suffix). But to
 be sure I should probably measure that first.

Hmmm, my gut feeling seems to be a little bit off. I measured this,
and prefix/suffix scanning is still taking 46 - 50 seconds of the
total 70 seconds spent in diff-time during the blame of my example
file (~20 seconds are going to getting tokens and inserting them into
the token tree, and ~5 seconds in the actual LCS algorithm). So
optimizations here will probably still be very useful.

I'm going to let this rest for a while, until I get the prefix/suffix
scanning work for diff3 and diff4. After I get that all working, I
might revisit this for more optimization ...

Cheers,
-- 
Johan


diff-optimizations-bytes: how to make diff3 work with prefix/suffix scanning

2011-01-01 Thread Johan Corveleyn
[ Taking a privately-started discussion with danielsh to the list, in
case others have inspiration/insight about this. Question at hand: I'm
having trouble making diff3 work with prefix/suffix scanning of the
diff-optimizations-bytes branch. Any feedback is highly appreciated
:-). ]

On Fri, Dec 31, 2010 at 2:38 PM, Daniel Shahaf d...@daniel.shahaf.name wrote:
[...snip...]
 In diff3 with prefix enabled, I was seeing failures like this:

 EXPECTED:
 line1
 <<<<<<<
 line2
 =======
 line2 modified
 >>>>>>>

 ACTUAL:
 <<<<<<<
 line1
 line2
 =======
 line1
 line2 modified
 >>>>>>>

 so I assumed that's when the prefix code gobbled the shared prefix.  How
 to fix it from there was purely guesswork on my part (and I haven't
 implemented it).

 These failures would go away if I prepended a line0 left and line0
 right to the file.

Yes, I know. That's indeed because the prefix-scanning code gobbled up
the identical prefix. That problem is also there in diff2.

In diff2, I fixed this by simply pre-pending a piece of common diff
to the diff linked list, in the function diff.c#svn_diff__diff.

From r1037371:
[[[
--- subversion/branches/diff-optimizations-bytes/subversion/libsvn_diff/diff.c
(original)
+++ subversion/branches/diff-optimizations-bytes/subversion/libsvn_diff/diff.c
Sun Nov 21 02:36:07 2010
@@ -43,6 +43,22 @@ svn_diff__diff(svn_diff__lcs_t *lcs,
  svn_diff_t *diff;
  svn_diff_t **diff_ref = &diff;

+  if (want_common && (original_start > 1))
+    {
+      /* we have a prefix to skip */
+      (*diff_ref) = apr_palloc(pool, sizeof(**diff_ref));
+
+      (*diff_ref)->type = svn_diff__type_common;
+      (*diff_ref)->original_start = 0;
+      (*diff_ref)->original_length = original_start - 1;
+      (*diff_ref)->modified_start = 0;
+      (*diff_ref)->modified_length = modified_start - 1;
+      (*diff_ref)->latest_start = 0;
+      (*diff_ref)->latest_length = 0;
+
+      diff_ref = &(*diff_ref)->next;
+    }
+
   while (1)
     {
       if (original_start < lcs->position[0]->offset
]]]

Naively, I wanted to do the same for diff3 (along with some other
tricks, to make the algorithm aware of the prefix_lines), but it
doesn't work. See the patch in attachment.

I mean: it works partly: diff-diff3-test.exe passes with the attached
patch, but merge_reintegrate_tests.py (1, 2, 10, 13 and 14) and
merge_tests.py (61) don't. I haven't studied the test failures in
detail (I probably should, but I currently can't wrap my head around
those tests).

I'm now pondering another approach, to pre-pend an lcs piece to the
lcs linked list, and let the rest of the algorithm do its work as
normal. Inspiration for this came when I was reading the code of
diff3.c#svn_diff__resolve_conflict, where a similar trick is used
around line 121:
[[[
/* If there were matching symbols at the start of
 * both sequences, record that fact.
 */
if (common_length > 0)
  {
    lcs = apr_palloc(subpool, sizeof(*lcs));
    lcs->next = NULL;
    lcs->position[0] = start_position[0];
    lcs->position[1] = start_position[1];
    lcs->length = common_length;

    lcs_ref = &lcs->next;
  }
]]]

Maybe I can even integrate this logic into lcs.c#svn_diff__lcs, where
I'm already passing the prefix_lines as a new parameter. If that would
work out, I could probably remove the special common-diff-injecting
code in diff.c#svn_diff__diff.
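
For what it's worth, a very rough sketch of the kind of thing I have in
mind inside svn_diff__lcs (the position_list[] / pool names are just
placeholders for whatever is actually in scope there; getting the
offsets right is exactly the part I still have to work out):

[[[
  /* Sketch only: prepend a "common" lcs piece covering the identical
     prefix to the lcs chain computed by the normal algorithm. */
  if (prefix_lines > 0)
    {
      svn_diff__lcs_t *prefix_lcs = apr_palloc(pool, sizeof(*prefix_lcs));

      prefix_lcs->position[0] = position_list[0];   /* start of sequence 0 */
      prefix_lcs->position[1] = position_list[1];   /* start of sequence 1 */
      prefix_lcs->length = prefix_lines;
      prefix_lcs->next = lcs;    /* in front of the computed chain */

      lcs = prefix_lcs;
    }
]]]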

Thoughts?

It will take me some time/experimentation to work out the details, but
I think that should work ...

Cheers,
-- 
Johan
Index: subversion/libsvn_diff/diff3.c
===================================================================
--- subversion/libsvn_diff/diff3.c  (revision 1041582)
+++ subversion/libsvn_diff/diff3.c  (working copy)
@@ -251,10 +251,14 @@ svn_diff_diff3(svn_diff_t **diff,
 {
   svn_diff__tree_t *tree;
   svn_diff__position_t *position_list[3];
+  svn_diff_datasource_e datasource[] = {svn_diff_datasource_original,
+                                        svn_diff_datasource_modified,
+                                        svn_diff_datasource_latest};
   svn_diff__lcs_t *lcs_om;
   svn_diff__lcs_t *lcs_ol;
   apr_pool_t *subpool;
   apr_pool_t *treepool;
+  apr_off_t prefix_lines = 0;
 
   *diff = NULL;
 
@@ -263,28 +267,30 @@ svn_diff_diff3(svn_diff_t **diff,
 
   svn_diff__tree_create(&tree, treepool);
 
+  SVN_ERR(vtable->datasources_open(diff_baton, &prefix_lines, datasource, 3));
+
   SVN_ERR(svn_diff__get_tokens(&position_list[0],
                                tree,
                                diff_baton, vtable,
                                svn_diff_datasource_original,
-                               FALSE,
-                               0,
+                               TRUE,
+                               prefix_lines,
                                subpool));
 
   SVN_ERR(svn_diff__get_tokens(&position_list[1],
                                tree,
                                diff_baton, vtable,
                                svn_diff_datasource_modified,
-   

Re: svn commit: r1054286 - /subversion/trunk/subversion/libsvn_delta/xdelta.c

2011-01-02 Thread Johan Corveleyn
On Sat, Jan 1, 2011 at 9:33 PM,  stef...@apache.org wrote:
 Author: stefan2
 Date: Sat Jan  1 20:33:22 2011
 New Revision: 1054286

 URL: http://svn.apache.org/viewvc?rev=1054286&view=rev
 Log:
 XDelta calculation is major part of svn_txdelta_send_txstream.
 Therefore, speed up string matching code and the relatively
 expensive hash-code (adler32) calculations.

 * subversion/libsvn_delta/xdelta.c
  (init_adler32): init adler checksum with 0 instead of 1;
   use svn_adler32 for performance
  (adler32_out): -1 can / must be omitted now that we
   start at 0 instead of 1 for s1.
  (adler32_replace): special, faster implementation replacing
   a adler32_out(), adler32_in() sequence

  (match_length): new utility function
  (find_match): speed up the main loop by calling match_length()
  (compute_delta): optimize adler32 calculations

 Modified:
    subversion/trunk/subversion/libsvn_delta/xdelta.c

Hi Stefan,

This makes my (Windows 32-bit VCE 2008) build fail:
svn_delta-1.lib(xdelta.obj) : error LNK2019: unresolved external
symbol _svn__adler32 referenced in function _init_adler32
..\..\..\Release\subversion\libsvn_delta\libsvn_delta-1.dll : fatal
error LNK1120: 1 unresolved externals

Johan

 Modified: subversion/trunk/subversion/libsvn_delta/xdelta.c
 URL: 
 http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_delta/xdelta.c?rev=1054286&r1=1054285&r2=1054286&view=diff
 ==
 --- subversion/trunk/subversion/libsvn_delta/xdelta.c (original)
 +++ subversion/trunk/subversion/libsvn_delta/xdelta.c Sat Jan  1 20:33:22 2011
 @@ -29,6 +29,8 @@

  #include "svn_delta.h"
  #include "delta.h"
 +
 +#include "private/svn_adler32.h"

  /* This is pseudo-adler32. It is adler32 without the prime modulus.
    The idea is borrowed from monotone, and is a translation of the C++
 @@ -39,6 +41,10 @@
  #define ADLER32_MASK      0x0000ffff
  #define ADLER32_CHAR_MASK 0x000000ff

 +/* Size of the blocks we compute checksums for. This was chosen out of
 +   thin air.  Monotone used 64, xdelta1 used 64, rsync uses 128.  */
 +#define MATCH_BLOCKSIZE 64
 +
  /* Structure to store the state of our adler32 checksum.  */
  struct adler32
  {
 @@ -66,11 +72,32 @@ adler32_out(struct adler32 *ad, const ch
  {
   ad->s1 -= ((apr_uint32_t) (c)) & ADLER32_CHAR_MASK;
   ad->s1 &= ADLER32_MASK;
 -  ad->s2 -= (ad->len * (((apr_uint32_t) c) & ADLER32_CHAR_MASK)) + 1;
 +  ad->s2 -= (ad->len * (((apr_uint32_t) c) & ADLER32_CHAR_MASK));
   ad->s2 &= ADLER32_MASK;
   --ad->len;
  }

 +/* Feed C_IN into the adler32 checksum and remove C_OUT at the same time.
 + * This function may (and will) only be called for
 + * ad->len == MATCH_BLOCKSIZE.
 + */
 +static APR_INLINE void
 +adler32_replace(struct adler32 *ad, const char c_out, const char c_in)
 +{
 +  apr_uint32_t s1 = ad->s1;
 +  apr_uint32_t s2 = ad->s2;
 +
 +  s2 -= (MATCH_BLOCKSIZE * (((apr_uint32_t) c_out) & ADLER32_CHAR_MASK));
 +
 +  s1 -= ((apr_uint32_t) (c_out)) & ADLER32_CHAR_MASK;
 +  s1 += ((apr_uint32_t) (c_in)) & ADLER32_CHAR_MASK;
 +
 +  s2 += s1;
 +
 +  ad->s1 = s1 & ADLER32_MASK;
 +  ad->s2 = s2 & ADLER32_MASK;
 +}
 +
  /* Return the current adler32 checksum in the adler32 structure.  */

  static APR_INLINE apr_uint32_t
 @@ -85,18 +112,15 @@ adler32_sum(const struct adler32 *ad)
  static APR_INLINE struct adler32 *
  init_adler32(struct adler32 *ad, const char *data, apr_uint32_t datalen)
  {
 -  ad->s1 = 1;
 -  ad->s2 = 0;
 -  ad->len = 0;
 -  while (datalen--)
 -    adler32_in(ad, *(data++));
 +  apr_uint32_t adler32 = svn__adler32(0, data, datalen);
 +
 +  ad->s1 = adler32 & ADLER32_MASK;
 +  ad->s2 = (adler32 >> 16) & ADLER32_MASK;
 +  ad->len = datalen;
 +
   return ad;
  }

 -/* Size of the blocks we compute checksums for. This was chosen out of
 -   thin air.  Monotone used 64, xdelta1 used 64, rsync uses 128.  */
 -#define MATCH_BLOCKSIZE 64
 -
  /* Information for a block of the delta source.  The length of the
    block is the smaller of MATCH_BLOCKSIZE and the difference between
    the size of the source data and the position of this block. */
 @@ -201,6 +225,35 @@ init_blocks_table(const char *data,
     }
  }

 +/* Return the lowest position at which A and B differ. If no difference
 + * can be found in the first MAX_LEN characters, MAX_LEN will be returned.
 + */
 +static apr_size_t
 +match_length(const char *a, const char *b, apr_size_t max_len)
 +{
 +  apr_size_t pos = 0;
 +
 +#if SVN_UNALIGNED_ACCESS_IS_OK
 +
 +  /* Chunky operation is so much faster ...
 +   *
 +   * We can't make this work on architectures that require aligned access
 +   * because A and B will probably have different alignment. So, skipping
 +   * the first few chars until alignment is reached is not an option.
 +   */
 +  for (; pos + sizeof(apr_size_t) <= max_len; pos += sizeof(apr_size_t))
 +    if (*(const apr_size_t*)(a + pos) != *(const apr_size_t*)(b + pos))
 +      break;
 +
 +#endif
 +
 +  for (; pos < max_len; ++pos)
 +    if (a[pos] != b[pos])

Re: svn commit: r1054286 - /subversion/trunk/subversion/libsvn_delta/xdelta.c

2011-01-02 Thread Johan Corveleyn
On Sun, Jan 2, 2011 at 9:20 AM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Jan 1, 2011 at 9:33 PM,  stef...@apache.org wrote:
 Author: stefan2
 Date: Sat Jan  1 20:33:22 2011
 New Revision: 1054286

 URL: http://svn.apache.org/viewvc?rev=1054286view=rev
 Log:
 XDelta calculation is major part of svn_txdelta_send_txstream.
 Therefore, speed up string matching code and the relatively
 expensive hash-code (adler32) calculations.

 * subversion/libsvn_delta/xdelta.c
  (init_adler32): init adler checksum with 0 instead of 1;
   use svn_adler32 for performance
  (adler32_out): -1 can / must be ommitted now that we
   start at 0 instead of 1 for s1.
  (adler32_replace): special, faster implementation replacing
   a adler32_out(), adler32_in() sequence

  (match_length): new utility function
  (find_match): speed up the main loop by calling match_length()
  (compute_delta): optimize adler32 calculations

 Modified:
    subversion/trunk/subversion/libsvn_delta/xdelta.c

 Hi Stefan,

 This makes my (Windows 32-bit VCE 2008) build fail:
 svn_delta-1.lib(xdelta.obj) : error LNK2019: unresolved external
 symbol _svn__adler32 referenced in function _init_adler32
 ..\..\..\Release\subversion\libsvn_delta\libsvn_delta-1.dll : fatal
 error LNK1120: 1 unresolved externals

It was actually r1054277 that broke it. The following patch fixes it:
[[[
Index: build.conf
===
--- build.conf  (revision 1054417)
+++ build.conf  (working copy)
@@ -329,7 +329,7 @@ msvc-export =
 private\svn_log.h private\svn_mergeinfo_private.h
 private\svn_opt_private.h private\svn_skel.h private\svn_sqlite.h
 private\svn_utf_private.h private\svn_eol_private.h
-private\svn_token.h
+private\svn_token.h private\svn_adler32.h

 # Working copy management lib
 [libsvn_wc]
]]]

Cheers,
-- 
Johan


Re: svn commit: r1054286 - /subversion/trunk/subversion/libsvn_delta/xdelta.c

2011-01-02 Thread Johan Corveleyn
On Sat, Jan 1, 2011 at 9:33 PM,  stef...@apache.org wrote:
 Author: stefan2
 Date: Sat Jan  1 20:33:22 2011
 New Revision: 1054286

 URL: http://svn.apache.org/viewvc?rev=1054286view=rev
 Log:
 XDelta calculation is major part of svn_txdelta_send_txstream.
 Therefore, speed up string matching code and the relatively
 expensive hash-code (adler32) calculations.

 * subversion/libsvn_delta/xdelta.c
  (init_adler32): init adler checksum with 0 instead of 1;
   use svn_adler32 for performance
  (adler32_out): -1 can / must be ommitted now that we
   start at 0 instead of 1 for s1.
  (adler32_replace): special, faster implementation replacing
   a adler32_out(), adler32_in() sequence

  (match_length): new utility function
  (find_match): speed up the main loop by calling match_length()
  (compute_delta): optimize adler32 calculations

 Modified:
    subversion/trunk/subversion/libsvn_delta/xdelta.c

 Modified: subversion/trunk/subversion/libsvn_delta/xdelta.c
 URL: 
 http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_delta/xdelta.c?rev=1054286r1=1054285r2=1054286view=diff
 ==
 --- subversion/trunk/subversion/libsvn_delta/xdelta.c (original)
 +++ subversion/trunk/subversion/libsvn_delta/xdelta.c Sat Jan  1 20:33:22 2011
 @@ -29,6 +29,8 @@

   #include "svn_delta.h"
   #include "delta.h"
  +
  +#include "private/svn_adler32.h"

  /* This is pseudo-adler32. It is adler32 without the prime modulus.
    The idea is borrowed from monotone, and is a translation of the C++
 @@ -39,6 +41,10 @@
   #define ADLER32_MASK      0x0000ffff
   #define ADLER32_CHAR_MASK 0x000000ff

 +/* Size of the blocks we compute checksums for. This was chosen out of
 +   thin air.  Monotone used 64, xdelta1 used 64, rsync uses 128.  */
 +#define MATCH_BLOCKSIZE 64
 +
  /* Structure to store the state of our adler32 checksum.  */
  struct adler32
  {
 @@ -66,11 +72,32 @@ adler32_out(struct adler32 *ad, const ch
  {
    ad->s1 -= ((apr_uint32_t) (c)) & ADLER32_CHAR_MASK;
    ad->s1 &= ADLER32_MASK;
  -  ad->s2 -= (ad->len * (((apr_uint32_t) c) & ADLER32_CHAR_MASK)) + 1;
  +  ad->s2 -= (ad->len * (((apr_uint32_t) c) & ADLER32_CHAR_MASK));
    ad->s2 &= ADLER32_MASK;
    --ad->len;
  }

 +/* Feed C_IN into the adler32 checksum and remove C_OUT at the same time.
 + * This function may (and will) only be called for
  + * ad->len == MATCH_BLOCKSIZE.
 + */
 +static APR_INLINE void
 +adler32_replace(struct adler32 *ad, const char c_out, const char c_in)
 +{
  +  apr_uint32_t s1 = ad->s1;
  +  apr_uint32_t s2 = ad->s2;
  +
  +  s2 -= (MATCH_BLOCKSIZE * (((apr_uint32_t) c_out) & ADLER32_CHAR_MASK));
  +
  +  s1 -= ((apr_uint32_t) (c_out)) & ADLER32_CHAR_MASK;
  +  s1 += ((apr_uint32_t) (c_in)) & ADLER32_CHAR_MASK;
  +
  +  s2 += s1;
  +
  +  ad->s1 = s1 & ADLER32_MASK;
  +  ad->s2 = s2 & ADLER32_MASK;
 +}
 +
  /* Return the current adler32 checksum in the adler32 structure.  */

  static APR_INLINE apr_uint32_t
 @@ -85,18 +112,15 @@ adler32_sum(const struct adler32 *ad)
  static APR_INLINE struct adler32 *
  init_adler32(struct adler32 *ad, const char *data, apr_uint32_t datalen)
  {
  -  ad->s1 = 1;
  -  ad->s2 = 0;
  -  ad->len = 0;
  -  while (datalen--)
  -    adler32_in(ad, *(data++));
  +  apr_uint32_t adler32 = svn__adler32(0, data, datalen);
  +
  +  ad->s1 = adler32 & ADLER32_MASK;
  +  ad->s2 = (adler32 >> 16) & ADLER32_MASK;
  +  ad->len = datalen;
 +
   return ad;
  }

 -/* Size of the blocks we compute checksums for. This was chosen out of
 -   thin air.  Monotone used 64, xdelta1 used 64, rsync uses 128.  */
 -#define MATCH_BLOCKSIZE 64
 -
  /* Information for a block of the delta source.  The length of the
    block is the smaller of MATCH_BLOCKSIZE and the difference between
    the size of the source data and the position of this block. */
 @@ -201,6 +225,35 @@ init_blocks_table(const char *data,
     }
  }

 +/* Return the lowest position at which A and B differ. If no difference
 + * can be found in the first MAX_LEN characters, MAX_LEN will be returned.
 + */
 +static apr_size_t
 +match_length(const char *a, const char *b, apr_size_t max_len)
 +{
 +  apr_size_t pos = 0;
 +
 +#if SVN_UNALIGNED_ACCESS_IS_OK
 +
 +  /* Chunky operation is so much faster ...
 +   *
 +   * We can't make this work on architectures that require aligned access
 +   * because A and B will probably have different alignment. So, skipping
 +   * the first few chars until alignment is reached is not an option.
 +   */
  +  for (; pos + sizeof(apr_size_t) <= max_len; pos += sizeof(apr_size_t))
 +    if (*(const apr_size_t*)(a + pos) != *(const apr_size_t*)(b + pos))
 +      break;
 +
 +#endif
 +
  +  for (; pos < max_len; ++pos)
 +    if (a[pos] != b[pos])
 +      break;
 +
 +  return pos;
 +}
 +
  /* Try to find a match for the target data B in BLOCKS, and then
    extend the match as long as data in A and B at the match position
    continues to match.  We set the position in a we ended up in (in
 @@ -229,6 +282,8 @@ find_match(const struct blocks 

Re: svn commit: r1054286 - /subversion/trunk/subversion/libsvn_delta/xdelta.c

2011-01-02 Thread Johan Corveleyn
On Sun, Jan 2, 2011 at 4:45 PM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 On 02.01.2011 16:38, Johan Corveleyn wrote:

 On Sun, Jan 2, 2011 at 9:20 AM, Johan Corveleynjcor...@gmail.com  wrote:

 On Sat, Jan 1, 2011 at 9:33 PM,stef...@apache.org  wrote:

 Author: stefan2
 Date: Sat Jan  1 20:33:22 2011
 New Revision: 1054286

 URL: http://svn.apache.org/viewvc?rev=1054286view=rev
 Log:
 XDelta calculation is major part of svn_txdelta_send_txstream.
 Therefore, speed up string matching code and the relatively
 expensive hash-code (adler32) calculations.

 * subversion/libsvn_delta/xdelta.c
  (init_adler32): init adler checksum with 0 instead of 1;
   use svn_adler32 for performance
  (adler32_out): -1 can / must be ommitted now that we
   start at 0 instead of 1 for s1.
  (adler32_replace): special, faster implementation replacing
   a adler32_out(), adler32_in() sequence

  (match_length): new utility function
  (find_match): speed up the main loop by calling match_length()
  (compute_delta): optimize adler32 calculations

 Modified:
    subversion/trunk/subversion/libsvn_delta/xdelta.c

 Hi Stefan,

 This makes my (Windows 32-bit VCE 2008) build fail:
 svn_delta-1.lib(xdelta.obj) : error LNK2019: unresolved external
 symbol _svn__adler32 referenced in function _init_adler32
 ..\..\..\Release\subversion\libsvn_delta\libsvn_delta-1.dll : fatal
 error LNK1120: 1 unresolved externals

 It was actually r1054277 that broke it. The following patch fixes it:
 [[[
 Index: build.conf
 ===
 --- build.conf  (revision 1054417)
 +++ build.conf  (working copy)
 @@ -329,7 +329,7 @@ msvc-export =
          private\svn_log.h private\svn_mergeinfo_private.h
          private\svn_opt_private.h private\svn_skel.h private\svn_sqlite.h
          private\svn_utf_private.h private\svn_eol_private.h
 -        private\svn_token.h
 +        private\svn_token.h private\svn_adler32.h

  # Working copy management lib
  [libsvn_wc]
 ]]]

 Thanks Johan! Committed in r1054421.

 -- Stefan^2.

No problem. I also saw two other warnings (than the ones I just sent),
but don't know exactly which revision introduced them (and don't know
how important they are). You'll probably know quicker than I do :-) :

..\..\..\subversion\libsvn_subr\adler32.c(64): warning C4057:
'function' : 'const Bytef *' differs in indirection to slightly
different base types from 'const char *'
..\..\..\subversion\libsvn_subr\adler32.c(64): warning C4244:
'function' : conversion from 'apr_off_t' to 'uInt', possible loss of
data


-- 
Johan


Re: -bytes branch reports bogus conflict that trunk doesn't

2011-01-02 Thread Johan Corveleyn
On Sun, Jan 2, 2011 at 7:36 PM, Daniel Shahaf danie...@apache.org wrote:
 Johan,

 I saw your sync merge, so I ran 'svn up' with intent to rebuild.  When
 I ran that 'up' with the branch build, I got a conflict, which didn't
 appear when I ran it with a trunk build.  (See attached transcript.)

Did you do that with a build of HEAD of diff-optimizations-bytes (i.e.
only changes to diff2 behavior)? Or with a build where the patch for
diff3 was applied?

Johan

 Furthermore, the conflict resolution (a) marked all the head of the file
 as conflict (which is wrong; line 49 did change, but the first
 20-something lines are identical), and (b) deleted everything after
 line 232.

 I realize that (a) is an outstanding issue (discussed in another thread
 this weekend), but I don't recall that (b) is a known issue.

 Daniel




Re: -bytes branch reports bogus conflict that trunk doesn't

2011-01-02 Thread Johan Corveleyn
On Sun, Jan 2, 2011 at 8:19 PM, Daniel Shahaf danie...@apache.org wrote:
 Johan Corveleyn wrote on Sun, Jan 02, 2011 at 20:13:59 +0100:
 On Sun, Jan 2, 2011 at 7:36 PM, Daniel Shahaf danie...@apache.org wrote:
  Johan,
 
  I saw your sync merge, so I ran 'svn up' with intent to rebuild.  When
  I ran that 'up' with the branch build, I got a conflict, which didn't
  appear when I ran it with a trunk build.  (See attached transcript.)

 Did you do that with a build of HEAD of diff-optimizations-bytes (i.e.
 only changes to diff2 behavior)? Or with a build where the patch for
 diff3 was applied?


 The latter, I believe.

Phew, that's a relief :-).

I'm pretty confident of the diff2 changes that are currently
committed, but the diff3 patch posted yesterday is definitely broken.

Interesting info, none the less, to see how it's broken, so thanks.
I'll see if this helps me understand where it's going wrong ...

 Johan

  Furthermore, the conflict resolution (a) marked all the head of the file
  as conflict (which is wrong; line 49 did change, but the first
  20-something lines are identical), and (b) deleted everything after
  line 232.
 
  I realize that (a) is an outstanding issue (discussed in another thread
  this weekend), but I don't recall that (b) is a known issue.
 
  Daniel
 
 



Re: My take on the diff-optimizations-bytes branch

2011-01-02 Thread Johan Corveleyn
On Sat, Jan 1, 2011 at 11:25 PM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 Hi Johan,

 Thursday night I did something stupid and had a look at  how
 svn blame could be made faster based on the HEAD code in
 your branch.

 One night and most of the following day later, I think I made it
 a few percent faster now. Some of the code I committed directly
 to /trunk and you need to pull these changes into your branch
 to compile the attached patch.

 Feel free to test it, take it apart and morph it any way you like --
 I know the patch isn't very pretty in the first place. I tested the
 patch on x64 LINUX and would like to hear whether it at least
 somewhat improved performance on your system for your
 svn blame config.xml use-case.

 -- Stefan^2.

 [[[
 Speed-up of datasource_open() and its sub-routines
 by a series of optimizations:

 * allocate the file[] array on stack instead of heap
  (all members become directly addressible through
  the stack pointer because all static sub-functions
  will actually be in-lined)
 * minor re-arragements in arithmetic expressions to
  maximize reuse of results (e.g. in INCREMENT_POINTERS)
 * move hot spots to seperate functions and provide a
  specialized version for file_len == 2
  (many loops collapse to a single execution, other
  values can be hard-coded, etc.)
 * use seperate loops to process data within a chunk
  so we don't need to call INCREMENT_POINTERS  friends
  that frequently
 * scan / compare strings in machine-word granularity
  instead of bytes
 ]]]

Hi Stefan,

Thanks for taking a look. I really appreciate it.

When I saw your first couple of commits, to the adler32 stuff, I
already thought: hmmm, he's up to something :-). And after I saw your
change to eol.c#svn_eol__find_eol_start, reading per machine-word, I
was thinking: hey, I could really use something like that in the
prefix/suffix scanning. Nice ... :-)

(I had some trouble applying your patch. It doesn't apply cleanly
anymore to HEAD of the branch (but most hunks were applied correctly,
and I could manually change the rest, so no problem)).

However, first tests are not so great. In fact, it's about 25% slower
(when blaming my settings.xml file with 2200 revisions, it's spending
about 90 seconds in diff, vs. around 72 with HEAD of
diff-optimizations-bytes).

Analyzing further, I can see it's spending significantly less time in
prefix/suffix scanning, but more time in token.c#svn_diff__get_tokens
(which is where the original algorithm gets the tokens/lines and
inserts them into the token tree). This tells me it's not working
correctly: either prefix/suffix scanning fails too soon, so there's
much more left for the regular algorithm. Or it's just not functioning
correctly.

Looking at the result of the blame operation, it seems it's the
latter. The final result of the blame is not correct anymore.

I'll try to look some more into it, to see what's going wrong. Maybe
you can also see it with a simple diff of a large file ... (for a good
example in svn's own codebase, you could try
subversion/tests/cmdline/merge-tests.py (pretty large, and almost 700
revisions)).

Cheers,
-- 
Johan


Re: My take on the diff-optimizations-bytes branch

2011-01-02 Thread Johan Corveleyn
On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Jan 1, 2011 at 11:25 PM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:
 Hi Johan,

 Thursday night I did something stupid and had a look at  how
 svn blame could be made faster based on the HEAD code in
 your branch.

 One night and most of the following day later, I think I made it
 a few percent faster now. Some of the code I committed directly
 to /trunk and you need to pull these changes into your branch
 to compile the attached patch.

 Feel free to test it, take it apart and morph it any way you like --
 I know the patch isn't very pretty in the first place. I tested the
 patch on x64 LINUX and would like to hear whether it at least
 somewhat improved performance on your system for your
 svn blame config.xml use-case.

 -- Stefan^2.

 [[[
 Speed-up of datasource_open() and its sub-routines
 by a series of optimizations:

 * allocate the file[] array on stack instead of heap
  (all members become directly addressible through
  the stack pointer because all static sub-functions
  will actually be in-lined)
 * minor re-arragements in arithmetic expressions to
  maximize reuse of results (e.g. in INCREMENT_POINTERS)
 * move hot spots to seperate functions and provide a
  specialized version for file_len == 2
  (many loops collapse to a single execution, other
  values can be hard-coded, etc.)
 * use seperate loops to process data within a chunk
  so we don't need to call INCREMENT_POINTERS  friends
  that frequently
 * scan / compare strings in machine-word granularity
  instead of bytes
 ]]]

 Hi Stefan,

 Thanks for taking a look. I really appreciate it.

 When I saw your first couple of commits, to the adler32 stuff, I
 already thought: hmmm, he's up to something :-). And after I saw your
 change to eol.c#svn_eol__find_eol_start, reading per machine-word, I
 was thinking: hey, I could really use something like that in the
 prefix/suffix scanning. Nice ... :-)

 (I had some trouble applying your patch. It doesn't apply cleanly
 anymore to HEAD of the branch (but most hunks were applied correctly,
 and I could manually change the rest, so no problem)).

 However, first tests are not so great. In fact, it's about 25% slower
 (when blaming my settings.xml file with 2200 revisions, it's spending
 about 90 seconds in diff, vs. around 72 with HEAD of
 diff-optimizations-bytes).

 Analyzing further, I can see it's spending significantly less time in
 prefix/suffix scanning, but more time in token.c#svn_diff__get_tokens
 (which is where the original algorithm gets the tokens/lines and
 inserts them into the token tree). This tells me it's not working
 correctly: either prefix/suffix scanning fails too soon, so there's
 much more left for the regular algorithm. Or it's just not functioning
 correctly.

 Looking at the result of the blame operation, it seems it's the
 latter. The final result of the blame is not correct anymore.

 I'll try to look some more into it, to see what's going wrong. Maybe
 you can also see it with a simple diff of a large file ... (for a good
 example in svn's own codebase, you could try
 subversion/tests/cmdline/merge-tests.py (pretty large, and almost 700
 revisions)).

Ok, by eliminating parts of the new stuff, I found out the problem is
in scan_backward_fast (part of suffix scanning). If I bypass that, and
always take scan_backward_slow, it works.

And it's fast too! It's taking only 58 seconds in diff, vs. 72 for
the normal version. Splitting that up further, it's taking only 28
seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
some reason, the lcs processing took a little longer, but that's
probably unrelated (only did a single run)). And that's with
scan_backward_slow. That's pretty amazing.

The code of scan_backward_fast is pretty difficult to understand for
me. So I'm not sure if I can spot the problem. I'll continue staring
at it for a little while ...

Cheers,
-- 
Johan


Re: My take on the diff-optimizations-bytes branch

2011-01-02 Thread Johan Corveleyn
On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Jan 1, 2011 at 11:25 PM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:
 Hi Johan,

 Thursday night I did something stupid and had a look at  how
 svn blame could be made faster based on the HEAD code in
 your branch.

 One night and most of the following day later, I think I made it
 a few percent faster now. Some of the code I committed directly
 to /trunk and you need to pull these changes into your branch
 to compile the attached patch.

 Feel free to test it, take it apart and morph it any way you like --
 I know the patch isn't very pretty in the first place. I tested the
 patch on x64 LINUX and would like to hear whether it at least
 somewhat improved performance on your system for your
 svn blame config.xml use-case.

 -- Stefan^2.

 [[[
 Speed-up of datasource_open() and its sub-routines
 by a series of optimizations:

 * allocate the file[] array on stack instead of heap
  (all members become directly addressible through
  the stack pointer because all static sub-functions
  will actually be in-lined)
 * minor re-arragements in arithmetic expressions to
  maximize reuse of results (e.g. in INCREMENT_POINTERS)
 * move hot spots to seperate functions and provide a
  specialized version for file_len == 2
  (many loops collapse to a single execution, other
  values can be hard-coded, etc.)
 * use seperate loops to process data within a chunk
  so we don't need to call INCREMENT_POINTERS  friends
  that frequently
 * scan / compare strings in machine-word granularity
  instead of bytes
 ]]]

 Hi Stefan,

 Thanks for taking a look. I really appreciate it.

 When I saw your first couple of commits, to the adler32 stuff, I
 already thought: hmmm, he's up to something :-). And after I saw your
 change to eol.c#svn_eol__find_eol_start, reading per machine-word, I
 was thinking: hey, I could really use something like that in the
 prefix/suffix scanning. Nice ... :-)

 (I had some trouble applying your patch. It doesn't apply cleanly
 anymore to HEAD of the branch (but most hunks were applied correctly,
 and I could manually change the rest, so no problem)).

 However, first tests are not so great. In fact, it's about 25% slower
 (when blaming my settings.xml file with 2200 revisions, it's spending
 about 90 seconds in diff, vs. around 72 with HEAD of
 diff-optimizations-bytes).

 Analyzing further, I can see it's spending significantly less time in
 prefix/suffix scanning, but more time in token.c#svn_diff__get_tokens
 (which is where the original algorithm gets the tokens/lines and
 inserts them into the token tree). This tells me it's not working
 correctly: either prefix/suffix scanning fails too soon, so there's
 much more left for the regular algorithm. Or it's just not functioning
 correctly.

 Looking at the result of the blame operation, it seems it's the
 latter. The final result of the blame is not correct anymore.

 I'll try to look some more into it, to see what's going wrong. Maybe
 you can also see it with a simple diff of a large file ... (for a good
 example in svn's own codebase, you could try
 subversion/tests/cmdline/merge-tests.py (pretty large, and almost 700
 revisions)).

 Ok, by eliminating parts of the new stuff, I found out the problem is
 in scan_backward_fast (part of suffix scanning). If I bypass that, and
 always take scan_backward_slow, it works.

 And it's fast too! It's taking only 58 seconds in diff, vs. 72 for
 the normal version. Splitting that up further, it's taking only 28
 seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
 some reason, the lcs processing took a little longer, but that's
 probably unrelated (only did a single run)). And that's with
 scan_backward_slow. That's pretty amazing.

 The code of scan_backward_fast is pretty difficult to understand for
 me. So I'm not sure if I can spot the problem. I'll continue staring
 at it for a little while ...

Ok, I can't find it right now. There must be some functional
difference between scan_backward_fast and scan_backward_slow.

For now, some feedback on the rest of the patch:

[[[
 Index: subversion/libsvn_diff/diff_file.c
 ===
 --- subversion/libsvn_diff/diff_file.c(revision 1054044)
 +++ subversion/libsvn_diff/diff_file.c(working copy)
 ... snip ...
 @@ -363,53 +364,152 @@
  /* Check whether one of the FILEs has its pointers at EOF (this is the case if
   * one of them has curp == endp (this can only happen at the last chunk)) */
  static svn_boolean_t
 -is_one_at_eof(struct file_info *file[], apr_size_t file_len)
 +is_one_at_eof(struct file_info file[], apr_size_t file_len)
  {
apr_size_t i;

   for (i = 0; i < file_len; i++)
 -    if (file[i]->curp == file[i]->endp)
 +    if (file[i].curp == file[i].endp)
return TRUE;

return FALSE

Re: My take on the diff-optimizations-bytes branch

2011-01-03 Thread Johan Corveleyn
On Mon, Jan 3, 2011 at 12:03 PM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 On 03.01.2011 02:14, Johan Corveleyn wrote:

 On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleynjcor...@gmail.com
  wrote:

[... snip ...]

 And it's fast too! It's taking only 58 seconds in diff, vs. 72 for
 the normal version. Splitting that up further, it's taking only 28
 seconds in prefix/suffix scanning, vs. ~47 for the normal version (for
 some reason, the lcs processing took a little longer, but that's
 probably unrelated (only did a single run)). And that's with
 scan_backward_slow. That's pretty amazing.

 On x64, datasource_open() got about 10x faster
 but my test-case was the TSVN changelog where
 we (mainly) insert at the beginning instead of
 close to end of the file. So, one broken scan
 direction seems completely plausible ;)

Strange, if you mainly insert at the beginning, it should be mostly
broken, because then it's mostly identical suffix (the part which is
broken). But maybe it's only broken under certain edge conditions, so
that could still explain it...

Maybe it's best to refer to datasource*s*_open(), as opposed to the
original datasource_open(). In the original diff algorithm,
datasources were opened each one by one (and then extracting tokens,
inserting them into token tree, and then move on to the next
datasource). In diff-optimizations-bytes branch, I introduced
datasource*s*_open, to open them together, in order to chop off the
identical prefix/suffix. It's probably not a good function name,
because it's so similar to the old one (maybe datasource_open_all or
something would be better). But OTOH: when I started with this, I
thought I could remove/deprecate the datasource_open, once all diffN
algorithms would use datasource*s*_open. Oh well ...
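
To illustrate the difference in shape (a sketch from memory with guessed
parameter names; only the call visible in the diff3.c patch earlier in
this thread is authoritative):

[[[
#include "svn_diff.h"   /* svn_diff_datasource_e, svn_error_t, apr types */

/* Sketch only -- not the real vtable declaration from the branch. */
typedef struct diff_open_fns_sketch_t
{
  /* Old style: open one datasource at a time; its tokens are extracted
   * before the next datasource is opened. */
  svn_error_t *(*datasource_open)(void *diff_baton,
                                  svn_diff_datasource_e datasource);

  /* Branch style: open all datasources together, so the identical
   * prefix/suffix can be chopped off before tokenizing; the number of
   * skipped prefix lines is returned in *prefix_lines. */
  svn_error_t *(*datasources_open)(void *diff_baton,
                                   apr_off_t *prefix_lines,
                                   svn_diff_datasource_e datasources[],
                                   apr_size_t datasources_len);
} diff_open_fns_sketch_t;
]]]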

 The code of scan_backward_fast is pretty difficult to understand for
 me. So I'm not sure if I can spot the problem. I'll continue staring
 at it for a little while ...

 Ok, I can't find it right now. There must be some functional
 difference between scan_backward_fast and scan_backward_slow.

 One way to get closer to the problem:

 * create a backup of the current rename scan_backward_fast code
 * copy scan_backward_slow to scan_backward_fast
 * replace all file_len with a hard-coded 2
  - can give a 50% boost depending on optimizer cleverness
 * step-by-step apply changes from the old scan_backward_fast code

Ok, if I have some time tonight, I will try to get closer ...

 For now, some feedback on the rest of the patch:

 [[[

[... snip ...]

 +  if (had_cr)
 +    {
  +      /* Check if we ended in the middle of a \r\n for one file, but \r for
  +         another. If so, back up one byte, so the next loop will back up
  +         the entire line. Also decrement *prefix_lines, since we counted one
  +         too many for the \r. */
 +      if ((*file[0].curp == '\n') || (*file[1].curp == '\n'))

 Here (or below DECREMENT, but within this same if condition) you need:
          (*prefix_lines)--;

 just like in the original (and *_slow) case. As is explained in the
 comment above: if we are here, we actually counted a full prefix line
 too much, because we already counted one for a matching '\r', but the
 character after that was '\n' in one file, but a different character
 in the other. So the eol-termination is different, so this is a
 non-matching line which needs to be fully rolled back (hence also the
 extra DECREMENT here).

 I'm not even sure the original code is correct at all.
 For instance, it does not handle '\r' (Mac style) EOLs
 properly. Not really sure how to fix that.

Are you sure it's not correct for Mac style EOLs? Can you give an example?

I thought I made it work correctly with '\r' EOLs (the very first
iterations of this (when I was still sending it as patches to the
dev-list) only worked for \r\n and \n, but somewhere along the way, I
made it handle \r EOLs as well). That's what the entire block below
/* check for eol, and count */ is for:
- if \r: set had_cr = TRUE, count an extra prefix_line.
- if \n: only count an additional prefix_line if the had_cr == FALSE,
because otherwise it's the second half of a \r\n EOL, and we already
counted it for the \r.
- all the rest: don't care, just move on (setting had_cr = FALSE).
- In the special case where prefix scanning ended with a mismatch
between a '\r\n' and '\rsome other character', decrement prefix_line
and go back one byte, because this line wasn't identical after all.

(my older versions just counted the number of \n's, which is ok for
both \n and \r\n).

The only reason to look at EOLs is to count the number of prefix_lines
(and to rollback the last non-matching line until the previous EOL,
but that's done by going back until finding either a \r or an \n). All
the rest is just comparing bytes.
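
To make the counting rule concrete, a minimal stand-alone sketch (not the
branch code, just the \r / \n / \r\n logic described above):

[[[
#include <stdio.h>

/* Count the lines in the identical prefix BUF[0..LEN-1], treating "\r",
 * "\n" and "\r\n" each as one line ending: a '\n' directly after a '\r'
 * is the second half of a "\r\n" and is not counted again. */
static long
count_prefix_lines(const char *buf, long len)
{
  long lines = 0;
  int had_cr = 0;
  long i;

  for (i = 0; i < len; i++)
    {
      if (buf[i] == '\r')
        {
          lines++;
          had_cr = 1;
        }
      else
        {
          if (buf[i] == '\n' && !had_cr)
            lines++;
          had_cr = 0;
        }
    }

  return lines;
}

int main(void)
{
  printf("%ld\n", count_prefix_lines("a\r\nb\rc\n", 7));   /* prints 3 */
  return 0;
}
]]]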

(note that it doesn't support ignoring EOL-style differences (see the
comment /* ### TODO: see if we can take advantage of diff options
like ignore_eol_style or ignore_space. */). Maybe

Re: diff-optimizations-bytes: how to make diff3 work with prefix/suffix scanning

2011-01-05 Thread Johan Corveleyn
On Wed, Jan 5, 2011 at 12:11 PM, Julian Foad julian.f...@wandisco.com wrote:
 On Sat, 2011-01-01, Johan Corveleyn wrote:
 [ Taking a privately-started discussion with danielsh to the list, in
 case others have inspiration/insight about this. Question at hand: I'm
 having trouble making diff3 work with prefix/suffix scanning of the
 diff-optimizations-bytes branch. Any feedback is highly appreciated
 :-). ]

 On Fri, Dec 31, 2010 at 2:38 PM, Daniel Shahaf d...@daniel.shahaf.name 
 wrote:
 [...snip...]
  In diff3 with prefix enabled, I was seeing failures like this:
 
  EXPECTED:
  line1
  <<<<<<<
  line2
  =======
  line2 modified
  >>>>>>>
 
  ACTUAL:
  <<<<<<<
  line1
  line2
  =======
  line1
  line2 modified
  >>>>>>>
 
  so I assumed that when the prefix code gobbled the shared prefix.  How
  to fix it from there was purely guesswork on my part (and I haven't
  implemented it).
 
  These failures would go away if I prepended a line0 left and line0
  right to the file.

 Yes, I know. That's indeed because the prefix-scanning code gobbled up
 the identical prefix. That problem is also there in diff2.

 In diff2, I fixed this by simply pre-pending a piece of common diff
 to the diff linked list, in the function diff.c#svn_diff__diff.

 From r1037371:
 [[[
 --- 
 subversion/branches/diff-optimizations-bytes/subversion/libsvn_diff/diff.c
 (original)
 +++ 
 subversion/branches/diff-optimizations-bytes/subversion/libsvn_diff/diff.c
 Sun Nov 21 02:36:07 2010
 @@ -43,6 +43,22 @@ svn_diff__diff(svn_diff__lcs_t *lcs,
   svn_diff_t *diff;
    svn_diff_t **diff_ref = &diff;

  +  if (want_common && (original_start > 1))
  +    {
  +      /* we have a prefix to skip */
  +      (*diff_ref) = apr_palloc(pool, sizeof(**diff_ref));
  +
  +      (*diff_ref)->type = svn_diff__type_common;
  +      (*diff_ref)->original_start = 0;
  +      (*diff_ref)->original_length = original_start - 1;
  +      (*diff_ref)->modified_start = 0;
  +      (*diff_ref)->modified_length = modified_start - 1;
  +      (*diff_ref)->latest_start = 0;
  +      (*diff_ref)->latest_length = 0;
  +
  +      diff_ref = &(*diff_ref)->next;
  +    }

 Hi Johan.

 I know this is work in progress but that whole if block of code is a
 good example of where a comment above the block would be *really* useful
 to tell the reader what it's for.  :-)

Ok, you're absolutely right. If I can't throw this block away (as
suggested below), I'll add some more comments.

 (Also, all that dereferencing is unnecessary in that particular block,
 but I looked at the rest of the function and saw that that style is used
 in two other blocks where it is not unnecessary.  Personally I would
 find the following style more readable throughout the function:

       {
         svn_diff_t *new_diff = apr_palloc(pool, sizeof(*new_diff));
          new_diff->type = ...
          ...

          *diff_ref = new_diff;
          diff_ref = &new_diff->next;
       }
 )

Yep, that's definitely clearer. I just copy-pasted from the other
similar blocks in that function. But you're right.

Now, this style is also used in other similar places in diff3.c and
diff4.c. So if it's decided that it's best to change this to make it
more readable, I guess it should be changed in all those places
(preferably on trunk, it's not specific to the
diff-optimizations-bytes branch).

   while (1)
     {
        if (original_start < lcs->position[0]->offset
 ]]]

 Naively, I wanted to do the same for diff3 (along with some other
 tricks, to make the algorithm aware of the prefix_lines), but it
 doesn't work. See the patch in attachment.

 I mean: it works partly: diff-diff3-test.exe passes with the attached
 patch, but merge_reintegrate_tests.py (1, 2, 10, 13 and 14) and
 merge_tests.py (61) don't. I haven't studied the test failures in
 detail (I probably should, but I currently can't wrap my head around
 those tests).

 I'm now pondering another approach, to pre-pend an lcs piece to the
 lcs linked list, and let the rest of the algorithm do its work as
 normal. Inspiration for this came when I was reading the code of
 diff3.c#svn_diff__resolve_conflict, where a similar trick is used
 around line 121:
 [[[
     /* If there were matching symbols at the start of
      * both sequences, record that fact.
      */
     if (common_length  0)
       {
         lcs = apr_palloc(subpool, sizeof(*lcs));
         lcs-next = NULL;
         lcs-position[0] = start_position[0];
         lcs-position[1] = start_position[1];
         lcs-length = common_length;

         lcs_ref = lcs-next;
       }
 ]]]

 Maybe I can even integrate this logic into lcs.c#svn_diff__lcs, where
 I'm already passing the prefix_lines as a new parameter. If that would
 work out, I could probably remove the special common-diff-injecting
 code in diff.c#svn_diff__diff.

 Thoughts?

 Just from reading what you say here, that last solution sounds much
 better.

Yep. I hope to get this working soon. I did an attempt yesterday, but
ended up causing an infinite loop somewhere/somehow, and gave up
around 2am :-). I'll

Re: svn commit: r1054701 - in /subversion/trunk/subversion/bindings/javahl/native: CopySources.cpp CreateJ.cpp EnumMapper.cpp ListCallback.cpp Revision.cpp RevisionRange.cpp StatusCallback.cpp org_apa

2011-01-10 Thread Johan Corveleyn
On Mon, Jan 3, 2011 at 7:34 PM,  hwri...@apache.org wrote:
 Author: hwright
 Date: Mon Jan  3 18:34:35 2011
 New Revision: 1054701

 URL: http://svn.apache.org/viewvc?rev=1054701view=rev
 Log:
 Fix JavaHL build and test failures introduced in r1054680.

 * subversion/bindings/javahl/native/CreateJ.cpp,
  subversion/bindings/javahl/native/StatusCallback.cpp,
  subversion/bindings/javahl/native/CopySources.cpp,
  subversion/bindings/javahl/native/Revision.cpp,
  subversion/bindings/javahl/native/RevisionRange.cpp,
  subversion/bindings/javahl/native/EnumMapper.cpp,
  subversion/bindings/javahl/native/org_apache_subversion_javahl_SVNClient.cpp,
  subversion/bindings/javahl/native/ListCallback.cpp:
    Update references to moved classes.

 Modified:
    subversion/trunk/subversion/bindings/javahl/native/CopySources.cpp
    subversion/trunk/subversion/bindings/javahl/native/CreateJ.cpp
    subversion/trunk/subversion/bindings/javahl/native/EnumMapper.cpp
    subversion/trunk/subversion/bindings/javahl/native/ListCallback.cpp
    subversion/trunk/subversion/bindings/javahl/native/Revision.cpp
    subversion/trunk/subversion/bindings/javahl/native/RevisionRange.cpp
    subversion/trunk/subversion/bindings/javahl/native/StatusCallback.cpp
    
 subversion/trunk/subversion/bindings/javahl/native/org_apache_subversion_javahl_SVNClient.cpp

  snip ... 

 Modified: subversion/trunk/subversion/bindings/javahl/native/CreateJ.cpp
 URL: 
 http://svn.apache.org/viewvc/subversion/trunk/subversion/bindings/javahl/native/CreateJ.cpp?rev=1054701r1=1054700r2=1054701view=diff
 ==
 --- subversion/trunk/subversion/bindings/javahl/native/CreateJ.cpp (original)
 +++ subversion/trunk/subversion/bindings/javahl/native/CreateJ.cpp Mon Jan  3 
 18:34:35 2011
 @@ -31,7 +31,7 @@
   #include "EnumMapper.h"
   #include "RevisionRange.h"
   #include "CreateJ.h"
  -#include "../include/org_apache_subversion_javahl_Revision.h"
  +#include "../include/org_apache_subversion_javahl_types_Revision.h"

Typo/Replace-o? This broke something:

WARNING: ..\include\org_apache_subversion_javahl_types_Revision.h
header not found, file subversion\bindings\javahl\native\CreateJ.cpp

The .h file is still named ../include/org_apache_subversion_javahl_Revision.h

Cheers,
-- 
Johan


Re: Diff optimizations and generating big test files

2011-01-16 Thread Johan Corveleyn
On Wed, Jan 5, 2011 at 4:33 PM, Philip Martin
philip.mar...@wandisco.com wrote:
 Johan Corveleyn jcor...@gmail.com writes:

 Thanks for the script, it gives me some good inspiration.

 However, it doesn't fit well with the optimization that's currently
 being done on the diff-optimizations-bytes branch, because the
 differing lines are spread throughout the entire file.

 I thought you were working on two different prefix problems, but if it's
 all the same problem that's fine.  It's why I want *you* to write the
 script, then I can test your patches on my machine.  When you are
 thinking of replacing function calls with macros that's very much
 hardware/OS/compiler specific and testing on more than one platform is
 important.

Sorry it took so long (busy/interrupted with other things), but here
in attachment is finally a python script that generates two files
suitable for testing the prefix/suffix optimization of the
diff-optimizations-bytes branch:

- Without options, it generates two files file1.txt and file2.txt,
with 100,000 lines of identical prefix and 100,000 lines of identical
suffix. And in between a mis-matching section of 500 lines (with a
probability of mismatch of 50%).

- Lines are randomly generated, with random lengths between 0 and 80
(by default).

- On my machine, it generates those two files of ~8 Mb in about 17 seconds.

- Usage: see below.


Tests on my machine (Win XP 32 bit, Intel T2400 CPU @ 1.83 GHz) show
the following:

1) tools/diff/diff from trunk@1058723:
   1.020 s

2) tools/diff/diff from diff-optimizations@1058811:
   0.370 s

3) tools/diff/diff from diff-optimizations@1058811 with stefan2's
low-level optimizations [1]:
   0.290 s

4) GNU diff:
   0.157 s

(it should be noted that svn's tools/diff/diff has a much higher
startup cost than GNU diff (for whatever reason), so that alone
accounts for part of the difference with GNU diff)

For really analyzing the benefit of the low-level optimizations (and
which part of those has the most impact), maybe bigger sample data is
needed.


===
$ ./gen-big-files.py --help
Usage: Generate files for diff

Options:
  -h, --help            show this help message and exit
  -1 FILE1, --file1=FILE1
filename of left file of the diff, default file1.txt
  -2 FILE2, --file2=FILE2
filename of right file of the diff, default file2.txt
  -p PREFIX_LINES, --prefix-lines=PREFIX_LINES
number of prefix lines, default 10
  -s SUFFIX_LINES, --suffix-lines=SUFFIX_LINES
number of suffix lines, default 10
  -m MIDDLE_LINES, --middle-lines=MIDDLE_LINES
number of lines in the middle, non-matching section,
default 500
  --percent-mismatch=PERCENT_MISMATCH
percentage of mismatches in middle section, default 50
  --min-line-length=MIN_LINE_LENGTH
minimum length of randomly generated lines, default 0
  --max-line-length=MAX_LINE_LENGTH
maximum length of randomly generated lines, default 80

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2011-01/0005.shtml - I have yet to
integrate (some of) these suggestions into the branch. That may take
me another couple of days (identifying which changes have the biggest
speed/weight gain etc).


Re: My take on the diff-optimizations-bytes branch

2011-01-16 Thread Johan Corveleyn
On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 On 03.01.2011 02:14, Johan Corveleyn wrote:

 On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleynjcor...@gmail.com
  wrote:

 On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleynjcor...@gmail.com
  wrote:
 For now, some feedback on the rest of the patch:

 [[[

 -          DECREMENT_POINTERS(file, file_len, pool);
 -        }
 +        DECREMENT_POINTERS(file, file_len, pool);
      }

 +  return SVN_NO_ERROR;
 +}
 +
 +/* Find the prefix which is identical between all elements of the FILE
 array.
 + * Return the number of prefix lines in PREFIX_LINES.  REACHED_ONE_EOF
 will be
 + * set to TRUE if one of the FILEs reached its end while scanning
 prefix,
 + * i.e. at least one file consisted entirely of prefix.  Otherwise,
 + * REACHED_ONE_EOF is set to FALSE.
 + *
 + * After this function is finished, the buffers, chunks, curp's and
 endp's
 + * of the FILEs are set to point at the first byte after the prefix. */
 +static svn_error_t *
 +find_identical_prefix(svn_boolean_t *reached_one_eof, apr_off_t
 *prefix_lines,
 +                      struct file_info file[], apr_size_t file_len,
 +                      apr_pool_t *pool)
 +{
 +  if (file_len == 2)
 +    SVN_ERR(find_identical_prefix_fast(reached_one_eof, prefix_lines,
 +                                       file, pool));
 +  else
 +    SVN_ERR(find_identical_prefix_slow(reached_one_eof, prefix_lines,
 +                                       file, file_len, pool));
 +

 Small problem here (not really visible in practice, but potential
 change in behavior none the less): if either of the above function
 calls exited early because of reached_all_eof, this function should
 also exit early.

 Yep, that is just an artifact caused by moving code
 into a sub-function.

Ok, that's no problem anymore, since I removed the early-exit in
r1057435. It was actually not correct, and was the main reason why
some tests kept failing when I ported it to diff3 (see that revisions
commit message for more details).

[ ... snip ... ]

 Also, it would really help if you could split up this patch (though
 this was certainly a good way to try it out and give me an idea of the
 possibilities).

 Giving you some ideas was the main point of it.

 If I find some time, I will try to reduce the patch
 (no fast / slow separation). But since the heap
 to stack conversion spreads across all of the
 code, it is hard to split the patch.

 It would be interesting to see where the biggest gains
 are coming from (I'm guessing from the per-machine-word
 reading/comparing; I'd like to try that first, maybe together with the
 allocating of the file[] on the stack; I'd like to avoid
 special-casing file_len==2, splitting up functions in *_slow and
 *_fast, because it makes it a lot more difficult imho (so I'd only
 like to do that if it's a significant contributor)). But if you want
 to leave that as an exercise for the reader, that's fine as well :-).

 Exercise is certainly not a bad thing ;)

Ok, I'll give it a go one of the coming days, now that I've got
diff3/diff4 working, and I've got a suitable
big-files-generating-script for creating reproducible test-files.

 But I think, the stack variable is certainly helpful
 and easy to do. While chunky operation gives
 a *big* boost, it is much more difficult to code if
 you need to compare multiple sources. It should
 be possible, though.

 The great advantage of the file_len==2 case is
 that this is the only time-critical operation (a
 true merge over long files is rare). Also, many
 constructs collapse if the count is fixed to 2:

 for (i = 1, is_match = TRUE; i < file_len; i++)
   is_match = is_match && *file[0].curp == *file[i].curp;
 while(is_match)
 {
 ...
   for (i = 1, is_match = TRUE; i < file_len; i++)
     is_match = is_match && *file[0].curp == *file[i].curp;
 }

 becomes

 while(*file[0].curp == *file[1].curp)
 {
 }
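
 As a concrete illustration of the chunky idea for the two-file case,
 something along these lines (untested sketch, assuming unaligned word
 reads are OK, which is what the SVN_UNALIGNED_ACCESS_IS_OK guard in the
 xdelta patch is for):

 [[[
 #include <stddef.h>

 /* Length of the common prefix of A and B, comparing a machine word at a
  * time and falling back to bytes for the tail / to pinpoint the mismatch
  * (same pattern as match_length() in the xdelta patch). */
 static size_t
 common_prefix_length(const char *a, const char *b, size_t max_len)
 {
   size_t pos = 0;

   for (; pos + sizeof(size_t) <= max_len; pos += sizeof(size_t))
     if (*(const size_t *)(a + pos) != *(const size_t *)(b + pos))
       break;

   for (; pos < max_len; ++pos)
     if (a[pos] != b[pos])
       break;

   return pos;
 }
 ]]]

 Note that for prefix scanning the EOL counting still has to look at the
 matched bytes afterwards, so the word-sized loop only replaces the byte
 compares, not the line counting.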

Hm, that's true, but I'm not sure that will have big impact (but like
you say, we'll need to do more tests). And now that I've read about a
really bad merge-case on the users-list [1], I'm thinking that diff3
should probably also get all the perf-improvement that we can give it.

Actually, what's a little bit troubling is that there are currently
only 3 possible file_len's, of which only 2 are used in practice:
diff2 and diff3 (diff4 is not used in svn core, only in
tools/diff/diff4). So, if we make a slow and a fast version, we could
just as well make a XXX2 and an XXX3 version explicitly, and dispense
with the need to pass arrays and array lengths. Hmmm... generic (but
more complex) code vs. a couple of specialized versions.


 But basically, we will need to run some more tests.

 -- Stefan^2.



Cheers,
-- 
Johan

[1] http://svn.haxx.se/users/archive-2011-01/0245.shtml


Re: Coding goals / requirements.

2011-01-18 Thread Johan Corveleyn
Hi,

As the guy responsible for the quote that started this thread ([1]):

 Actually, what's a little bit troubling is that there are currently
 only 3 possible file_len's, of which only 2 are used in practice:
 diff2 and diff3 (diff4 is not used in svn core, only in
 tools/diff/diff4). So, if we make a slow and a fast version, we could
 just as well make a XXX2 and an XXX3 version explicitly, and dispense
 with the need to pass arrays and array lengths. Hmmm... generic (but
 more complex) code vs. a couple of specialized versions.

I'd like to point out that I didn't mean to criticize the API choices,
I was just weighing the options. In fact, I came up with the initial
API choice myself (passing an array of files, instead of passing 2
resp. 3 resp. 4 files), because I wanted it to be generically useful.
In any case it's a new API in the libsvn_diff module, for making
prefix/suffix optimizations possible.

I don't want to hijack this high-level discussion to be a technical
one again, but it may provide some additional context to give some
more details about the thoughts behind it ...

In fact, I started out with the simple form: passing 2 file
arguments for regular diff. But then I discovered there was also
something like diff3 (with 3 files) and diff4 (with 4 files; not used
in core SVN). So I simply thought to myself: theoretically this
optimization should work with N files, so why not just make it so. And
I changed the API to accept an array of files.

This made the code more complex, but in reality it's always the same
approach (so once you 'get' it, it's easy to follow):

Things like:

while(*file0.curp == *file1.curp)
{
  ...
}

are replaced with (actually AND-ing N (or N-1) conditions):

for (i = 1, is_match = TRUE; i < file_len; i++)
  is_match = is_match && *file[0].curp == *file[i].curp;
while(is_match)
{
  ...
  for (i = 1, is_match = TRUE; i < file_len; i++)
    is_match = is_match && *file[0].curp == *file[i].curp;
}


So, to be very concrete, it's really the choice between:

(1) A generic set of functions, that work for N files, which are
signifcantly more complex than the N==2 case:

datasources_open(file[])
find_identical_prefix(file[])
find_identical_suffix(file[])

(2) A specialized set of functions for 2, 3 and 4 files (4 is maybe
optional, because not used in svn core), which are a lot simpler, but
actually completely duplicate the higher level logic:

datasources_open2(file0, file1)
find_identical_prefix2(file0, file1)
find_identical_suffix2(file0, file1)

datasources_open3(file0, file1, file2)
find_identical_prefix3(file0, file1, file2)
find_identical_suffix3(file0, file1, file2)

datasources_open4(file0, file1, file2, file3)
find_identical_prefix4(file0, file1, file2, file3)
find_identical_suffix4(file0, file1, file2, file3)
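
A hypothetical middle ground would keep (1) as the API but dispatch
internally to a simpler two-file implementation, which is roughly what
the *_fast / *_slow split in stefan2's patch does. A stand-alone sketch
(placeholder struct and names, no line counting or chunk refills):

[[[
struct file_info_sketch { const char *curp; const char *endp; };

/* Tight two-file loop: the hot path for plain diff. */
static void
find_identical_prefix_2(struct file_info_sketch *f0,
                        struct file_info_sketch *f1)
{
  while (f0->curp < f0->endp && f1->curp < f1->endp
         && *f0->curp == *f1->curp)
    {
      f0->curp++;
      f1->curp++;
    }
}

/* Generic N-file loop: AND together the comparisons against file[0]. */
static void
find_identical_prefix_n(struct file_info_sketch file[], int file_len)
{
  for (;;)
    {
      int i, is_match = 1;

      for (i = 0; i < file_len; i++)
        if (file[i].curp == file[i].endp)
          return;
      for (i = 1; i < file_len; i++)
        is_match = is_match && *file[0].curp == *file[i].curp;
      if (!is_match)
        return;
      for (i = 0; i < file_len; i++)
        file[i].curp++;
    }
}

/* Generic entry point, as in option (1), dispatching on file_len. */
static void
find_identical_prefix(struct file_info_sketch file[], int file_len)
{
  if (file_len == 2)
    find_identical_prefix_2(&file[0], &file[1]);   /* diff2 */
  else
    find_identical_prefix_n(file, file_len);       /* diff3 / diff4 */
}
]]]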


At the time, I felt that (1) was the right choice, WRT power/weight
balance, keeping the logic in one single place. But with stefan2's
proposed low-level optimizations of the algorithm, it's more or less
open for debate :). Those low-level optimizations (which are
definitely worth it) are much more difficult if you want to write them
generically. And apart from that, splitting out the functions in _2,
_3 and _4 variants is also a low-level optimization by itself.

Right now, I'm still trying to keep it generic (trying to integrate
stefan2's suggestions in a generic way; I'm assuming for now that the
low-level optimization coming from the splitting-out itself is not
significant (I'll have to test that)), but I'm not so sure anymore.
We'll see...

Anyway, all in all, I certainly don't feel like I've wasted my time,
because it was also a great learning experience for me (new to C). And
being a perfectionist, I want it to be as good as possible WRT the
power/weight ratio (the most bang (speed/features/...) for the buck
(complexity/maintainability/readability)) :-).

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2011-01/0241.shtml

On Tue, Jan 18, 2011 at 7:37 AM, Arwin Arni ar...@collab.net wrote:
 Hi All,

 It is a very interesting notion that Gavin throws at us. I think it is very
 important for an open-source project to maintain it's code in a way that it
 is easy for a new guy (like me) to make quick and meaningful changes. Most
 open-source projects with a large development community ends up in a mess of
 perfectly working, yet highly unreadable code.

 However, as a pair of fresh eyes looking at the code, I can safely say that
 though being an open-source project, Subversion has managed to stay
 readable. This can only be attributed to the hours of work spent on
 over-engineering the code (BTW, I personally don't think anything can be
 over-engineered. In my book, it is merely a synonym for perfection).

 There are parts of the code (in svn) that have been written with the notion
 of getting things done. And these are the parts that I really find
 difficult to 

Re: My take on the diff-optimizations-bytes branch

2011-01-18 Thread Johan Corveleyn
On Mon, Jan 17, 2011 at 3:12 AM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:

[ ... snip ... ]

 But I think, the stack variable is certainly helpful
 and easy to do.

Ok, I've done this (locally, still have to clean up a little and then
commit). It gives a nice 10% speedup of prefix/suffix scanning, which
is great.

I have a couple of small questions though, about other small
optimizations you did:

 Index: subversion/libsvn_diff/diff_file.c
 ===
 --- subversion/libsvn_diff/diff_file.c(revision 1054044)
 +++ subversion/libsvn_diff/diff_file.c(working copy)
...
 @@ -258,10 +259,10 @@
   
 \
  for (i = 0; i < files_len; i++)  
 \
  {
 \
 -  if (all_files[i]->curp < all_files[i]->endp - 1)   
 \
 -    all_files[i]->curp++;
 \
 +  if (all_files[i].curp + 1 < all_files[i].endp) 
 \
 +    all_files[i].curp++; 
 \

Just curious: why did you change that condition (from 'X < Y - 1' to
'X + 1 < Y')?

You mention in your log message: minor re-arragements in arithmetic
expressions to maximize reuse of results (e.g. in
INCREMENT_POINTERS).

Why does this help, and what's the potential impact?


Second question:

 +find_identical_prefix_slow(svn_boolean_t *reached_one_eof,
 +   apr_off_t *prefix_lines,
 +   struct file_info file[], apr_size_t file_len,
 +   apr_pool_t *pool)
 +{
 +  svn_boolean_t had_cr = FALSE;
svn_boolean_t is_match, reached_all_eof;
apr_size_t i;
 +  apr_off_t lines = 0;

*prefix_lines = 0;
for (i = 1, is_match = TRUE; i < file_len; i++)
 -    is_match = is_match && *file[0]->curp == *file[i]->curp;
 +    is_match = is_match && *file[0].curp == *file[i].curp;
 +
while (is_match)
  {
/* ### TODO: see if we can take advantage of
   diff options like ignore_eol_style or ignore_space. */
/* check for eol, and count */
 -  if (*file[0]->curp == '\r')
 +  if (*file[0].curp == '\r')
  {
 -  (*prefix_lines)++;
 +  lines++;

Here you added a local variable 'lines', which is incremented in the
function, and copied to *prefix_lines at the end (you do the same in
find_identical_prefix_fast). The original code just sets and
increments *prefix_lines directly. So, out of curiosity: why does
this help, and how much?

Thanks,
-- 
Johan


Re: My take on the diff-optimizations-bytes branch

2011-01-22 Thread Johan Corveleyn
On Wed, Jan 19, 2011 at 1:19 AM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 On 18.01.2011 12:56, Johan Corveleyn wrote:

 On Mon, Jan 17, 2011 at 3:12 AM, Johan Corveleynjcor...@gmail.com
  wrote:

 On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de  wrote:

 [ ... snip ... ]

 But I think, the stack variable is certainly helpful
 and easy to do.

 Ok, I've done this (locally, still have to clean up a little and then
 commit). It gives a nice 10% speedup of prefix/suffix scanning, which
 is great.

 I have a couple of small questions though, about other small
 optimizations you did:

 Index: subversion/libsvn_diff/diff_file.c
 ===
 --- subversion/libsvn_diff/diff_file.c  (revision 1054044)
 +++ subversion/libsvn_diff/diff_file.c  (working copy)

 ...

 @@ -258,10 +259,10 @@

     \
      for (i = 0; i < files_len; i++)
      \
      {
      \
 -      if (all_files[i]->curp < all_files[i]->endp - 1)
     \
 -        all_files[i]->curp++;
      \
 +      if (all_files[i].curp + 1 < all_files[i].endp)
     \
 +        all_files[i].curp++;
     \

 Just curious: why did you change that condition (from 'X < Y - 1' to
 'X + 1 < Y')?

 You mention in your log message: minor re-arragements in arithmetic
 expressions to maximize reuse of results (e.g. in
 INCREMENT_POINTERS).

 Why does this help, and what's the potential impact?

 There is a hidden common sub-expression.
 Variant 1 in pseudo-code:

 lhs = all_files[i].curp;
 temp1 = all_files[i].endp;
 rhs = temp1 - 1;
 if (lhs < rhs)
 {
  temp2 = lhs + 1;
  all_files[i].curp = temp2;
 }

 Variant 2:

 lhs = all_files[i].curp;
 temp = lhs + 1;
 rhs = all_files[i].endp;
 if (temp < rhs)
  all_files[i].curp = temp;

 The problem is that the compiler cannot figure that out
 on its own because (x+1) and (y-1) don't overflow / wrap
 around for the same values. In particular 0 < (0-1) is true
 and (0+1) < 0 is false in unsigned arithmetic.

Thanks for the explanation. So, I understand why this should be more
efficient, except ... that it's not, for some reason. At least not
after testing it on my x86 machine.

It's about 1% slower, tested with two test files generated with the
gen-big-files.py script (sent in the other thread), with 1,000,000
prefix and suffix lines, and 500 mismatching lines in the middle:

$ ./gen-big-files.py -1 big1.txt -2 big2.txt -p 100 -s 100

So for now, I didn't commit this change. Maybe it's better on other
platforms, so it may still make sense to do it, but then again, at
around 1% it's probably not that important ...
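
To make the wrap-around point above concrete, here is a minimal
standalone illustration (not taken from the patch; it uses plain
unsigned ints instead of pointers) of why the two forms are not
interchangeable near zero:

[[[
#include <stdio.h>

int main(void)
{
  unsigned int x = 0;
  unsigned int y = 0;

  /* 'x < y - 1' wraps: 0 - 1 becomes UINT_MAX, so the test is true. */
  printf("x < y - 1 : %d\n", x < y - 1);

  /* 'x + 1 < y' does not wrap here: 1 < 0 is false. */
  printf("x + 1 < y : %d\n", x + 1 < y);

  return 0;
}
]]]

So for unsigned values the compiler is not allowed to rewrite one form
into the other, even though they look equivalent on paper.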


 Second question:

 +find_identical_prefix_slow(svn_boolean_t *reached_one_eof,
 +                           apr_off_t *prefix_lines,
 +                           struct file_info file[], apr_size_t file_len,
 +                           apr_pool_t *pool)
 +{
 +  svn_boolean_t had_cr = FALSE;
    svn_boolean_t is_match, reached_all_eof;
    apr_size_t i;
 +  apr_off_t lines = 0;

    *prefix_lines = 0;
     for (i = 1, is_match = TRUE; i < file_len; i++)
 -    is_match = is_match && *file[0]->curp == *file[i]->curp;
 +    is_match = is_match && *file[0].curp == *file[i].curp;
 +
    while (is_match)
      {
        /* ### TODO: see if we can take advantage of
           diff options like ignore_eol_style or ignore_space. */
        /* check for eol, and count */
 -      if (*file[0]->curp == '\r')
 +      if (*file[0].curp == '\r')
          {
 -          (*prefix_lines)++;
 +          lines++;

 Here you added a local variable 'lines', which is incremented in the
 function, and copied to *prefix_lines at the end (you do the same in
 find_identical_prefix_fast). The original code just sets and
 increments *prefix_lines directly. So, out of curiosity: why does
 this help, and how much?

 Incrementing the temporary variable requires one
 indirection less than the original code and uses only
 one instead of two registers (in typical cases).

 The 'how much?' part depends on compiler / optimizer
 and even more on the target platform (x86 has very
 few registers to keep data in fly; x64 has about twice
 as many).

Ok, this apparently sped it up for around 1% on my machine. So I
committed this in r1062275.
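
For completeness, the shape of that change in isolation (a hypothetical
sketch with invented names, not the actual diff_file.c code):

[[[
/* Variant 1: bump the output parameter directly; every increment goes
 * through the pointer. */
static void count_direct(const char *p, const char *end, long *out_lines)
{
  *out_lines = 0;
  while (p < end)
    if (*p++ == '\n')
      (*out_lines)++;   /* load/store through out_lines on every hit */
}

/* Variant 2: accumulate in a local that can stay in a register, and
 * store the result once at the end. */
static void count_local(const char *p, const char *end, long *out_lines)
{
  long lines = 0;
  while (p < end)
    if (*p++ == '\n')
      lines++;          /* plain register increment */
  *out_lines = lines;
}
]]]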

Thanks.
Cheers,
-- 
Johan


Re: My take on the diff-optimizations-bytes branch

2011-01-23 Thread Johan Corveleyn
On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 On 03.01.2011 02:14, Johan Corveleyn wrote:
 It would be interesting to see where the biggest gains
 are coming from (I'm guessing from the per-machine-word
 reading/comparing; I'd like to try that first, maybe together with the
 allocating of the file[] on the stack; I'd like to avoid
 special-casing file_len==2, splitting up functions in *_slow and
 *_fast, because it makes it a lot more difficult imho (so I'd only
 like to do that if it's a significant contributor)). But if you want
 to leave that as an exercise for the reader, that's fine as well :-).

 Exercise is certainly not a bad thing ;)

 But I think, the stack variable is certainly helpful
 and easy to do. While chunky operation gives
 a *big* boost, it is much more difficult to code if
 you need to compare multiple sources. It should
 be possible, though.

Ok, I finished the exercise :-).

- File info structures on stack instead of on heap: committed in
r1060625. Gives 10% boost.

- chunky operation for multiple sources: committed in r1062532. Gives
~250% boost on my machine.
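
The chunky comparison needs a cheap way to tell whether a whole machine
word contains an EOL byte. A possible implementation of such a check
(just a sketch using the textbook zero-byte trick and assuming apr.h for
apr_size_t; the actual contains_eol on the branch may well look
different) is:

[[[
/* Return non-zero if any byte of CHUNK equals '\r' or '\n'.
 * Assumes 8-bit bytes; works for 32- and 64-bit apr_size_t. */
static int
word_contains_eol(apr_size_t chunk)
{
  const apr_size_t ones = (apr_size_t)-1 / 0xFF;   /* 0x0101...01 */
  const apr_size_t highs = ones * 0x80;            /* 0x8080...80 */

  /* XOR turns bytes equal to the target character into zero bytes. */
  apr_size_t cr = chunk ^ (ones * '\r');
  apr_size_t lf = chunk ^ (ones * '\n');

  /* (x - 0x01...01) & ~x & 0x80...80 is non-zero iff x has a zero byte. */
  return (((cr - ones) & ~cr & highs)
          | ((lf - ones) & ~lf & highs)) != 0;
}
]]]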

For suffix scanning, I made a couple of changes to the logic you sent
in your patch (as you remember, there was a problem with the
implementation of suffix scanning [1]). It was (in
scan_backward_fast):

[[[
+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+  if (   (file[0].curp - sizeof(apr_size_t) < file[0].curp)
+      && (file[1].curp - sizeof(apr_size_t) < file[1].curp))
+{
+  do
+{
+  file[0].curp -= sizeof(apr_size_t);
+  file[1].curp -= sizeof(apr_size_t);
+}
+      while (   (file[0].curp >= min_curp0)
+             && (file[1].curp >= min_curp1)
+             && (   *(const apr_size_t*)(file[0].curp + 1)
+                 == *(const apr_size_t*)(file[1].curp + 1)));
+
+  file[0].curp += sizeof(apr_size_t);
+  file[1].curp += sizeof(apr_size_t);
+}
]]]

I changed this to (the N-file equivalent of):
[[[
+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+  if (   (file[0].curp - sizeof(apr_size_t) >= min_curp0)
+      && (file[1].curp - sizeof(apr_size_t) >= min_curp1))
+{
+  do
+{
+  file[0].curp -= sizeof(apr_size_t);
+  file[1].curp -= sizeof(apr_size_t);
+}
+      while (   (file[0].curp - sizeof(apr_size_t) >= min_curp0)
+             && (file[1].curp - sizeof(apr_size_t) >= min_curp1)
+             && (   *(const apr_size_t*)(file[0].curp + 1)
+                 == *(const apr_size_t*)(file[1].curp + 1)));
+
+  file[0].curp += sizeof(apr_size_t);
+  file[1].curp += sizeof(apr_size_t);
+}
]]]

(So, I changed the if-comparisons before the do-while loop to make sure
there is room to subtract a sizeof(apr_size_t) (the original check
didn't really make sense to me), and changed the comparisons with
min_curpX in the while condition so they check whether there is still
room to subtract a sizeof(apr_size_t).)

After that, all tests went fine.


Now, you suggested in a previous post in this thread that hardcoding
file_len to 2 should speed it up with up to 50%. And you are correct
(on my machine, it's sped up by ~40% if I replace all file_len's in
find_identical_* with 2).

So, that leads me to conclude that I should probably throw away the
generic code, and replace it with 3 variants: find_identical_*2,
find_identical_*3 and find_identical_*4 (once I write *2, the rest
should be mostly copy-n-paste). As you said, this should also vastly
simplify the code (drop the for loops, simpler conditionals, ...).
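
To give an idea of what the hardcoded two-file variant could look like,
here is a rough byte-wise sketch (invented helper name, heavily
simplified: no chunky ops, no \r\n handling, no EOF flags; assumes the
usual APR types are available):

[[[
/* Advance both pointers over the identical prefix of exactly two files,
 * counting complete newline-terminated lines.  Afterwards the pointers
 * are rewound to the start of the first non-identical line. */
static void
scan_identical_prefix_2(const char **curp0, const char *endp0,
                        const char **curp1, const char *endp1,
                        apr_off_t *prefix_lines)
{
  const char *p0 = *curp0;
  const char *p1 = *curp1;
  const char *line_start0 = p0;
  const char *line_start1 = p1;
  apr_off_t lines = 0;

  while (p0 < endp0 && p1 < endp1 && *p0 == *p1)
    {
      if (*p0 == '\n')
        {
          lines++;
          line_start0 = p0 + 1;
          line_start1 = p1 + 1;
        }
      p0++;
      p1++;
    }

  /* Only whole identical lines may be stripped as prefix. */
  *curp0 = line_start0;
  *curp1 = line_start1;
  *prefix_lines = lines;
}
]]]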

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2011-01/0022.shtml


Re: My take on the diff-optimizations-bytes branch

2011-01-23 Thread Johan Corveleyn
On Sun, Jan 23, 2011 at 10:46 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de wrote:
 On 03.01.2011 02:14, Johan Corveleyn wrote:
 It would be interesting to see where the biggest gains
 are coming from (I'm guessing from the per-machine-word
 reading/comparing; I'd like to try that first, maybe together with the
 allocating of the file[] on the stack; I'd like to avoid
 special-casing file_len==2, splitting up functions in *_slow and
 *_fast, because it makes it a lot more difficult imho (so I'd only
 like to do that if it's a significant contributor)). But if you want
 to leave that as an exercise for the reader, that's fine as well :-).

 Exercise is certainly not a bad thing ;)

 But I think, the stack variable is certainly helpful
 and easy to do. While chunky operation gives
 a *big* boost, it is much more difficult to code if
 you need to compare multiple sources. It should
 be possible, though.

 Ok, I finished the exercise :-).

 - File info structures on stack instead of on heap: committed in
 r1060625. Gives 10% boost.

 - chunky operation for multiple sources: committed in r1062532. Gives
 ~250% boost on my machine.

Oops, as some people pointed out on IRC, that should be ~60% (I meant
2.5 times faster, going from ~50s to ~20s, so that's about ~60% less
time spent in datasources_open).

-- 
Johan


Assertion failure during update_tests.py 58 (XFAIL: update a nonexistent child of a copied dir)

2011-01-23 Thread Johan Corveleyn
Hi,

Already for some time now, update_tests.py 58 (XFAIL: update a
nonexistent child of a copied dir) crashes on my machine:

svn: In file '..\..\..\subversion\libsvn_wc\update_editor.c' line
4877: assertion failed (repos_root != NULL && repos_uuid != NULL)

I understand that this test is XFAIL, that this isn't addressed yet,
but is it supposed to fail an assert?

On my system (Win XP) this causes an ugly popup to appear (which I
need to click away to continue), each time I run the full test suite
("This application has requested the Runtime to terminate it in an
unusual way...").

Relevant excerpt from tests.log in attachment (this was with trunk@1062600).

Cheers,
-- 
Johan


Re: FSFS format 6

2011-01-23 Thread Johan Corveleyn
On Wed, Dec 29, 2010 at 8:37 PM, Stefan Fuhrmann eq...@web.de wrote:
 On 29.12.2010 01:58, Johan Corveleyn wrote:

 On Sun, Dec 12, 2010 at 4:23 PM, Stefan Fuhrmann
 stefanfuhrm...@alice-dsl.de  wrote:

 On 19.10.2010 15:10, Daniel Shahaf wrote:

 Greg Stein wrote on Tue, Oct 19, 2010 at 04:31:42 -0400:

 Personally, I see [FSv2] as a broad swath of API changes to align our
 needs with the underlying storage. Trowbridge noted that our current
 API makes it *really* difficult to implement an effective backend. I'd
 also like to see a backend that allows for parallel PUTs during the
 commit process. Hyrum sees FSv2 as some kind of super-key-value
 storage with layers on top, allowing for various types of high-scaling
 mechanisms.

 At the retreat, stefan2 also had some thoughts about this...

 [This is just a brain-dump for 1.8+]

 While working on the performance branch I made some
 observations concerning the way FSFS organizes data
 and how that could be changed for reduced I/O overhead.

 notes/fsfs-improvements.txt contains a summary of that
 could be done to improve FSFS before FS-NG. A later
 FS-NG implementation should then still benefit from the
 improvements.

 +(number of fopen calls during a log operation)

 I like this proposal a lot. As I already told before, we are running
 our FSFS back-end on a SAN over NFS (and I suspect we're not the only
 company doing this). In this environment, the server-side I/O of SVN
 (especially the amount of random reads and fopen calls during e.g.
 log) is often the major bottleneck.

 There is one question going around in my head though: won't you have
 to change/rearrange a lot of the FS layer code (and maybe repos
 layer?) to benefit from this new format?

 Maybe. But as far as I understand the current
 FSFS structure, data access is mainly chasing
 pointers, i.e. reading relative or absolute byte
 offsets and moving there for the next piece of
 information. If everything goes well, none of that
 code needs to change; the revision packing
 algorithm will simply produce different offset
 values.

 The current code is written in a certain way, not particularly
 optimized for this new format (I seem to remember log does around 10
 fopen calls for every interesting rev file, each time reading a
 different part of it). Also, if an operation currently needs to access
 many revisions (like log or blame), it doesn't take advantage at all
 of the fact that they might be in a single packed rev file. The pack
 file is opened and seeked in just as much as the sum of the individual
 rev files.

 The fopen() calls should be eliminated by the
 file handle cache. IOW, they should already be
 addressed on the performance branch. Please
 let me know if that is not the case.

Ok, finally got around to verifying this.

You are completely correct: the performance branch avoids the vast
amount of repeated fopen() calls. With a simple test (testfile with 3
revisions, executing svn log of it) (note: this is an unpacked 1.6
repository):

- trunk: opens each rev file between 19 and 21 times.

- performance branch: opens each rev file 2 times.

(I don't know why it's not simply 1 time, but ok, 2 times is already a
factor 10 better than trunk :-)).

I tested this simply by adding one line of printf instrumentation
inside libsvn_subr/io.c#svn_io_file_open (see patch in attachment, as
well as the output for trunk and for perf-branch).

Now, if only that file-handle cache could be merged to trunk :-) ...

Cheers,
-- 
Johan
Index: subversion/libsvn_subr/io.c
===
--- subversion/libsvn_subr/io.c (revision 1062600)
+++ subversion/libsvn_subr/io.c (working copy)
@@ -2786,6 +2786,7 @@ svn_io_file_open(apr_file_t **new_file, const char
   const char *fname_apr;
   apr_status_t status;
 
+  fprintf(stderr, "  OPEN(%d) %s\n", flag, fname);
   SVN_ERR(cstring_from_utf8(fname_apr, fname, pool));
   status = file_open(new_file, fname_apr, flag | APR_BINARY, perm, TRUE,
  pool);
$ ../../client_build/svn_branch_performance/dist/bin/svnserve -d -r c:/research/svn/experiment/repos
 OPEN(1) C:/research/svn/experiment/repos/format
 OPEN(129) C:/research/svn/experiment/repos/db/fs-type
 OPEN(129) C:/research/svn/experiment/repos/db/fs-type
 OPEN(129) C:/research/svn/experiment/repos/db/format
 OPEN(129) C:/research/svn/experiment/repos/db/uuid
 OPEN(129) C:/research/svn/experiment/repos/db/min-unpacked-rev
 OPEN(161) C:/research/svn/experiment/repos/db/fsfs.conf
 OPEN(129) C:/research/svn/experiment/repos/db/min-unpacked-revprop
 OPEN(129) C:/research/svn/experiment/repos/db/current
 OPEN(161) C:/research/svn/experiment/repos/conf/svnserve.conf
 OPEN(129) C:/research/svn/experiment/repos/db/revs/0/0
 OPEN(129) C:/research/svn/experiment/repos/db/current
 OPEN(129

Re: My take on the diff-optimizations-bytes branch

2011-01-25 Thread Johan Corveleyn
On Tue, Jan 25, 2011 at 1:37 AM, Stefan Fuhrmann eq...@web.de wrote:
[ ... snip ...]
 And, as promised, here some ideas how to get more
 speed from the generic code. Your latest commit:

 +#if SVN_UNALIGNED_ACCESS_IS_OK
 +
 +      /* Skip quickly over the stuff between EOLs. */
 +      for (i = 0, can_read_word = TRUE; i < file_len; i++)
 +        can_read_word = can_read_word
 +                        && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
 +      while (can_read_word)
 +        {
 +          for (i = 1, is_match = TRUE; i < file_len; i++)
 +            is_match = is_match
 +                       && (   *(const apr_size_t *)file[0].curp
 +                           == *(const apr_size_t *)file[i].curp);
 +
 +          if (!is_match || contains_eol(*(const apr_size_t *)file[0].curp))
 +            break;
 +
 +          for (i = 0; i < file_len; i++)
 +            file[i].curp += sizeof(apr_size_t);
 +          for (i = 0, can_read_word = TRUE; i < file_len; i++)
 +            can_read_word = can_read_word
 +                            && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
 +        }
 +
 +#endif

 could be changed to something like the following.
 Please note that I haven't tested any of this:

Thanks. There was one error in your suggestion, which I found out
after testing. See below.

 /* Determine how far we may advance with chunky ops without reaching
  * endp for any of the files.
  * Signedness is important here if curp gets close to endp.
  */
 apr_ssize_t max_delta = file[0].endp - file[0].curp - sizeof(apr_size_t);
 for (i = 1; i < file_len; i++)
 {
    apr_ssize_t delta = file[i].endp - file[i].curp - sizeof(apr_size_t);
    if (delta < max_delta)
        max_delta = delta;
 }

 /* the former while() loop */
 is_match = TRUE;
 for (delta = 0; delta < max_delta && is_match; delta += sizeof(apr_size_t))
 {
    apr_size_t chunk = *(const apr_size_t *)(file[0].curp + delta);
    if (contains_eol(chunk))
        break;

    for (i = 1; i  file_len; i++)
        if (chunk != *(const apr_size_t *)(file[i].curp + delta))
        {
            is_match = FALSE;

Here, I inserted:

delta -= sizeof(apr_size_t);

because otherwise, delta will be increased too far (it will still be
increased by the counting expression of the outer for-loop (after
which it will stop because of !is_match)). Maybe there is a
cleaner/clearer way to break out of the outer for-loop here, without
incrementing delta again, but for now, I've committed it with this
change (r1063565).

            break;
        }
 }

 /* We either found a mismatch or an EOL at or shortly behind curp+delta
  * or we cannot proceed with chunky ops without exceeding endp.
  * In any way, everything up to curp + delta is equal and not an EOL.
  */
 for (i = 0; i < file_len; i++)
    file[i].curp += delta;

Thanks. This gives on my machine/example another 15-20% performance
increase (datasources_open time going down from ~21 s to ~17 s). We
should probably do the same for suffix scanning, but I'm too tired
right now :-) (and suffix scanning is more difficult to grok, so not a
good idea to do at 3 am).
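
By the way, regarding the "cleaner way to break out of the outer
for-loop" question above: one option (an untested sketch, reusing the
variables of the committed code plus an invented word_ok flag) is to
move the increment out of the for-statement and only advance delta
after a whole word has been verified:

[[[
  apr_ssize_t delta = 0;
  svn_boolean_t word_ok = TRUE;

  while (word_ok && delta < max_delta)
    {
      apr_size_t chunk = *(const apr_size_t *)(file[0].curp + delta);

      if (contains_eol(chunk))
        break;

      for (i = 1; i < file_len; i++)
        if (chunk != *(const apr_size_t *)(file[i].curp + delta))
          {
            word_ok = FALSE;           /* mismatching word: do not count it */
            break;
          }

      if (word_ok)
        delta += sizeof(apr_size_t);   /* the whole word matched, EOL-free */
    }

  for (i = 0; i < file_len; i++)
    file[i].curp += delta;
]]]

That way delta is only ever counted for fully verified words, so no
compensation is needed when a mismatch is found.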

Cheers,
-- 
Johan


Re: Status of the branch diff-optimizations-bytes branch

2011-01-25 Thread Johan Corveleyn
On Tue, Jan 25, 2011 at 4:58 PM, Hyrum K Wright hy...@hyrumwright.org wrote:
 Johan (and other interested parties),
 I've been following some of the commits to the
 diff-optimizations-branch with interest.  While I've not reviewed them
 for technical merit, it appears that others have, and that there is
 good work going on in the branch.

Thanks for your interest.

It might be good to get a little bit more review on the whole, I
think. A lot of people have read parts of the code, but if I remember
correctly most (if not all) of them have said things like I haven't
reviewed it in detail, but here are some suggestions, feedback, 

 I'm wondering what the potential
 for merging back to trunk is.  This is the TODO section from the
 BRANCH-README:

 TODO:

  * eliminate identical prefix                 [DONE]
  * eliminate identical suffix                 [DONE]
  * make diff3 and diff4 use datasources_open  [DONE]
     (this may eliminate the need for datasource_open, and the flag
      datasource_opened in token.c#svn_diff__get_tokens)
  * implement (part of) increment_pointers and
    decrement_pointers with a macro            [DONE]
     (offers another 10% speedup)
  * integrate (some of the) low-level optimizations for prefix/suffix
    scanning, proposed by stefan2 [2]          []
  * revv svn_diff.h#svn_diff_fns_t             []

 It looks like, for the most part, any destabilizing functionality is
 completed, and what remains are simply optimizations.  This
 optimization work can probably be performed on trunk, and if so, we
 should merge the branch back and do the cleanup work there.  The only
 bit on the current list of stuff is revving svn_diff_fns_t.  Can that
 work be parallelized?

Yes, you are correct. Most of the essential work is done. Further
optimizations can just as well be done on trunk.

Revving svn_diff_fns_t: what do you mean with parallelizing it? I must
admit that I don't really know (yet) how to go about that. Very early
during the branch work, danielsh pointed out that I modified this
public struct (vtable for reading data from datasources), so it should
be revved. Is it listed/documented somewhere what should be done for
that (community guide perhaps)?

(one slightly related note to this: I introduced the function
datasources_open, to open the multiple datasources at once (as opposed
to the original datasource_open, which only opens one datasource). But
maybe the new function should be renamed to datasource_open_all or
datasources_open_all or something, to make it stand out a little bit
more).

 I'm not trying to rush the work, nor do I think it is essential for
 1.7, but it feels like a pretty good performance increase, and one
 that we shouldn't have any problem shipping.  (If there are currently
 API uncertainties, than it would be good to wait until 1.7.x branches,
 though.)

Yes, I think it's quite ready to be merged, and could provide a very
significant performance increase (for diff, merge and blame).

The API is stable now, I think, except maybe for the name of the
datasources_open function (see above). If we decide to go (for
optimization reasons) for specialized prefix/suffix scanning
functions for 2, 3 or 4 datasources, I think it's best to keep the
generic datasources_open API (with an array of datasources), and only
split up between specialized versions in the implementation.
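
Concretely, such a split could keep one internal entry point that merely
dispatches on file_len, roughly like this (a hypothetical sketch: the
signature is abbreviated and the _2/_3/_4/_generic helpers are invented
names, they don't exist yet):

[[[
static svn_error_t *
find_identical_prefix(svn_boolean_t *reached_one_eof,
                      apr_off_t *prefix_lines,
                      struct file_info file[], apr_size_t file_len,
                      apr_pool_t *pool)
{
  switch (file_len)
    {
      case 2:
        return find_identical_prefix_2(reached_one_eof, prefix_lines,
                                       file, pool);
      case 3:
        return find_identical_prefix_3(reached_one_eof, prefix_lines,
                                       file, pool);
      case 4:
        return find_identical_prefix_4(reached_one_eof, prefix_lines,
                                       file, pool);
      default:
        /* Fall back to the generic loop-based implementation. */
        return find_identical_prefix_generic(reached_one_eof, prefix_lines,
                                             file, file_len, pool);
    }
}
]]]

Callers (and the public svn_diff API) would not notice the difference.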

 Anyway, I just really like the concept / implementation of the branch,
 and I'm excited to get it into the hands of users.  Thoughts?

 -Hyrum


Cheers,
-- 
Johan

