Re: svn commit: r1915519 [1/4] - in /subversion/branches/pristine-checksum-salt: ./ build/ build/ac-macros/ build/generator/ build/generator/swig/ contrib/client-side/svn_load_dirs/ contrib/hook-scrip

2024-02-03 Thread Evgeny Kotkov via dev
Daniel Sahlberg  writes:

> Log:
> On branch pristine-checksum-salt:
>
> Catchup merge with trunk

`pristine-checksum-salt` is branched off from branches/pristine-checksum-kind
(rather than from trunk), so in this case it would probably be better to
merge the trunk changes into the parent branch first, and then propagate them
into the `pristine-checksum-salt` branch from the parent.


Thanks,
Evgeny Kotkov


Re: Changing the permission checks in libsvn_subr/io.c

2024-02-03 Thread Evgeny Kotkov via dev
Daniel Sahlberg  writes:

>> Index: subversion/libsvn_subr/io.c
>> ===
>> --- subversion/libsvn_subr/io.c (revision 1915064)
>> +++ subversion/libsvn_subr/io.c (working copy)
>> @@ -2535,7 +2535,14 @@ svn_io__is_finfo_read_only(svn_boolean_t *read_onl
>>apr_uid_t uid;
>>apr_gid_t gid;
>>
>> -  *read_only = FALSE;
>> +  *read_only = (access(file_info->fname, W_OK) != 0);

This approach is problematic, because it adds an I/O syscall (one that
checks access to a file by its name) to a fairly low-level function.

Previously, this function analyzed the file information from the passed-in
apr_finfo_t structure.  The new version, however, performs an I/O call,
which leads to a couple of problems:

1) A potential performance regression, not limited to just the revert
   operation.  For example, all calls to open_pack_or_rev_file() in
   libsvn_fs_fs would now start making additional access() calls, thus
   slowing down repository operations.

2) This adds an opportunity for a desynchronization/race between the
   information in apr_finfo_t and the result of access(), because access()
   works on a filename.  For example, if the file is recreated between
   the two calls, the result of access() will correspond to a completely
   different file than the one that produced the apr_finfo_t.
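The race in (2) is easy to reproduce outside of Subversion.  Here is a
minimal Python sketch (illustrative only; the real code is C operating on an
apr_finfo_t): a name-based access() check can disagree with previously
captured stat() metadata when the file is replaced in between:

```python
import os
import stat
import tempfile

# Create a read-only file and capture its metadata (the apr_finfo_t analogue).
path = os.path.join(tempfile.mkdtemp(), "f")
with open(path, "w") as f:
    f.write("original")
os.chmod(path, stat.S_IRUSR)            # read-only for the owner
finfo = os.stat(path)                   # metadata snapshot taken here

# The file is recreated between the stat() and the access() call...
os.remove(path)
with open(path, "w") as f:              # new, writable file, same name
    f.write("replacement")

# ...so a name-based check now describes a *different* file:
read_only_by_name = not os.access(path, os.W_OK)         # checks the new file
read_only_by_finfo = not (finfo.st_mode & stat.S_IWUSR)  # describes the old one

print(read_only_by_name, read_only_by_finfo)             # → False True
```

The metadata-based answer is at least consistent with the finfo that the
caller passed in; the name-based answer silently refers to whatever currently
occupies that path.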

Overall, I have a feeling that solving this specific edge-case might not be
worth introducing these regressions (which look significant, at least to me).


Thanks,
Evgeny Kotkov


Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

2024-01-18 Thread Evgeny Kotkov via dev
Daniel Sahlberg  writes:

> As far as I understand, the point of multi-hash is to keep the WC format
> between versions (so older clients can continue to use the WC).

Just as a minor note, working copies created using the implementation on
the `pristine-checksum-salt` branch don't multi-hash the contents; rather,
the branch makes the [single] checksum kind in use configurable and persists
it at the moment a working copy is created or upgraded.

> I need some help to understand how that would work in practice. Let's say
> that 1.15 adds SHAABC, 1.16 adds SHAXYZ. Then 1.17 drops SHA1. But...
> - A 1.17 client will only use SHAABC or SHAXYZ hashes.
> - A 1.16 client can use SHA1, SHAABC and SHAXYZ hashes.
> - A 1.15 client can only use SHA1 and SHAABC hashes.
>
> How can these work together? A WC created in 1.17 can't be used by a 1.15
> client and a WC created in 1.15 (with SHA1) can't be used by a 1.17 client.
> How is this different from bumping the format? How do we detect this?

In the current design available on the `pristine-checksum-salt` branch, the
supported checksum kinds are tied to a working copy format, and any supported
checksum kind may additionally use a dynamic salt.  For example, format 33
supports only SHA-1 (regular or dynamically salted), but a newer format 34
can add support for another checksum kind such as SHA-2 if necessary.

When an existing working copy is upgraded to a newer format, its current
checksum kind is retained as is (we can't rehash the content in a
`--store-pristine=no` case because the pristines are not available).
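As a rough sketch of the dynamically salted scheme described above (Python
with hypothetical names; the branch implements this in C, persisting the salt
among the working copy's settings in wc.db): a random salt is generated once
when the working copy is created and is mixed into every pristine checksum,
so a collision pre-computed against plain SHA-1 does not carry over to any
particular working copy:

```python
import hashlib
import os

class WorkingCopy:
    """Toy model: the salt is fixed at creation time and persisted."""

    def __init__(self):
        # Generated once when wc.db is created, then stored alongside the
        # working copy settings so every later hash uses the same salt.
        self.pristine_salt = os.urandom(16)

    def pristine_checksum(self, content: bytes) -> str:
        h = hashlib.sha1()
        h.update(self.pristine_salt)   # per-working-copy salt first
        h.update(content)
        return h.hexdigest()

wc1, wc2 = WorkingCopy(), WorkingCopy()
data = b"example pristine content"

# Stable within one working copy...
assert wc1.pristine_checksum(data) == wc1.pristine_checksum(data)
# ...but differs across working copies, and from plain SHA-1:
print(wc1.pristine_checksum(data) != wc2.pristine_checksum(data))
print(wc1.pristine_checksum(data) != hashlib.sha1(data).hexdigest())
```

This also shows why the salt must be retained as-is on upgrade: rehashing
under a new salt would require the pristine contents, which are not available
in the `--store-pristine=no` case.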

I don't know whether we'll find ourselves having to forcefully phase out
SHA-1 *even* for working copies that retain an older checksum kind; it
might be enough to use the new checksum kind only for freshly created
working copies.  If we do, though, there would be a few options to consider:

I think that milder options could include warning the user to check out a
new working copy (that would use a different checksum kind), and a harsher
option could mean adding a new format that doesn't support SHA-1 under
any circumstances, and declaring all previously available working copy
formats unsupported.


Regards,
Evgeny Kotkov


Re: Getting to first release of pristines-on-demand feature (#525).

2024-01-18 Thread Evgeny Kotkov via dev
Evgeny Kotkov  writes:

> Merged in https://svn.apache.org/r1905955
>
> I'm going to respond on the topic of SHA1 a bit later.

For the record: thread [1] proposes the `pristine-checksum-salt` branch that
adds the infrastructure to support new pristine checksum kinds in the working
copy and makes a switch to the dynamically-salted SHA1.

From the technical standpoint, I think that it would be better to release
the first version of the pristines-on-demand feature having this branch
merged, because now we rely on the checksum comparison to determine if a
file has changed — and currently it's a checksum kind with known collisions.

At the same time, having that branch merged probably isn't a formal release
blocker for the pristines-on-demand feature.  Also, considering that the
`pristine-checksum-salt` branch is currently vetoed by danielsh (presumably,
for an indefinite period of time), I'd like to note that personally I have
no objections to proceeding with a release of the pristines-on-demand
feature without this branch.

[1] https://lists.apache.org/thread/xmd7x6bx2mrrbw7k5jr1tdmhhrlr9ljc


Regards,
Evgeny Kotkov


Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

2024-01-18 Thread Evgeny Kotkov via dev
Daniel Shahaf  writes:

> Procedurally, the long hiatus is counterproductive.

This reminds me that the substantive discussion of your veto ended with my
email from 8 Feb 2023 that had four direct questions to you and was left
without an answer:

``
  > That's not how design discussions work.  A design discussion doesn't go
  > "state decision; state pros; implement"; it goes "state problem; discuss
  > potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).

  Well, I think it may not be as simple as it seems to you.  Who decided that
  we should follow the process you're describing?  Is there a thread with a
  consensus on this topic?  Or do you insist on using this specific process
  because it's the only process that seems obvious to you?  What alternatives
  to it have been considered?

  As far as I can tell, the process you're suggesting is effectively a
  waterfall-like process, and there are quite a lot of concerns about its
  effectiveness, because the decisions have to be made in the conditions of
  a lack of information.
``

It's been more than 11 months since that email, and those questions still
don't have an answer.  So if we are to resume this discussion, let's do it
from the proper point.

> You guys are welcome to try to /convince/ me to change my opinion, or to
> have the veto invalidated.  In either case, you will be more likely to
> succeed should your arguments relate not only to the veto's implications
> but also to its /sine qua non/ component: its rationale.

Just in case, my personal opinion here is that the veto is invalid.

Firstly, based on my understanding, the ASF rules prohibit casting a veto
without an appropriate technical justification (see [1], which I personally
agree with).  Secondly, it seems that the process you are imposing hasn't been
accepted in this community.  As far as I know, this topic was tangentially
discussed before (see [2], for example), and it looks like there hasn't been
a consensus to change our current Commit-Then-Review process into some
sort of Review-Then-Commit.

(At the same time I won't even try to /convince/ you, sorry.)

[1] https://www.apache.org/foundation/voting.html
[2] https://lists.apache.org/thread/ow2x68g2k4lv2ycr81d14p8r8w2jj1xl


Regards,
Evgeny Kotkov


Re: Subversion 1.14.3 up for testing/signing

2023-12-12 Thread Evgeny Kotkov
Nathan Hartman  writes:

> The 1.14.3 release artifacts are now available for testing/signing.
> Please get the tarballs from
>   https://dist.apache.org/repos/dist/dev/subversion
> and add your signatures there.

Summary:
+1 to release (Windows)

Tested:
[ fsfs ] x [ http v1 | http v2 ]

Results:
All tests passed

Platform:
Windows 10 x64 (Version 22H2)
Visual Studio Build Tools 2022 (Version 17.7.7)

Dependencies:
apr 1.6.5
apr-util 1.6.3
httpd 2.4.58
openssl 3.0.12
serf 1.3.10
sqlite 3.41.2
zlib 1.2.12

Committed file signature in r66017.


Thanks,
Evgeny Kotkov


Re: E160056 Item index too large in revision

2023-04-12 Thread Evgeny Kotkov via dev
Daniel Sahlberg  writes:

> In the TortoiseSVN-dev group it is claimed to be because of APR 1.7.3 and
> that reverting back to 1.7.2 restores normal operation, I haven't yet
> verified nor tried to figure out which of the changes between 1.7.2
> and 1.7.3 may be responsible for this issue. As I know some of the
> APR committers are also here, I'm hoping for some help.

The problem seems to be caused by a change in APR 1.7.3, authored by me
in an attempt to fix wrong file offsets reported after opening files on
Windows and Unix:

 *) Don't seek to the end when opening files with APR_FOPEN_APPEND on Windows.

Unfortunately, I missed a case where flushing an APR_APPEND | APR_BUFFERED
file was implicitly relying on that specific file offset, and that is what
caused the regression you're seeing.
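For context, the POSIX append-mode behaviour involved here can be
demonstrated with a small Python sketch (illustrative only; APR's buffered
file layer is C): with O_APPEND, every write() goes to the end of the file
regardless of the current seek offset, so buffered code that derives "where
the next write lands" from the seek offset can get out of sync:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "log")
with open(path, "wb") as f:
    f.write(b"12345")                  # file now holds 5 bytes

fd = os.open(path, os.O_WRONLY | os.O_APPEND)
os.lseek(fd, 0, os.SEEK_SET)           # the seek offset says 0...
offset_before = os.lseek(fd, 0, os.SEEK_CUR)
os.write(fd, b"X")                     # ...but O_APPEND writes at the end
os.close(fd)

with open(path, "rb") as f:
    data = f.read()

print(offset_before, data)             # → 0 b'12345X'
```

Any layer that assumed the write would land at the reported offset 0 would
compute the wrong file positions, which is the kind of implicit dependency
the reverted change ran into.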

I have now reverted this change in https://svn.apache.org/r1909088 and
https://svn.apache.org/r1909089, so hopefully this regression will be fixed
in the next APR 1.7.x patch release.


Thanks,
Evgeny Kotkov


Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

2023-04-01 Thread Evgeny Kotkov via dev
Daniel Shahaf  writes:

> What's the question or action item to/for me?  Thanks.

I'm afraid I don't fully understand your question.  As you probably remember,
the change is blocked by your veto.  To my knowledge, this veto hasn't been
revoked as of now, and I simply mentioned that in my email.  It is entirely
your decision whether or not to take any action regarding this matter.


Thanks,
Evgeny Kotkov


Re: [PROPOSAL] Allow plaintext passwords again.

2023-03-29 Thread Evgeny Kotkov via dev
Nathan Hartman  writes:

> I think a good middle ground is:
>
> * Build with --enable-plaintext-password-storage by default; users who
>   want to harden their system can do so, but will need to build their
>   own client.

+1.

> * Set the default run-time config to store-plaintext-passwords = no
>   (if it isn't already; haven't checked) and instruct users on how to
>   change it. This makes the decision to store in plaintext an explicit
>   one that the user can opt into. (I appreciate that this could be
>   changed without the user's knowledge; perhaps the systemwide config
>   should always take precedence over the user-controlled one for this
>   setting?)

So, apparently, the current default is "ask".

I haven't checked all the details, but I think that defaulting to "ask"
already makes the user decision explicit and allows it to happen naturally,
without requiring any additional instructions or knowledge.

If we change the default to "no", this part of the experience could become
worse, because to end users it might look like the credentials aren't being
stored for unknown reasons, or due to a bug in the software.


Thanks,
Evgeny Kotkov


Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

2023-03-22 Thread Evgeny Kotkov via dev
Evgeny Kotkov  writes:

> > Now, how hard would this be to actually implement?
>
> To have a more or less accurate estimate, I went ahead and prepared the
> first-cut implementation of an approach that makes the pristine checksum
> kind configurable in a working copy.
>
> The current implementation passes all tests in my environment and seems to
> work in practice.  It is available on the branch:
>
>   https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind
>
> The implementation on the branch allows creating working copies that use a
> checksum kind other than SHA-1.

I extended the current implementation to use a dynamically salted SHA-1
checksum, rather than a SHA-1 with a statically hardcoded salt.
The dynamic salt is generated during the creation of a wc.db.

The implementation is available on a separate branch:

  https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-salt

The change is fairly large, but I think that it should solve the potential
problem without any practical drawbacks, except for the loss of the
ra_serf fetch optimization mentioned earlier.

So overall I'd propose to bring this change to trunk, to improve the current
state around checksum collisions in the working copy, and to also have the
infrastructure for supporting different checksum kinds in place, in case
we need it in the future.

This change is still blocked by a veto, but if danielsh changes his mind
and there are no other objections, I'm ready to complete the few remaining
bits and merge it to trunk.


Thanks,
Evgeny Kotkov


Re: svn commit: r1907965 - in /subversion/trunk/subversion: include/ include/private/ libsvn_client/ libsvn_wc/ svn/ tests/cmdline/

2023-03-03 Thread Evgeny Kotkov via dev
Jun Omae  writes:

> After r1907965, build on Windows is failing with the following errors.
>
> [[[
> svn_client-1.lib(upgrade.obj) : error LNK2001: unresolved external symbol 
> svn_wc__version_string_from_format 
> [D:\a\subversion\subversion\build\win32\vcnet-vcproj\libsvn_client_dll.vcxproj]
> D:\a\subversion\subversion\Release\subversion\libsvn_client\libsvn_client-1.dll
>  : fatal error LNK1120: 1 unresolved externals 
> [D:\a\subversion\subversion\build\win32\vcnet-vcproj\libsvn_client_dll.vcxproj]
> ]]]

Thanks, should be fixed in r1908042.


Regards,
Evgeny Kotkov


Re: Initial patch for storing the pristines-on-demand setting in the working copy (Issue 4889)

2023-03-02 Thread Evgeny Kotkov via dev
Evgeny Kotkov  writes:

> While working on the patch, I have stumbled across a couple of issues:
>
> A) `svn upgrade` without arguments fails for a working copy with latest format
>
>   $ svn checkout --compatible-version=1.15 wc
>   $ svn upgrade wc
>   $ svn: E155021: Working copy '…' is already at version 1.15 (format 32)
> and cannot be downgraded to version 1.8 (format 31)

While the work on pristine checksum kinds is blocked by a veto, I took a
look at other things we'd probably want to handle before the release.

I think that the above case qualifies as such, so I made a related improvement
by introducing a new config option and also fixed the described error:

- r1907964 introduces a new `compatible-version` config setting, to allow
  configuring the desired default wc compatibility level globally.

- r1907965 fixes an error when `svn upgrade` is called without any arguments
  for a working copy of the newer format.


Thanks,
Evgeny Kotkov


Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

2023-02-07 Thread Evgeny Kotkov via dev
Daniel Shahaf  writes:

> Look, it's pretty simple.  You said "We should do Y because it
> addresses X".  You didn't explain why X needs to be addressed, didn't
> consider what alternatives there are to Y, didn't consider any cons that
> Y may have… and when people had questions, you just began to
> implement Y, without responding to or even acknowledging those
> questions.
>
> That's not how design discussions work.  A design discussion doesn't go
> "state decision; state pros; implement"; it goes "state problem; discuss
> potential solutions, pros, cons; decide; implement" (cf. [4, 5, 6]).

Well, I think it may not be as simple as it seems to you.  Who decided that
we should follow the process you're describing?  Is there a thread with a
consensus on this topic?  Or do you insist on using this specific process
because it's the only process that seems obvious to you?  What alternatives
to it have been considered?

As far as I can tell, the process you're suggesting is effectively a
waterfall-like process, and there are quite a lot of concerns about its
effectiveness, because the decisions have to be made in the conditions of
a lack of information.

Personally, I prefer an alternative process that starts from finding out
all available bits of information, which are then used to make informed
decisions.  The unfortunate reality, however, is that the only guaranteed
way of collecting all information means implementing all (or almost all)
significant parts in code.  Roughly speaking, this process looks like a
research project that gets completed by trial and error.

Based on what you've been saying so far, I wouldn't be surprised if you
disagree.  But I still think that forcing the others to follow a certain
process by such means as vetoing a code change is maybe a bit over the
top.  (In the meantime, I certainly won't object if you're going to use this
waterfall-like process for the changes that you implement yourself.)


Regards,
Evgeny Kotkov


Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

2023-01-29 Thread Evgeny Kotkov via dev
Daniel Shahaf  writes:

> > (I'm not saying that the above rules have to be used in this particular case
> >  and that a veto is invalid, but still thought it’s worth mentioning.)
> >
>
> I vetoed the change because it hadn't been designed on the dev@ list,
> had not garnered dev@'s consensus, and was being railroaded through.
> (as far as I could tell)

I have *absolutely* no idea where "being railroaded through" comes from.
Really, that is the wrong way to portray and think about the events that
have happened so far.

Reiterating over those events: I wrote an email containing my thoughts
and explaining the motivation for such change.  I didn't reply to some of
the questions (including some tricky questions, such as the one featuring
a theoretical hash function), because they have been at least partly
answered by others in the thread, and I didn't have anything valuable
to add at that time.

During that time, I was actively coding the core part of the change,
to check if it's possible technically.  Which is important, as far as
I believe, because not all theoretically possible solutions can be implemented
without facing significant practical or implementation-related issues, and
it seems to me that you significantly undervalue such an approach.

I do not say my actions were exemplary, but as far as I can tell, they're
pretty much in line with how svn-dev has been operating so far.  But it all
resulted in an unclear veto without any _technical_ arguments, where even
what is being vetoed is unclear, because the change was not ready at the
moment the veto was cast.

And because your veto goes in favor of a specific process (considering that
no other arguments were given), the only thing that's *actually* being
railroaded is an odd form of an RTC (review-then-commit) process that is
against our usual CTR (commit-then-review) [1,2].  That's railroading,
because it hasn't been explicitly discussed anywhere and a consensus
on it has not been reached.

[1] https://www.apache.org/foundation/glossary.html#CommitThenReview
[2] https://www.apache.org/foundation/glossary.html#ReviewThenCommit


Regards,
Evgeny Kotkov


Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

2023-01-29 Thread Evgeny Kotkov via dev
Daniel Shahaf  writes:

> > That could happen after a public disclosure of a pair of executable
> > files/scripts where the forged version allows for remote code execution.
> > Or maybe something similar with a file format that is often stored in
> > repositories and that can be executed or used by a build script, etc.
> >
>
> Err, hang on.  Your reference described a chosen-prefix attack, while
> this scenario concerns a single public collision.  These are two
> different things.

A chosen-prefix attack allows finding more meaningful collisions such as
working executables/scripts.  When such collisions are made public, they
would have a greater exploitation potential than just a random collision.

> Disclosure of of a pair of executable files/scripts isn't by itself
> a problem unless one of the pair ("file A") is in a repository
> somewhere.  Now, was the colliding file ("file B") generated _before_ or
> _after_ file A was committed?
>
> - If _before_, then it would seem Mallory had somehow managed to:
>
>   1. get a file of his choosing committed to Alice's repository; and
>
>   2. get a wc of Alice's repository into one of the codepaths that
>  assume SHA-1 is one-to-one / collission-free (currently that's the
>  ra_serf optimization and the 1.15 wc status).

Not only.  There are cases where the working copy itself installs the
working file via a hash lookup in the pristine store.  This is more true
for 1.14 than for trunk, because trunk has the streamy checkout/update
that avoids such lookups by writing straight to the working file.  However,
some code paths still install the contents from the pristine store by hash.
Examples include reverting a file, copying an unmodified file, switching
a file with keywords, the mentioned ra_serf optimization, etc.

>   Now, step #1 seems plausible enough.  As to step #2, it's not clear to
>   me how file B would reach the wc in step #2…

If Mallory has write access, she could commit both files, thus arranging for
a possible content change if both files are checked out to a single working
copy.  This isn't the same as just directly modifying the target file, because
file content isn't expected to change due to changes in other files (which
can be of any type), so this attack has a much better chance of going
unnoticed.

If Mallory doesn't have write access, there should be other vectors, such
as distributing a pair of files (harmless in the context of their respective
file formats) separately via two upstream channels.  Then, if both of the
upstream distributions are committed into a repository and their files are
checked out together, the content will change, allowing for a malicious
action.


Regards,
Evgeny Kotkov


Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

2023-01-22 Thread Evgeny Kotkov via dev
Daniel Shahaf  writes:

> > I can complete the work on this branch and bring it to a production-ready
> > state, assuming there are no objections.
>
> Your assumption is counterfactual:
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/202301.mbox/%3C20230119152001.GA27446%40tarpaulin.shahaf.local2%3E
>
> https://mail-archives.apache.org/mod_mbox/subversion-dev/202212.mbox/%3CCAMHy98NqYBLZaTL5-FAbf24RR6bagPN1npC5gsZenewZb0-EuQ%40mail.gmail.com%3E

I don't see any explicit objections in these two emails (here I assume that
if something is not clear to a PMC member, it doesn't automatically become
an objection).  If the "why?" question is indeed an objection, then I would
say it has already been discussed and responded to in the thread.

Now, returning to the problem:

As described in the advisory [1], we have a supported configuration that
makes data forgery possible:

- A repository with disabled rep-sharing allows storing different files with
  colliding SHA-1 values.
- Having a repository with disabled rep-sharing is a supported configuration.
  There may be a certain number of such repositories in the wild
  (for example, created with SVN < 1.6 and not upgraded afterwards).
- A working copy uses an assumption that the pristine contents are equal if
  their SHA-1 hashes are equal.
- So committing different files with colliding SHA-1 values makes it possible
  to forge the contents of a file that will be checked-out and used by the
  client.
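To illustrate the assumption in the last two points (a Python sketch with
hypothetical names; the real pristine store is wc.db plus on-disk pristine
files): content is deduplicated by its hash, so if two different contents
ever hash to the same key, the second is silently dropped and a later
install-by-hash produces the wrong bytes.  A deliberately truncated
two-hex-digit "hash" stands in for a colliding SHA-1 pair:

```python
import hashlib

def weak_hash(data: bytes) -> str:
    """Stand-in for SHA-1, truncated so that collisions are easy to find."""
    return hashlib.sha1(data).hexdigest()[:2]

class PristineStore:
    """Toy pristine store: contents are deduplicated by their hash key,
    i.e. equal hash is assumed to mean equal content."""

    def __init__(self):
        self._store = {}

    def add(self, content: bytes) -> str:
        key = weak_hash(content)
        self._store.setdefault(key, content)  # collision: second copy dropped
        return key

    def get(self, key: str) -> bytes:
        return self._store[key]

# Brute-force two different contents with the same (weak) hash.
a = b"harmless-0"
b = next(c for c in (b"payload-%d" % i for i in range(100000))
         if weak_hash(c) == weak_hash(a))
assert a != b and weak_hash(a) == weak_hash(b)

store = PristineStore()
store.add(a)
key_b = store.add(b)            # silently deduplicated against `a`
print(store.get(key_b) == b)    # → False: the wrong content comes back
```

With real SHA-1 the collision cannot be brute-forced like this, but
chosen-prefix attacks make constructing such a pair feasible, which is
exactly the scenario the advisory describes.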

I would say that this state is worrying in and of itself.

However, with the feasibility of chosen-prefix attacks on SHA-1 [2], it's
probably only a matter of time until the situation becomes worse.

That could happen after a public disclosure of a pair of executable
files/scripts where the forged version allows for remote code execution.
Or maybe something similar with a file format that is often stored in
repositories and that can be executed or used by a build script, etc.

[1] https://subversion.apache.org/security/sha1-advisory.txt
[2] https://sha-mbles.github.io/


Speaking of the proposed switch to SHA-256 or a different checksum, there's
an argument by contradiction: if we were designing the pristineless working
copy from scratch today, would we choose SHA-1 as the best available hash
that can be used to assert content equality?  If yes, how can one prove that?

> Objections have been raised, been left unanswered, and now implementation
> work has commenced following the original design.  That's not acceptable.
> I'm vetoing the change until a non-rubber-stamp design discussion has
> been completed on the public dev@ list.

I would like to note that vetoing a code modification should be accompanied
with a technical justification, and I have certain doubts that the above
arguments qualify as such:

https://www.apache.org/foundation/voting.html
[[[
To prevent vetoes from being used capriciously, the voter must provide
with the veto a technical justification showing why the change is bad
(opens a security exposure, negatively affects performance, etc. ).
A veto without a justification is invalid and has no weight.
]]]

(I'm not saying that the above rules have to be used in this particular case
 and that a veto is invalid, but still thought it's worth mentioning.)

Anyway, I'll stop working on the branch, because a veto has been cast.


Regards,
Evgeny Kotkov


Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

2023-01-19 Thread Evgeny Kotkov via dev
Karl Fogel  writes:

> Now, how hard would this be to actually implement?

To have a more or less accurate estimate, I went ahead and prepared the
first-cut implementation of an approach that makes the pristine checksum
kind configurable in a working copy.

The current implementation passes all tests in my environment and seems to
work in practice.  It is available on the branch:

  https://svn.apache.org/repos/asf/subversion/branches/pristine-checksum-kind

The implementation on the branch allows creating working copies that use a
checksum kind other than SHA-1.

The checksum kind is persisted in the settings table.  Upgraded working copies
of the older formats will have SHA-1 recorded as their pristine checksum kind
and will continue to use it for compatibility.  Newly created working copies
of the latest format (with --compatible-version=1.15 or --store-pristine=no),
as currently implemented, will use the new pristine checksum kind.

Currently, as a proof-of-concept, the branch uses salted SHA-1 as the new
pristine checksum kind.  For the production-ready state, I plan to support
using multiple new checksum types such as SHA-256.  I think that it would
be useful for future compatibility, because if we encounter any issues with
one checksum kind, we could then switch to a different kind without having
to change the working copy format.

One thing worth noting is that ra_serf contains a specific optimization for
the skelta-style updates that allows skipping a GET request if the pristine
store already contains an entry with the specified SHA-1 checksum.  Switching
to a different checksum type for the pristine entries is going to disable
that specific optimization.  Re-enabling it would require an update of the
server-side.  I consider this to be out of scope for this branch.
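The optimization in question can be sketched as follows (Python with
hypothetical names; the real logic lives in ra_serf's update editor): the
server reports the expected SHA-1 of the new file text, and the client only
issues a GET when no pristine with that checksum is already on hand.  Since
the lookup is keyed by SHA-1, a different pristine checksum kind disables it
until the server can report checksums of the new kind:

```python
import hashlib

def fetch_file_text(reported_sha1, pristine_store, issue_get):
    """Return the file text, skipping the GET when the pristine is cached.

    reported_sha1:  hex checksum of the new text, as sent by the server
    pristine_store: dict mapping sha1 hex digest -> pristine bytes
    issue_get:      callable performing the actual network request
    """
    cached = pristine_store.get(reported_sha1)
    if cached is not None:
        return cached                   # skelta update: no GET needed
    text = issue_get()
    pristine_store[reported_sha1] = text
    return text

text = b"file contents"
sha1 = hashlib.sha1(text).hexdigest()
requests = []
get = lambda: requests.append(1) or text

hit = fetch_file_text(sha1, {sha1: text}, get)   # cached: GET skipped
miss = fetch_file_text(sha1, {}, get)            # not cached: GET issued
print(hit == text, miss == text, len(requests))  # → True True 1
```

If the store is keyed by a salted or non-SHA-1 checksum, the server-reported
SHA-1 no longer matches any key, so every file falls into the `miss` path.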

I can complete the work on this branch and bring it to a production-ready
state, assuming there are no objections.


Thanks,
Evgeny Kotkov


Re: Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format

2022-12-29 Thread Evgeny Kotkov via dev
Karl Fogel  writes:

> Now, how hard would this be to actually implement?

I plan to take a more detailed look at that, but I'm currently on vacation
for the New Year holidays.


Thanks,
Evgeny Kotkov


Switching from SHA1 to a checksum type without known collisions in 1.15 working copy format (was: Re: Getting to first release of pristines-on-demand feature (#525).)

2022-12-20 Thread Evgeny Kotkov via dev
Karl Fogel  writes:

> > While here, I would like to raise a topic of incorporating a switch from
> > SHA1 to a different checksum type (without known collisions) for the new
> > working copy format.  This topic is relevant to the pristines-on-demand
> > branch, because the new "is the file modified?" check relies on the
> > checksum comparison, instead of comparing the contents of working and
> > pristine files.
> >
> > And so while I consider it to be out of the scope of the pristines-on-
> > demand branch, I think that we might want to evaluate if this is something
> > that should be a part of the next release.
>
> Good point.  Maybe worth a new thread?

[Moving discussion to a new thread]

We currently have a problem that a working copy relies on the checksum type
with known collisions (SHA1).  A solution to that problem is to switch to a
different checksum type without known collisions in one of the newer working
copy formats.

Since we plan on shipping a new working copy format in 1.15, this seems to
be an appropriate moment of time to decide whether we'd also want to switch
to a checksum type without known collisions in that new format.

Below are the arguments for including a switch to a different checksum type
in the working copy format for 1.15:

1) Since the "is the file modified?" check now compares checksums, leaving
   everything as-is may be considered a regression, because it would
   introduce additional cases where a working copy currently relies on
   comparing checksums with known collisions.

2) We already need a working copy format bump for the pristines-on-demand
   feature.  So using that format bump to solve the SHA1 issue might reduce
   the overall number of required bumps for users (assuming that we'll still
   need to switch from SHA1 at some point later).

3) While the pristines-on-demand feature is not yet released, upgrading
   with a switch to the new checksum type seems possible without requiring
   a network fetch.  But once some pristines become optional, we lose the
   ability to rehash all contents in place.  So we might find ourselves
   having to choose between two worse alternatives: either requiring a
   network fetch during upgrade or entirely prohibiting the upgrade of
   working copies with optional pristines.

Thoughts?


Thanks,
Evgeny Kotkov


Re: Getting to first release of pristines-on-demand feature (#525).

2022-12-13 Thread Evgeny Kotkov via dev
Evgeny Kotkov  writes:

> I think that the `pristines-on-demand-on-mwf` branch is now ready for a
> merge to trunk.  I could do that, assuming there are no objections.

Merged in https://svn.apache.org/r1905955

I'm going to respond on the topic of SHA1 a bit later.


Thanks,
Evgeny Kotkov


Re: Getting to first release of pristines-on-demand feature (#525).

2022-12-07 Thread Evgeny Kotkov via dev
Evgeny Kotkov  writes:

> > IMHO, once the tests are ready, we could merge it and release
> > it to the world.
>
> Apart from the required test changes, there are some technical
> TODOs that remain from the initial patch and should be resolved.
> I'll try to handle them as well.

I think that the `pristines-on-demand-on-mwf` branch is now ready for a
merge to trunk.  I could do that, assuming there are no objections.

  https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand-on-mwf

The branch includes the following:
– Core implementation of the new mode where required pristines are fetched
  at the beginning of the operation.
– A new --store-pristine=yes/no option for `svn checkout` that is persisted
  as a working copy setting.
– An update for `svn info` to display the value of this new setting.
– A standalone test harness that tests main operations in both
  --store-pristine modes and gets executed on every test run.
– A new --store-pristine=yes/no option for the test suite that forces all
  tests to run with a specific pristine mode.

The branch passes all tests in my Windows and Linux environments, in both
--store-pristine=yes and =no modes.


While here, I would like to raise a topic of incorporating a switch from
SHA1 to a different checksum type (without known collisions) for the new
working copy format.  This topic is relevant to the pristines-on-demand
branch, because the new "is the file modified?" check relies on the checksum
comparison, instead of comparing the contents of working and pristine files.
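
In pristine-less working copies, that check reduces to hashing the working
file and comparing against the recorded pristine checksum.  A simplified
Python sketch of the idea (hypothetical helper name, not the libsvn_wc
implementation):

```python
import hashlib

def file_modified(working_bytes, pristine_sha1_hex):
    """Pristines-on-demand style check: no pristine text is available
    locally, so compare the working file's SHA-1 against the recorded
    pristine checksum.  Two colliding files would compare as "unmodified"
    even though their contents differ, which is the concern above."""
    return hashlib.sha1(working_bytes).hexdigest() != pristine_sha1_hex

recorded = hashlib.sha1(b"hello\n").hexdigest()
print(file_modified(b"hello\n", recorded))   # False: checksums match
print(file_modified(b"hell0\n", recorded))   # True: checksums differ
```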

And so while I consider it to be out of the scope of the pristines-on-demand
branch, I think that we might want to evaluate if this is something that
should be a part of the next release.


Thanks,
Evgeny Kotkov


Re: svn commit: r1905663 - in /subversion/branches/pristines-on-demand-on-mwf/subversion: include/ include/private/ libsvn_client/ libsvn_wc/

2022-12-02 Thread Evgeny Kotkov via dev
Bert Huijben  writes:

> All the now deprecated functions now fail unconditionally when the setting
> is enabled. Isn’t it possible to do this more gracefully whenever a file is
> encountered which misses its pristine version?

The problem with this approach is that the functions are going to work, but
only *sometimes*, and will fail unpredictably, depending on whether a pristine
was fetched or removed by any previous API call.

With that in mind, failing consistently seems to be a more appropriate choice
for a deprecated function than failing randomly.
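
That design choice can be illustrated with a toy model (hypothetical names;
the real guard lives in libsvn_wc and raises
SVN_ERR_WC_DEPRECATED_API_STORE_PRISTINE):

```python
class DeprecatedApiError(Exception):
    pass

def deprecated_get_pristine(wc_store_pristines, pristine_present):
    """Model of a deprecated libsvn_wc entry point.  Guarding on the
    per-working-copy setting fails *consistently* for pristine-less
    working copies, instead of failing only when a particular pristine
    happens to be absent at the moment of the call."""
    if not wc_store_pristines:
        raise DeprecatedApiError("working copies without local pristines "
                                 "are not supported by this API")
    # With pristines stored, the text-base is always available.
    return pristine_present

print(deprecated_get_pristine(True, True))    # True
try:
    deprecated_get_pristine(False, True)      # fails regardless of presence
except DeprecatedApiError as e:
    print("error:", e)
```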

> As far as I know it is expected that some of the files do have pristines,
> while others don’t… That would allow things like diffs on old clients that
> didn’t switch apis yet.

Thinking about this, the currently documented assumption is that a file has
no pristine if and only if it's being added/replaced without a copyfrom.
So maybe we cannot really extend that to "any file might not have a pristine"
without it being an incompatible change.


Thanks,
Evgeny Kotkov


Re: svn commit: r1905663 - in /subversion/branches/pristines-on-demand-on-mwf/subversion: include/ include/private/ libsvn_client/ libsvn_wc/

2022-12-01 Thread Evgeny Kotkov via dev
Daniel Sahlberg  writes:

>> +  /** @since New in 1.15 */
>> +  SVN_ERRDEF(SVN_ERR_WC_DEPRECATED_API_STORE_PRISTINE,
>> + SVN_ERR_WC_CATEGORY_START + 43,
>> + "This client was not updated to support working copies "
>> + "without local pristines")
>> +
>>/* fs errors */
>
> Is it really "This client"? It looks more to be based on the WC setting.

This new error should only occur when someone (a 3rd party client) uses the
deprecated API function for a working copy without pristines.

From that point of view, I would say that the problem is in the client that is
still using the deprecated API and thus cannot properly handle such working
copies.

Also, the current message aims to be somewhat consistent with another existing
error for a case when a client is too old to read a newer working copy format:
 "This client is too old to work with the working copy […] (format 32)"

(After writing this, I realized that the "was not updated" part of the message
 may be misinterpreted as a requirement to update to the newer version of the
 software.  In r1905682, I slightly rephrased that part so that it would say:
   "This client uses a deprecated API that does not support working copies
without local pristines")


Thanks,
Evgeny Kotkov


Re: Getting to first release of pristines-on-demand feature (#525).

2022-11-16 Thread Evgeny Kotkov via dev
Karl Fogel  writes:

> Thank you, Evgeny!  Just to make sure I understand correctly --
> the status now on the 'pristines-on-demand-on-mwf' branch is:
>
> 1) One can do 'svn checkout --store-pristines=no' to get an
> entirely pristine-less working copy.  In that working copy,
> individual files will be hydrated/dehydrated automagically on an
> as-needed basis.
>
> 2) There is no command to hydrate or dehydrate a particular file.
> Hydration and dehydration only happen as a side effect of other
> regular Subversion operations.
>
> 3) There is no way to rehydrate the entire working copy.  E.g.,
> something like 'svn update --store-pristines=yes' or 'svn hydrate
> --depth=infinity' does not exist yet.
>
> 4) Likewise, there is no way to dehydrate an existing working copy
> that currently has its pristines (even if that working copy is at
> a high-enough version format to support pristinelessness).  E.g.,
> something like 'svn update --store-pristines=no' or 'svn dehydrate
> --depth=infinity' does not exist yet.
>
> Is that all correct?

Yes, I believe that is correct.

> By the way, I do not think (2), (3), and (4) are blockers.  Just
> (1) by itself is a huge step forward and solves issue #525;

+1 on keeping the scope of the feature to just (1) for now.

> IMHO, once the tests are ready, we could merge it and release
> it to the world.

Apart from the required test changes, there are some technical
TODOs that remain from the initial patch and should be resolved.
I'll try to handle them as well.


Thanks,
Evgeny Kotkov


Re: Getting to first release of pristines-on-demand feature (#525).

2022-11-15 Thread Evgeny Kotkov via dev
Evgeny Kotkov  writes:

> Perhaps we could transition into that state by committing the patch
> and maybe re-evaluate things from there.  I could do that, assuming
> no objections, of course.

Committed the patch in https://svn.apache.org/r1905324

I'll try to handle the related tasks in the near future.


Thanks,
Evgeny Kotkov


Re: Getting to first release of pristines-on-demand feature (#525).

2022-11-08 Thread Evgeny Kotkov via dev
Karl Fogel  writes:

> By the way, in that thread, Evgeny Kotkov -- whose initial work
> much of this is based on -- follows up with a patch that does a
> first-pass implementation of 'svn checkout --store-pristines=no'
> (by implementing a new persistent setting in wc.db).

Perhaps we could transition into that state by committing the patch
and maybe re-evaluate things from there.  I could do that, assuming
no objections, of course.


Thanks,
Evgeny Kotkov


Initial patch for storing the pristines-on-demand setting in the working copy (Issue 4889)

2022-04-17 Thread Evgeny Kotkov
Julian Foad  writes:

> Issue #4889 "per-WC config" is the subject of Johan's new dev@ post
> "Pristines-on-demand=enabled == format 32?". We already concurred that
> it's wise to decouple "pristines-on-demand mode is enabled in this WC"
> from "the WC format is (at least) 32 so can support that mode".
> <https://subversion.apache.org/issue/4889>. This may be considered
> higher priority than fixing the remaining tests.

I have been thinking about this recently, and here is a patch with the
first-cut implementation that persists the pristines-on-demand setting
in a working copy.  Unfortunately, I am getting a bit swamped with other
things to complete the work on it, but perhaps it could be useful as a
building block for the full implementation.

The patch currently allows doing an `svn checkout --store-pristines=no`,
which is going to create a working copy that doesn't store the pristine
copies of the files and fetches them on demand.  The setting is persisted
in wc.db.


The patch doesn't include the following:

1) An update for the tests and the test suite to run the tests in both modes.

   Personally, I think that we should update the test runner so that it would
   execute the tests for both pristine modes by default, without requiring any
   specific switches.  Because otherwise, there is a chance that one of the
   equally supported core configurations may receive far less attention
   during the development and test runs.

2) An update for `svn info` to display the value of the new setting.

3) An ability to take the --store-pristines value from a user config, perhaps
   on a per-URL basis.


While working on the patch, I have stumbled across a couple of issues:

A) `svn upgrade` without arguments fails for a working copy with latest format

  $ svn checkout --compatible-version=1.15 wc
  $ svn upgrade wc
  $ svn: E155021: Working copy '…' is already at version 1.15 (format 32)
and cannot be downgraded to version 1.8 (format 31)

  I haven't given it a lot of thought, but we might want to handle this case
  without an error or even think about making `svn upgrade` by default upgrade
  to the latest available version instead of the minimum supported (similar to
  `svnadmin upgrade`).

B) Shelving and pristines-on-demand

  It seems that both v2 and v3 shelving implementations are currently not
  updated to support pristines-on-demand working copies.

C) Bumping the related API

  This part originates from B).  For example, v3 shelving uses the libsvn_wc
  APIs, such as svn_wc_revert6().  If the working copy is created without the
  pristine contents, those calls are going to fail with an error saying that
  there is no text-base for the corresponding path.  This is a tricky error to
  understand, and the failure itself is unpredictable, because it depends on
  whether any of the previous API calls have fetched the missing text-bases.

  So if we think about v3 shelving as an example of the libsvn_wc API user,
  other existing third-party users of the API could face the same problem.

  Perhaps, we could bump the APIs that currently rely on the text-bases to
  always be available.  And we could then make their deprecated versions
  fail (predictably) for working copies that don't store pristine contents.


Thanks,
Evgeny Kotkov
Index: subversion/include/private/svn_wc_private.h
===
--- subversion/include/private/svn_wc_private.h (revision 1899956)
+++ subversion/include/private/svn_wc_private.h (working copy)
@@ -2231,9 +2231,11 @@ const svn_version_t *
 svn_wc__min_supported_format_version(void);
 
 /**
- * Set @a format to the format of the nearest parent working copy root of
- * @a local_abspath in @a wc_ctx, or to the oldest format of any root stored
- * there. If @a wc_ctx is empty, return the library's default format.
+ * Set @a *format_p and @a *store_pristines_p to the settings of the
+ * nearest parent working copy root of @a local_abspath in @a wc_ctx,
+ * or to settings of any root stored there, preferring the one with
+ * the oldest format. If @a wc_ctx is empty, return the library's
+ * default settings.
  *
  * Use @a scratch_pool for temporary allocations.
  *
@@ -2240,16 +2242,18 @@ svn_wc__min_supported_format_version(void);
  * @since New in 1.15.
  */
 svn_error_t *
-svn_wc__format_from_context(int *format,
-                            svn_wc_context_t *wc_ctx,
-                            const char *local_abspath,
-                            apr_pool_t *scratch_pool);
+svn_wc__settings_from_context(int *format_p,
+                              svn_boolean_t *store_pristines_p,
+                              svn_wc_context_t *wc_ctx,
+                              const char *local_abspath,
+                              apr_pool_t *scratch_pool);
 
 /**
  * Ensure that an administrative area exists for @a local_abspath, so that @a
  * local_abspa

Re: svn commit: r1899341 - in /subversion/branches/1.14.x: ./ STATUS subversion/libsvn_subr/io.c

2022-04-13 Thread Evgeny Kotkov
Branko Čibej  writes:

> > 1) Previously, the check for EINVAL within the svn_io_file_flush_to_disk()
> > function only happened in the #else block, so it did not affect the
> > behavior on Windows.  With the change, the check happens unconditionally
> > on all platforms.
>
> You're right, I hadn't considered that. And Windows behaves sufficiently
> differently that the EINVAL check as it stands is not appropriate. Would
> you consider putting that entire check in an #ifdef an appropriate fix?

Yes, I think it would be fine if we restored the pre-r1883355 way of handling
EINVAL under an #ifdef.

> > 2) Unless I am mistaken, this change replaces an fcntl(F_FULLFSYNC) with
> > fdatasync() when the latter is supported.
>
> Depending platform yes, APR may pick fdatasync() over fcntl(F_FULLFSYNC).

I think that since fdatasync() has lower integrity guarantees, starting to
unconditionally use it in the existing API might be problematic.

For example, the API users (ourselves and third-party) may implicitly rely on
the current integrity guarantees and assume that metadata is flushed to disk.
And in this case it's not limited to just the svn_io_file_flush_to_disk()
function, because there are other parts of the public API that call
svn_io_file_flush_to_disk() internally — such as svn_io_file_rename2()
or svn_io_write_atomic2().

If the original intention of this change was to save I/O in cases where we
would be fine with just flushing the file contents, then perhaps we could do
something like svn_io_file_flush_to_disk2(…, svn_boolean_t sync_metadata)
that uses apr_file_datasync() if sync_metadata is false — and incrementally
start passing sync_metadata=FALSE on the calling sites where it's safe to
do so.

But if the original intention was to fully switch to APR functions, I am not
too sure if we can do so, because both apr_file_datasync() and apr_file_sync()
have lower integrity guarantees in some cases, compared to the previous code.
(apr_file_datasync() — due to calling fdatasync() in some cases and
 apr_file_sync() due to always doing just an fsync() instead of F_FULLFSYNC
 when it's supported).


Thanks,
Evgeny Kotkov


Re: svn commit: r1899341 - in /subversion/branches/1.14.x: ./ STATUS subversion/libsvn_subr/io.c

2022-04-11 Thread Evgeny Kotkov
svn-role  writes:

> Merge r1883355 from trunk:
>
>  * r1883355
>Use the APR-1.4+ API for flushing file contents to disk.
>Justification:
>  Reduce code duplication between APR and SVN.
>Votes:
>  +1: brane, jun66j5, markphip
…
> +  do {
> +apr_err = apr_file_datasync(file);
> +  } while(APR_STATUS_IS_EINTR(apr_err));
> +
> +  /* If the file is in a memory filesystem, fsync() may return
> + EINVAL.  Presumably the user knows the risks, and we can just
> + ignore the error. */
> +  if (APR_STATUS_IS_EINVAL(apr_err))
> +return SVN_NO_ERROR;
> +
> +  if (apr_err)
> +return svn_error_wrap_apr(apr_err,
> +  _("Can't flush file '%s' to disk"),
> +  try_utf8_from_internal_style(fname, pool));

A few thoughts on this change:

1) Previously, the check for EINVAL within the svn_io_file_flush_to_disk()
   function only happened in the #else block, so it did not affect the
   behavior on Windows.  With the change, the check happens unconditionally
   on all platforms.

   On Windows, APR_STATUS_IS_EINVAL() is defined as follows:

#define APR_STATUS_IS_EINVAL(s)  ((s) == APR_EINVAL \
            || (s) == APR_OS_START_SYSERR + ERROR_INVALID_ACCESS \
            || (s) == APR_OS_START_SYSERR + ERROR_INVALID_DATA \
            || (s) == APR_OS_START_SYSERR + ERROR_INVALID_FUNCTION \
            || (s) == APR_OS_START_SYSERR + ERROR_INVALID_HANDLE \
            || (s) == APR_OS_START_SYSERR + ERROR_INVALID_PARAMETER \
            || (s) == APR_OS_START_SYSERR + ERROR_NEGATIVE_SEEK)

   So with this change, all of these error codes are now going to be ignored,
   and the svn_io_file_flush_to_disk() function will return SVN_NO_ERROR,
   indicating the success of flushing the data to disk.  Some of these error
   codes, such as ERROR_INVALID_HANDLE, are quite generic.
   I would think that this change opens a data corruption possibility in
   the following case:

   - A real error happens during FlushFileBuffers().
   - It gets translated into one of the error codes above by the I/O stack.
   - The error is ignored in the implementation of svn_io_file_flush_to_disk().
   - The caller of svn_io_file_flush_to_disk() interprets this situation as a
 successful data flush.
   - The caller continues to work under the assumption that the data was successfully
 flushed to disk, whereas in fact it was not. For example, by committing
 a repository transaction.
   - A power loss after the transaction commit causes the data to be lost or
 corrupted.

2) Unless I am mistaken, this change replaces an fcntl(F_FULLFSYNC) with
   fdatasync() when the latter is supported.  So assuming that both operations
   are supported, the new behavior of the svn_io_file_flush_to_disk() function
   has weaker integrity guarantees than before, because fdatasync() does not
   ensure that metadata such as the last modification time is written to disk.

   I haven't done a thorough check if we rely on mtime being flushed to the
   disk in our code, but even if we don't, svn_io_file_flush_to_disk() is part
   of the public API, so altering its integrity guarantees in a patch release
   might be unexpected for its callers.

Thoughts?

(A small disclaimer: these changes have slipped past my attention until now,
 when I tried updating to 1.14.2.  But better late than sorry…)


Thanks,
Evgeny Kotkov


Re: Pristines-on-demand: authz denied during textbase sync (#4888)

2022-04-06 Thread Evgeny Kotkov
Julian Foad  writes:

> I suggest:
>
> - revert the patch I applied, as it's papering over the problem in an
> incomplete way and so possibly causes more confusion than it fixes.
>
> - leave this issue open and come back to it later; it's an edge case not
> part of common work flows.

+1.


Regards,
Evgeny Kotkov


Re: Issue #525/#4892: on only fetching the pristines we really need

2022-03-11 Thread Evgeny Kotkov
Julian Foad  writes:

> Conclusions:
> 
>
> It is certainly possible that we could modify "update" and the other
> "online" operations, at least, and the previously "offline" operations
> too if we want, to make them fetch pristines at point-of-use in this way.
>
> Such modifications are not trivial. There is the need to run additional
> RA requests in between the existing ones, perhaps needing an additional
> RA session to be established in parallel, or taking care with inserting
> RA requests into an existing session.

I think that this part has a lot of protocol constraints and hidden complexity.
And things could probably get even more complex for merge and diff.

Consider a bulk update report over HTTP, which is just a single response that
has to be consumed in a streamy fashion.  There is no request multiplexing,
and fetching data through a separate connection is going to limit the maximum
size of a pristine file that can be downloaded without receiving a timeout on
the original connection. Assuming the default HTTP timeout of httpd 2.4.x
(60 seconds) and 100 MB/s data transfer rate, the limit for a pristine size
is going to be around 6 GB.
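
The back-of-the-envelope figure follows directly from those two assumptions
(a sketch of the arithmetic, using decimal units):

```python
timeout_s = 60          # httpd 2.4.x default Timeout directive
rate_mb_per_s = 100     # assumed sustained transfer rate on the side channel

# Largest pristine that can be fetched before the idle original
# connection hits the timeout:
limit_gb = timeout_s * rate_mb_per_s / 1000
print(limit_gb)         # 6.0
```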

This kind of problem probably isn't limited to just this specific example and
protocol, considering, for example, that an update editor driver transfers
control at certain points (e.g., during a merge) and thus cannot keep reading
the response.

When I was working on the proof-of-concept, encountering these issues stopped
me from considering the approach with fetching pristines at the point of access
as being practically feasible.  That also resulted in the alternative approach,
initially implemented on the `pristines-on-demand` branch.

Going slightly off-topic, I tend to think that even despite its drawbacks,
the current approach on the branch should work reasonably well for the MVP.

To elaborate on that, let me first share a few assumptions and thoughts I had
in mind at that point of time:

1) Let's assume a high-throughput connection to the server, 1 Gbps or better.

   With slower networks, working with large files is going to be problematic
   just by itself, so that might be thought as being out of scope.

2) Let's assume that a working copy contains a large number of blob-like files.

   In other words, there are thousands of 10-100 MB files, as opposed to one
   or two 50 GB files.

3) Let's assume that in a common case, only a small fraction of these files
   are modified in the working copy.

Then for a working copy with 1,000 files of 100 MB each and 10 modified files:

A) Every checkout saves 100 GB of disk space; that's pretty significant for
   a typical solid state drive.

B) Hydrating won't transfer more than 1 GB of data, or 10 seconds under an
   optimistic assumption.

C) For a more uncommon case with 100 modified files, it's going to result in
   10 GB of data transferred and about two minutes of time; I think that's
   still pretty reasonable for an uncommon case.
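
The estimates in (A)-(C) can be checked with a few lines of arithmetic
(decimal units, 1 Gbps taken as 125 MB/s; a sketch, not a benchmark):

```python
file_size_mb = 100
n_files = 1000
rate_mb_per_s = 125      # roughly 1 Gbps

disk_saved_gb = n_files * file_size_mb / 1000   # (A) pristine storage avoided

def hydrate_cost(n_modified):
    """GB transferred and seconds spent hydrating the modified files."""
    mb = n_modified * file_size_mb
    return mb / 1000, mb / rate_mb_per_s

print(disk_saved_gb)      # 100.0 GB saved per checkout
print(hydrate_cost(10))   # (1.0, 8.0): about 1 GB in under 10 seconds
print(hydrate_cost(100))  # (10.0, 80.0): about 10 GB in under two minutes
```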

So while the approach used in the proof-of-concept might be non-ideal, I tend
to think it should work reasonably well in a variety of use cases.

I also think that this approach should even be releasable in the form of an
MVP, accompanied with a UI option to select between the two states during
checkout (all pristines / pristines-on-demand) that is persisted in the
working copy.

A small final note is that I could be missing some details or other cases,
but in the meantime I felt like sharing the thoughts I had while working on
the proof-of-concept.


Thanks,
Evgeny Kotkov


Re: [PATCH] Sketch of per-user/per-wc config for pristines-mode

2022-02-10 Thread Evgeny Kotkov
Julian Foad  writes:

> Some people said it should be in wc.db. Evgeny wrote, "this sounds like
> a property associated with a specificwc_id in the database.  I would say
> that this pretty much rules out optionsof storing it outside the wc.db."
> But (Brane wrote) "WC_ID is hardcoded to 1 pretty much everywhere."
>
> What do you all make of this now? Is a new WC config file plausible?

I tend to think that using a per-wc config file could potentially introduce
more problems than it solves.

A few examples:

- Even if WC_ID is currently hardcoded to 1, I think that storing the new
  property without associating it with a WC_ID would probably circumvent
  this part of the working copy design.  In other words, if at some point the
  database starts to use more than one WC_ID, we would probably have to do
  something about a config option that works on a per-WC_ID basis but isn't
  persisted as such.

- A user-editable config option implies that it can be changed externally
  and that the working copy has to cope with such a possibility.  This rules
  out a relatively simple approach where the property is set when a working
  copy is created and is kept immutable afterwards.

  (This approach also allows entirely skipping the textbase sync by seeing
   that the stored pristines-on-demand property is false.  Compared to that,
   the approach with a config option implies having to do a more thorough
   check because some of the pristines might be missing.)

- To change this option programmatically, one would have to update an existing
  config file, which could be a non-trivial technical task, considering things
  like persisting comments and whitespace trivia.


Thanks,
Evgeny Kotkov
(On vacation for a couple of weeks, so might not be able to respond promptly.)


Re: A two-part vision for Subversion and large binary objects.

2022-01-28 Thread Evgeny Kotkov
Julian Foad  writes:

> SEMANTICS
>
> The user can choose one mode, per WC, from a list of options that may
> include:
>
>   - off: as in previous versions, no checking, just assume all pristines
> are present
>   - pristines-on-demand: fetch "wanted" pristines; discard others
>   - fetch-only: fetch any pristine that's absent; do not discard

My two cents on this would be that it might be easier to start with an on/off
property that gets stored when a working copy is created and doesn't change
afterwards, at least for now.

> Help please: Where should we properly store this setting in the WC?
>
> - in '.svn/entries' or '.svn/format'?
>   (Both currently contain a single line saying "12". We could add an
> extra line, or in general N extra lines each with one setting, for example.)
> - in a new file such as '.svn/pristines-on-demand'?
> - in the wc.db somewhere?

Thinking out loud, this sounds like a property associated with a specific
wc_id in the database.  I would say that this pretty much rules out options
of storing it outside the wc.db.


Thanks,
Evgeny Kotkov


Re: A two-part vision for Subversion and large binary objects.

2022-01-28 Thread Evgeny Kotkov
Julian Foad  writes:

> We could swap the scanning logic around to do the (quick) check for
> missing pristines before deciding whether a (slower) file "stat" is
> necessary. Positives: eliminates the redundant "stat" overhead which may
> be significant in working trees containing many files. Negatives: some
> re-work needed in the current implementation.
>
> Of these, the last one currently looks viable and useful.
>
> Does that one look promising to you?

I might be missing something, but I don't yet see how we could save a stat().

Currently, a pristine is hydrated if and only if the corresponding working
file is modified.  Let's say we check if a pristine is hydrated beforehand.
If we find out that pristine is dehydrated, we have to stat(), because if the
file is modified, then we need to hydrate.  If we find out that pristine is
hydrated, we still have to stat(), because if the file is no longer modified,
then we need to dehydrate.
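
In other words, the stat() result is an input to the decision on both
branches.  A small sketch of the decision table (hypothetical names):

```python
def sync_action(pristine_hydrated, working_file_modified):
    """What a textbase sync must do for one file; the invariant is
    "hydrated if and only if the working file is modified".  The second
    argument comes from a stat() of the working file, and it is needed
    regardless of the current pristine state."""
    if not pristine_hydrated:
        return "hydrate" if working_file_modified else "keep-dehydrated"
    return "dehydrate" if not working_file_modified else "keep-hydrated"

for hydrated in (False, True):
    for modified in (False, True):
        print(hydrated, modified, "->", sync_action(hydrated, modified))
```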


Thanks,
Evgeny Kotkov


Re: A two-part vision for Subversion and large binary objects.

2022-01-27 Thread Evgeny Kotkov
Julian Foad  writes:

> Scanning with 'stat'
>
> I'm concerned about the implementation scanning the whole subtree,
> calling 'stat' on every file to determine whether the file is "changed"
> (locally modified). This is done in svn_wc__textbase_sync() with its
> textbase_walk_cb().
>
> It does this scan on every sync, which is twice on every syncing
> operation such as diff.
>
> Don't we already have an optimised scan for local modifications
> implemented in the "status" code? Could we re-use this?

In a few of my experiments, performance of textbase_sync() was more or
less comparable to a status walk.  So maybe it's not actually worthwhile
spending time on improving this part, at least for now.

Also, I tend to think that DRY doesn't really apply here, because a status
walk and a textbase sync are essentially different operations that just
happen to have something in common internally.  For example, a textbase
sync doesn't have to follow the tree structure and can be implemented
with an arbitrarily ordered walk over NODES.

> Premature Hydrating
>
> The present implementation "hydrates" (fetches missing pristines) every
> file within the whole subtree the operation targets. This is done by
> every major client operation calling svn_client__textbase_sync() before
> and afterwards.
>
> That is pessimistic: the operation may not actually touch all these
> files if limited in any way such as by
>
>   - depth filtering
>   - other filtering (changelist, properties-only, ...)
>   - terminating early (e.g. output piped to 'head')
>
> That introduces all the fetching overhead for the given subtree as a
> latency before the operation shows its results, which for something
> small at the root of the tree such as "svn diff --depth=empty
> --properties-only ./" may make a significant usability impact.
>
> Presumably we could add the depth and some other kinds of filtering to
> the tree walk. But that will always leave terminating early, and
> possibly other cases, sub-optimal.
>
> I would prefer a solution that defers the hydrating until closer to the
> moment of demand.

I think that fetching the pristine contents at the moment of demand is a
particularly problematic concept to pursue, because it implies that there is
a network request that can now happen at an unpredictable moment in time.
So any operation that may access the pristine contents has to be ready for
a network fetch.  Compared to that, fetching the required pristines before
the operation does not impose that kind of requirement on the existing code.


Thanks,
Evgeny Kotkov


Re: A two-part vision for Subversion and large binary objects.

2022-01-13 Thread Evgeny Kotkov
Julian Foad  writes:

> Hello everyone. Thanks to sponsorship arranged by Karl, I'm able to work
> on completing this.

Hi Julian,

It's great to have you on board!

> Important Question:
>
>   * Is the approach with a WC format bump the best approach?
>
> I see two options to consider:
>
>   * with format bump like this, plus completing the multi-WC-format
> branch work
>   * re-work the implementation of the concept, to not change the
> declared format number or DB schema, but instead to query "is the file
> present on disk" instead of the DB 'hydrated' flag
>
> For each of those options,
>
>   * what is the outcome in terms of interoperability?
>   * how much work is involved? (or is the 2nd one impossible?)

Between these two options, I don't think that it would be actually possible to
have the pristines-on-demand functionality available without a format bump,
as the existing clients working with the current format rely on the pristine
contents always being available on disk.

Personally, I tend to think that making a format bump and accompanying it with
the multi-wc-format support should work pretty well from a user's standpoint.

> Right now I'm reviewing this work and the long discussion about it and
> I will come back with a summary and plan for your consideration, and no
> doubt many questions, in the next few days.

Perhaps, one way to start off would be to fix the temporary assumption that
any working copy of the newer format always fetches the pristines on-demand.

While the final way of configuring that undoubtedly is an important question to
answer, there might be a way to make progress while keeping the configuration
simple, at least for now:

— Make the "are pristines being fetched on-demand" a property of the
  working copy
— Introduce a new configuration option that controls this new behavior
  and is applied at the moment when a working copy is created

I think that carrying these steps would bring the feature closer to its final
shape, and would also mean that a lot of the important subsurface work (such
as preserving the old behavior for existing working copies, running the tests
for both configurations, etc.) is already in place, so that we wouldn't have
to think about it further on.


Thanks,
Evgeny Kotkov


Re: [Regression in httpd 2.4.49] mod_dav: REPORT requests no longer return errors

2021-09-22 Thread Evgeny Kotkov
Ruediger Pluem  writes:

> Does the attached patch solve your issue?

It does appear to solve the problem with missing errors, thanks!

I haven't checked that in detail, but I think there might be a discrepancy
in how `err` is handled in the patch compared to, for example, the call to
the method_precondition() hook.

With the patch, `err` is checked even if all hooks DECLINE the operation.
Not too sure if that's intended, because the variable could potentially
contain an arbitrary value or a leftover from some previous call.


Thanks,
Evgeny Kotkov


[Regression in httpd 2.4.49] mod_dav: REPORT requests no longer return errors

2021-09-20 Thread Evgeny Kotkov
Hi,

I think that I have stumbled across a significant regression in httpd 2.4.49
where mod_dav has been changed in a way that makes it ignore the errors
returned by a versioning provider during REPORT requests.

I haven't checked that in detail, but I think that r1892513 [1] is the change
that is causing the new behavior.

Consider the new core part of the dav_method_report() function:

…
result = dav_run_deliver_report(r, resource, doc,
                                r->output_filters, &err);
switch (result) {
case OK:
    return DONE;
case DECLINED:
    /* No one handled the report */
    return HTTP_NOT_IMPLEMENTED;
default:
    if ((err) != NULL) {
        … handle the error
    }

Assuming there are no deliver_report hooks, this code is going to call
dav_core_deliver_report(), whose relevant part is as follows:

…
if (vsn_hooks) {
    *err = (*vsn_hooks->deliver_report)(r, resource, doc,
                                        r->output_filters);
    return OK;
}
…

Here the "return OK" part skips the error handling on the calling site,
even if there is an associated error returned in *err.

In turn, this causes the following regression:

- For a failed request where the server hasn't started sending the response,
  it is now going to erroneously send a 200 OK with an empty body instead of
  an error response with the appropriate HTTP status.

- For a failed request where the server has started sending the response body,
  it is now going to handle it as if it was successfully completed instead of
  aborting the connection.  The likely outcome of this is that the server is
  going to send a truncated payload with a 2xx status indicating success.

This regression can cause unexpected behavior of the Subversion clients,
for example in cases where the (ignored) error occurred due to a non-successful
authorization check.  Other DAV clients may be susceptible to some kinds of
unexpected behavior as well.

[1] https://svn.apache.org/r1892513


Thanks,
Evgeny Kotkov


Re: A two-part vision for Subversion and large binary objects.

2021-08-27 Thread Evgeny Kotkov
Evgeny Kotkov  writes:

> The current state is that the commit is going to fetch the original
> text-base before sending the delta.
>
> It should be technically possible to implement a behavior where the commit
> sends a delta if the local text-base is present, and a fulltext otherwise.

This behavior where a commit does not fetch the text-bases, but only uses
those that are available locally, can be enabled with the following patch:

[[[
Index: subversion/libsvn_client/commit.c
===
--- subversion/libsvn_client/commit.c   (revision 1892661)
+++ subversion/libsvn_client/commit.c   (working copy)
@@ -729,7 +729,7 @@ svn_client_commit6(const apr_array_header_t *targe
 goto cleanup;

   cmt_err = svn_error_trace(
-svn_client__textbase_sync(lock_root, TRUE, TRUE,
+svn_client__textbase_sync(lock_root, FALSE, TRUE,
   ctx, iterpool));
   if (cmt_err)
 goto cleanup;
]]]


Thanks,
Evgeny Kotkov


Re: A two-part vision for Subversion and large binary objects.

2021-08-27 Thread Evgeny Kotkov
Mark Phippard  writes:

> > (Although as far as I can tell, we would have to extend the delta editor
> >  to allow us to send a fulltext against an existing file, and the
> >  optimization would only work for newer servers.)
>
> This is just from memory of a list conversation that probably happened
> 10+ years ago now but ... if you use a generic WebDAV client with a
> SVN server and you commit changes to an existing file the WebDAV
> client is just sending the server the latest full text (no deltas) and
> the server knows how to handle it. So theoretically some of what is
> needed to support this sort of commit ought to already exist in our
> server code.

With a bit more thought, sending a fulltext even to existing servers
indeed shouldn't be an issue, as it can happen in the form of a delta
against the empty source.


Thanks,
Evgeny Kotkov


Re: A two-part vision for Subversion and large binary objects.

2021-08-27 Thread Evgeny Kotkov
Mark Phippard  writes:

> Suppose I am using this feature for binaries, which I think is the
> main use case. Using whatever tool I produce a modified version of my
> binary file. At this point in time, there is nothing in the text-base
> because I have not done any SVN operations other than checkout.  Since
> it is a binary, other than maybe running status my next likely command
> is going to be commit. Are you saying that during the processing of
> this commit the client will first fetch the original text base from
> the server before doing the commit? That seems sub-optimal.  Perhaps
> you are only doing this for diff or revert which would make perfect
> sense?

The current state is that the commit is going to fetch the original text-base
before sending the delta.

It should be technically possible to implement a behavior where the commit
sends a delta if the local text-base is present, and a fulltext otherwise.

(Although as far as I can tell, we would have to extend the delta editor to
 allow us to send a fulltext against an existing file, and the optimization
 would only work for newer servers.)

But at this point, I'm not too sure that we should make the commit special
in that aspect.  This optimization is only going to work reliably if the only
things you do are checkout and commit, because an update with modified
files is going to fetch those text-bases by itself.  And the update itself
could happen in response to trying to do an out-of-date commit.

> Aside from all of that ... do the text bases get cleaned up? For
> example at the end of a successful commit does it get rid of that text
> base that no longer are current?

Yes.  The text-base file corresponding to the new version of the file is not
going to be written to the disk.  The text-base file corresponding to the
old version gets cleaned up if it is not referenced by any other remaining
modified files.


Thanks,
Evgeny Kotkov


Re: A two-part vision for Subversion and large binary objects.

2021-08-27 Thread Evgeny Kotkov
Karl Fogel  writes:

> 1) Make pristine text-base files optional.  See issue #525 for
> details.  In summary: currently, every large file uses twice the
> storage on the client side, and yet for most of these files
> there's little benefit.  They're usually not plaintext, so 'svn
> diff' against the pristine base is pointless (unless you have some
> specialized diff tool for the particular binary format, but that's
> rare), and 'svn commit' likewise just sends up the whole working
> file.  The only thing a local base gets you is local 'svn revert',
> which can be nice, but many of us would happily give it up for
> large files to avoid the 2x local storage cost.

A proof-of-concept implementation of one possible approach for making
the text-bases optional is available on the branch:

  https://svn.apache.org/repos/asf/subversion/branches/pristines-on-demand

While still being rough on the edges, it passes the whole test suite in
my environment and seems to work quite well in practice.

A few notes on the current state:

- The implementation includes a bump of the working copy format.

- Although this would most certainly have to change in the future, for now
  any working copy of the new format is considered to always have optional
  text-bases (for easier testing, etc.).

- There is no UI and no configuration yet, just the first cut of the core part.

The core idea is that we maintain the following invariant: only the modified
files have their pristine text-base files available on the disk.

- To avoid having to access the text-base, the "is the file modified?" check
  is performed by calculating the checksum of a file and comparing that to
  what's recorded in the working copy.

- A text-base of the unmodified file is the file itself, appropriately
  detranslated.

- To get into the appropriate state at the beginning of the operation, we walk
  through the current text-base info in the db and check if the corresponding
  working files are modified.  The missing text-bases are fetched using the
  svn_ra layer.  The operations also include a final step during which the
  no longer required text-bases are removed from disk.

- The operations that don't need to access the text-bases (such as "svn ls"
  or the updated "svn st") do not perform this walk and do not synchronize
  the text-base state.

The pros and cons of the approach could probably be summarized as follows:

Pros:

 1. A working copy without modifications does not store the text-bases,
thus avoiding the 2x local storage cost that could be significant
in case of large files.  A working copy with modifications only stores
the text-bases for the modified files.

 2. The fetched text-bases for modified files are reused across operations.

For example, calling "svn diff" twice is only going to fetch the required
text-bases once.  In case of a third-party client that makes such calls
N times under the assumption that they are "local", the fetch is going to
only happen once, rather than N times.

 3. With a generic approach like this, we might have fewer issues with
introducing any kind of new working copy functionality that depends on
the text-bases.

Cons:

 1. The worst-case scenario for this approach (where every single file in the
working copy is modified) would work by fetching all of the text-bases.

And while that is going to take just as much storage as it currently does,
there’s the added cost of the fetch itself.

 2. The missing text-bases will be fetched before the operation. Fetching very
large text-bases may take a significant amount of time and may have an
impact on the user experience, although that might be alleviated by
providing a UI of some kind.

 3. Even without the fetch, the "synchronize the text-base state" part has
some performance penalty.  This penalty will only occur in the working
copies with optional text-bases and only if the operation synchronizes
the text-base state.

The added cost is going to be of the order of two "svn st" calls for
the target of the operation.

Overall, I tend to think that this approach should work reasonably well even
despite these cons, and that it is going to solve the issue in those cases
where the 2x local storage requirement is a problem.

So, how does that sound in general?

If we were to try to get this to a production-ready state, it would probably
make sense to also:

 1. Complete the work on ^/subversion/branches/multi-wc-format so that the
client would work with both the new and old working copy formats, for
a seamless user experience and better compatibility.

 2. For the new working copy format, incorporate a switch to a different
checksum type without known collisions instead of SHA-1.

 3. Fix the minor issues written down as TODOs in the code.


Thanks,
Evgeny Kotkov


Re: svn commit: r1886490 - in /subversion/trunk/subversion: include/private/ libsvn_subr/ libsvn_wc/ tests/cmdline/ tests/libsvn_subr/ tests/libsvn_wc/

2021-03-11 Thread Evgeny Kotkov
Evgeny Kotkov  writes:

> My attempt to fix this is by making checkout stream data to both pristine
> and the (projected) working file, so that the actual install would then
> happen as just an atomic rename.

For the record, I made a few additional changes, including a significant
overhaul of the install stream's timestamp handling in r1886774 [1].

> Noting that this change only fixes "svn checkout", but not "svn export".
> Export uses a separate implementation of the delta editor, and it should
> be possible to update it in a similar way — but I'm leaving that for future
> work for now.

For this part with "svn export", r1886843 [2] should do the trick.

While here, r1887108 [3] begins using the new file writer utility also
when exporting from an existing working copy.

[1] https://svn.apache.org/r1886774
[2] https://svn.apache.org/r1886843
[3] https://svn.apache.org/r1887108


Thanks,
Evgeny Kotkov


Re: svn commit: r1886490 - in /subversion/trunk/subversion: include/private/ libsvn_subr/ libsvn_wc/ tests/cmdline/ tests/libsvn_subr/ tests/libsvn_wc/

2021-02-18 Thread Evgeny Kotkov
Daniel Shahaf  writes:

> Thanks for checking.  How about adding it somewhere, then?

Yes, that will probably be useful.  Would you have an opportunity to do so?


Thanks,
Evgeny Kotkov


Re: svn commit: r1886490 - in /subversion/trunk/subversion: include/private/ libsvn_subr/ libsvn_wc/ tests/cmdline/ tests/libsvn_subr/ tests/libsvn_wc/

2021-02-18 Thread Evgeny Kotkov
Daniel Shahaf  writes:

> > Noting that this change only fixes "svn checkout", but not "svn export".
> > Export uses a separate implementation of the delta editor, and it should
> > be possible to update it in a similar way — but I'm leaving that for
> > future work for now.
>
> Are this problem and its proposed solution documented in a jira ticket?

No, I don't think so.

(Apparently, SVN-3264 [1] describes the checkout case, but I haven't done
a thorough search to see if export is mentioned somewhere.)

[1] https://issues.apache.org/jira/browse/SVN-3264


Regards,
Evgeny Kotkov


Re: svn commit: r1886490 - in /subversion/trunk/subversion: include/private/ libsvn_subr/ libsvn_wc/ tests/cmdline/ tests/libsvn_subr/ tests/libsvn_wc/

2021-02-13 Thread Evgeny Kotkov
Evgeny Kotkov  writes:

> URL: http://svn.apache.org/viewvc?rev=1886490=rev
> Log:
> In the update editor, stream data to both pristine and working files.
...
> Several quick benchmarks:
>
>   - Checking out subversion/trunk over https://
>
> Total time:  3.861 s  →  2.812 s
>Read IO:  57322 KB  →  316 KB
>   Write IO:  455013 KB  →  359977 KB
>
>   - Checking out 4 large binary files (7.4 GB) over https://
>
> Total time:   91.594 s   →  70.214 s
>Read IO:   7798883 KB  →  19 KB
>   Write IO:   15598167 KB  →  15598005 KB

Hey everyone,

Here's an improvement I have been working on recently.

Apparently, the client has an (implicit) limit on the size of directories that
can be safely checked out over HTTP without hitting a timeout.  The problem is
that when the client installs the new working files, it does so in a separate
step.  This step happens per-directory and involves copying and possibly
translating the pristine contents into new working files.  While that happens,
nothing is read from the connection.  So the amount of work that can be done
without hitting a timeout is limited.

Assuming the default HTTP timeout = 60 seconds of httpd 2.4.x and a relatively
fast disk, that puts the limit at around 6 GB for any directory.  Not cool.

My attempt to fix this is by making checkout stream data to both pristine and
the (projected) working file, so that the actual install would then happen as
just an atomic rename.  Since we now never stop reading from the connection,
the timeouts should no longer be an issue.  The new approach also has several
nice properties, such as not having to re-read the pristine files, not
interfering with the network-level buffering, TCP slow start, etc.
I see that it reduces the amount of both read and write I/O during all
checkouts, which should give a mild overall increase of how fast the
checkouts work.
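The core of the scheme can be sketched like this (a Python sketch with a hypothetical helper; the real code is a C install-stream in libsvn_wc that additionally handles keyword/EOL translation):

```python
import os
import tempfile

def streamed_checkout(chunks, pristine_path, working_path):
    """Write incoming content to both the pristine file and a temporary
    working file as it arrives off the network, then install the working
    file with an atomic rename -- so the connection is drained
    continuously and the per-file install step becomes O(1)."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(working_path) or ".")
    with os.fdopen(fd, "wb") as tmp, open(pristine_path, "wb") as pristine:
        for chunk in chunks:       # data straight from the connection
            pristine.write(chunk)  # text-base copy
            tmp.write(chunk)       # (projected) working file
    os.replace(tmp_path, working_path)  # atomic install, no re-read
```

This is why the pristine contents never need to be re-read during install, unlike the previous per-directory copy-and-translate step.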

Noting that this change only fixes "svn checkout", but not "svn export".
Export uses a separate implementation of the delta editor, and it should
be possible to update it in a similar way — but I'm leaving that for future
work for now.


Thanks,
Evgeny Kotkov


Re: Bug review of the week

2020-01-08 Thread Evgeny Kotkov
Nathan Hartman  writes:

> SVN-4588 "Item index too large in revision" is marked as Unresolved, but
> from my reading of all the comments, it seems the root cause of the problem
> was not restarting httpd after dumping and loading, leaving stale
> information in cache.
>
> The last comment mentions a similar report on Stack Overflow. That one
> magically fixed itself by rebooting. See:
> https://stackoverflow.com/questions/33904618/how-to-increase-l2p-page-size/33938501
>
> So... is there anything else to do for this one?

There is an option of making the filesystem instance IDs part of the cache
keys.  Doing so would fix the case described in SVN-4588 and allow other
operations that happen through the command-line tools, such as "svnadmin
hotcopy", to work properly without a server restart.

See https://lists.apache.org/thread.html/6db0186496a46c9823a752b377e369eb4361c8a42e62fc7e601453c6%401440460036%40%3Cdev.subversion.apache.org%3E
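As a rough illustration of the idea (Python, made-up names): by mixing a per-filesystem-instance ID into every cache key, entries created before a dump/load or hotcopy simply stop matching instead of serving stale data.

```python
import uuid

class FsInstance:
    def __init__(self):
        # Regenerated whenever the on-disk filesystem is replaced
        # (dump/load, hotcopy, ...), unlike the repository UUID,
        # which survives such operations.
        self.instance_id = uuid.uuid4().hex

cache = {}

def cache_get(fs, revision, item):
    # The instance ID is part of the key, so a reloaded repository
    # never hits entries cached for the old filesystem instance.
    return cache.get((fs.instance_id, revision, item))

def cache_put(fs, revision, item, value):
    cache[(fs.instance_id, revision, item)] = value
```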


Thanks,
Evgeny Kotkov


Re: svn st shows missing for root folder

2019-11-19 Thread Evgeny Kotkov
Stefan Kueng  writes:

> Using svn 1.13.0 on Windows:
>
> SUBST G:\ D:\Development\TestWC
> G:
> svn st .
>
> !   .
> ?   Foo
> ?   src\nonversioned
>
> As you can see, the root folder doesn't show the correct status.
> This showed the correct status with 1.12.x.

On my side, this issue *does not* reproduce with Subversion 1.13.0 built
against APR 1.6.x, but reproduces with the TortoiseSVN 1.13.1 command-line
binaries.

I see that TortoiseSVN 1.13 has recently switched to APR 1.7.0, so that may
be causing the issue:

  https://osdn.net/projects/tortoisesvn/scm/svn/commits/28674

Among the changes in APR 1.7.0, this one could be responsible for the faulty
behavior, since it changes how apr_stat() works for reparse points:

  https://svn.apache.org/r1855950


Regards,
Evgeny Kotkov


Re: --normalize-probs doesn't do its thing

2019-10-09 Thread Evgeny Kotkov
Evgeny Kotkov  writes:

> Apparently, there is a bug in the implementation of --normalize-props
> that only makes it automatically fix line endings in the values of the
> revision properties, but not node properties (as per the code around
> load-fs-vtable.c: set_node_property()).

I think that this issue should be fixed by https://svn.apache.org/r1868203

(Will try to nominate this fix to the stable branches, once I get some spare
time for that.)


Regards,
Evgeny Kotkov


Re: svn-crash-log20190911192859

2019-09-14 Thread Evgeny Kotkov
Johan Corveleyn  writes:

> I'm not sure what the reason could be. Perhaps the dumpfile itself is
> huge, or has something special ... Or maybe the memory footprint is
> normal, but the machine you're running it on is low on available
> memory.

Apparently, our svn_stream_readline() implementation may cause excessive
memory allocations if the input contains nul bytes.  While I can't say for
sure, this could be the likely cause of the reported problem.  For example,
the dump file could have been resaved in UTF-16, or something like that.

I committed r1866950 that should fix this specific problem:
  https://svn.apache.org/r1866950

The other possible, although less likely reason for this problem is that
the dump file itself is a large binary file without any \n characters.
I committed r1866951 that should make the parser more resilient to
such files:
  https://svn.apache.org/r1866951

A full solution for the latter part, I think, would require introducing and
implementing something like svn_stream_readline2() with a `limit` parameter,
but this change should at least ensure that the parser doesn't choke in the
described case with a binary file and other similar situations.
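Such a limited readline could look roughly like this (a Python sketch; svn_stream_readline2() and its exact semantics are hypothetical):

```python
def readline_limited(stream, limit, eol=b"\n"):
    """Read up to and including EOL, but never buffer more than LIMIT
    bytes.  Returns (line, hit_eof, hit_limit) so the caller can fail
    gracefully instead of growing the buffer without bound on input
    that contains no EOL (e.g. nul-ridden or binary data)."""
    buf = bytearray()
    while len(buf) < limit:
        ch = stream.read(1)
        if not ch:
            return bytes(buf), True, False   # end of stream
        buf += ch
        if buf.endswith(eol):
            return bytes(buf), False, False  # complete line
    return bytes(buf), False, True           # no EOL within LIMIT bytes
```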

I will try to nominate both of these fixes to our stable branches, once I
have some time for that.


Thanks,
Evgeny Kotkov


Re: svn commit: r1865522 - /subversion/trunk/subversion/libsvn_subr/sqlite3wrapper.c

2019-08-21 Thread Evgeny Kotkov
Daniel Shahaf  writes:

> Would it break anything if some application that uses libsvn had
> its own copy of SQLite that was compiled with another
> SQLITE_DEFAULT_MEMSTATUS value?  I don't see anything about that
> in the linked page.

I don't think so, because our amalgamated SQLite has all its symbols marked
as static and is not exposed to anyone else.

Also, this compile option only changes the default value of the corresponding
runtime option (SQLITE_CONFIG_MEMSTATUS), meaning that it can still be
changed later with sqlite3_config().


Thanks,
Evgeny Kotkov


Re: svn commit: r1865523 - /subversion/trunk/subversion/libsvn_subr/sqlite3wrapper.c

2019-08-21 Thread Evgeny Kotkov
Daniel Shahaf  writes:

> Quoting https://sqlite.org/compile.html#_options_to_omit_features:
>
>  Important Note: The SQLITE_OMIT_* options may not work with
>  the amalgamation. SQLITE_OMIT_* compile-time options usually
>  work correctly only when SQLite is built from canonical
>  source files.
>
> and a bit later:
>
>  Also, not all SQLITE_OMIT_* options are tested. Some
>  SQLITE_OMIT_* options might cause SQLite to malfunction and/or
>  provide incorrect answers.
>
> I think we should:
>
> - Revert r1865523

I see that this option has been supported since its introduction in 2010, and
that it appears to be covered in tests.  Also, since it only affects the
journaling mode, apparently it does not participate in the parser/keyword
generation, which is listed as the reason for not using it with the amalgamation:

  https://github.com/sqlite/sqlite/blob/cc3f3d1f055/src/parse.y
  https://github.com/sqlite/sqlite/blob/cc3f3d1f055/tool/mkkeywordhash.c
  https://github.com/sqlite/sqlite/blob/cc3f3d1f055/test/attach4.test#L82
  https://github.com/sqlite/sqlite/commit/bc6b8d73592ad72e674b10152012170a01c31c62


Thanks,
Evgeny Kotkov


Re: svn commit: r1863018 - /subversion/trunk/subversion/libsvn_subr/win32_crypto.c

2019-07-14 Thread Evgeny Kotkov
Branko Čibej  writes:

> Uh. I don't think so.
>
> First of all, the reference to Chromium source is a red herring ... that
> code disables CRL/OCSP checks *if some caller required it to*. You'll
> find that browsers do, indeed, check CRL or OCSP info, if it's available.

I went through an additional round of fact-checking to ensure that Chromium
browsers have never been doing online revocation checks by default.

Back in 2014, this could be controlled by a user preference (disabled by
default), but since then the UI option was removed and the current state
is that such checks only operate from cache:

  https://codereview.chromium.org/250583004/diff/20001/chrome/browser/net/ssl_config_service_manager_pref.cc
  https://chromium.googlesource.com/chromium/src/+/refs/tags/77.0.3852.2/chrome/browser/ssl/ssl_config_service_manager_pref.cc#243

On Windows, this can be additionally verified by examining the CryptoAPI
log where I can see the certificate chains being constructed with the
CERT_CHAIN_REVOCATION_CHECK_CACHE_ONLY flag.

> Your change disables all online verification, regardless, and that seems
> quite wrong.

This change affects the Win32 callback used to *override specific* certificate
validation failures, namely SVN_AUTH_SSL_UNKNOWNCA.  So in the current
design, the revocation checks are only supplemental, but not authoritative.
As an example, if the Serf/OpenSSL layer thinks that the certificate is OK
by itself (without performing a revocation check), this callback is not going
to be called at all.

Since all infrastructure-related failures, stalls and timeouts during online
revocation checks appear to be fairly common, combined, this means that we
are putting the users through all these potential problems, but in the end
the whole check is still non-exhaustive.  Perhaps the actual result of
that is just adding to the perception that Subversion is slow in general.

With that in mind, I tend to think that it would be much better to just stick
to the locally available data for this particular callback, that is currently
used to only resolve a subset of certificate validation failures.

> So ... -1, please revert

Will certainly do, unless this new data alters your opinion on the topic.


Thanks,
Evgeny Kotkov


Re: API to get revision size on disk

2019-04-12 Thread Evgeny Kotkov
Julian Foad  writes:

> I have reviewed and tested it. Can you commit?

Fixed the typo and committed in https://svn.apache.org/r1857435, thanks!

> * We could keep the existing private API declarations in
> include/private/svn_fs_fs_private.h as well as the new ioctl versions of
> them, rather than moving the existing declarations to a local header. It
> would reduce the patch size, and some people might like the ability to
> link to libsvn_fs_fs directly and and call those functions directly. But
> I don't feel strongly about this; either way is fine for me.

For now I chose to move the declarations to libsvn_fs_fs/fs_fs.h, to avoid
having two parallel interfaces and to only export the version that works
through the FS loader.


Thanks,
Evgeny Kotkov


Re: API to get revision size on disk

2019-04-09 Thread Evgeny Kotkov
Julian Foad  writes:

> Similar consideration applies to the API. I think this could be a libsvn_fs
> API rather than only in libsvn_fs_fs. Agreed?

Maybe, as an alternative to abstracting and promoting this to a public FS API,
we could add a way of performing backend-specific I/O operations through an
API like svn_fs_ioctl()?

See the attached patch that implements it for existing fsfs stats, dump index
and load index operations, and, as I suppose, may also be extended for the
"get revision size/stats" operation.

This concept and the patch serve two purposes:

- They allow us to properly expose FS-specific details and invoke FS-specific
  operations everywhere without necessarily promoting them into a proper public
  API (the ioctl code itself may be made either public or private, depending on
  the requirements).

- They solve a potential dependency/linking problem where tools like `svnfsfs`
  work through the libsvn_fs loader, but also have to load and call private
  APIs from libsvn_fs_fs, thus ignoring the loader.  The latter part may
  potentially cause issues with the global shared state, etc.  With the
  patch, all such operations always go through the FS loader.
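In rough terms, the dispatch could look like this (a Python sketch with made-up names; the actual svn_fs_ioctl() in the patch works with backend-scoped ioctl codes):

```python
class IoctlCode:
    """An ioctl code scoped to a particular FS backend."""
    def __init__(self, fs_type, number):
        self.fs_type = fs_type  # e.g. "fsfs"
        self.number = number

def fs_ioctl(fs, code, input_baton):
    # The loader rejects codes belonging to a different backend, then
    # forwards to the backend's own ioctl entry point -- so callers
    # like `svnfsfs` never bypass the loader to reach libsvn_fs_fs.
    if code.fs_type != fs.fs_type:
        raise ValueError("ioctl code not supported by this filesystem")
    return fs.vtable_ioctl(code, input_baton)
```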


Regards,
Evgeny Kotkov
Index: build.conf
===
--- build.conf  (revision 1857216)
+++ build.conf  (working copy)
@@ -442,7 +442,7 @@ description = Subversion FSFS Repository Manipulat
 type = exe
 path = subversion/svnfsfs
 install = bin
-libs = libsvn_repos libsvn_fs libsvn_fs_fs libsvn_delta libsvn_subr apriconv apr
+libs = libsvn_repos libsvn_fs libsvn_delta libsvn_subr apriconv apr
 
 # 
 #
@@ -831,7 +831,7 @@ type = exe
 path = subversion/tests/libsvn_fs_fs
 sources = fs-fs-private-test.c
 install = test
-libs = libsvn_test libsvn_fs libsvn_fs_fs libsvn_delta
+libs = libsvn_test libsvn_fs libsvn_delta libsvn_repos libsvn_subr apriconv apr
 msvc-force-static = yes
 
Index: subversion/include/private/svn_fs_fs_private.h
===
--- subversion/include/private/svn_fs_fs_private.h  (revision 1857216)
+++ subversion/include/private/svn_fs_fs_private.h  (working copy)
@@ -255,22 +255,6 @@ typedef struct svn_fs_fs__stats_t
   apr_hash_t *by_extension;
 } svn_fs_fs__stats_t;
 
-
-/* Scan all contents of the repository FS and return statistics in *STATS,
- * allocated in RESULT_POOL.  Report progress through PROGRESS_FUNC with
- * PROGRESS_BATON, if PROGRESS_FUNC is not NULL.
- * Use SCRATCH_POOL for temporary allocations.
- */
-svn_error_t *
-svn_fs_fs__get_stats(svn_fs_fs__stats_t **stats,
- svn_fs_t *fs,
- svn_fs_progress_notify_func_t progress_func,
- void *progress_baton,
- svn_cancel_func_t cancel_func,
- void *cancel_baton,
- apr_pool_t *result_pool,
- apr_pool_t *scratch_pool);
-
 /* A node-revision ID in FSFS consists of 3 sub-IDs ("parts") that consist
  * of a creation REVISION number and some revision- / transaction-local
  * counter value (NUMBER).  Old-style ID parts use global counter values.
@@ -325,34 +309,37 @@ typedef svn_error_t *
 void *baton,
 apr_pool_t *scratch_pool);
 
-/* Read the P2L index for the rev / pack file containing REVISION in FS.
- * For each index entry, invoke CALLBACK_FUNC with CALLBACK_BATON.
- * If not NULL, call CANCEL_FUNC with CANCEL_BATON from time to time.
- * Use SCRATCH_POOL for temporary allocations.
- */
-svn_error_t *
-svn_fs_fs__dump_index(svn_fs_t *fs,
-  svn_revnum_t revision,
-  svn_fs_fs__dump_index_func_t callback_func,
-  void *callback_baton,
-  svn_cancel_func_t cancel_func,
-  void *cancel_baton,
-  apr_pool_t *scratch_pool);
+typedef struct svn_fs_fs__ioctl_get_stats_input_t
+{
+  svn_fs_progress_notify_func_t progress_func;
+  void *progress_baton;
+} svn_fs_fs__ioctl_get_stats_input_t;
 
+typedef struct svn_fs_fs__ioctl_get_stats_output_t
+{
+  svn_fs_fs__stats_t *stats;
+} svn_fs_fs__ioctl_get_stats_output_t;
 
-/* Rewrite the respective index information of the rev / pack file in FS
- * containing REVISION and use the svn_fs_fs__p2l_entry_t * array ENTRIES
- * as the new index contents.  Allocate temporaries from SCRATCH_POOL.
- *
- * Note that this becomes a no-op if ENTRIES is empty.  You may use a zero-
- * sized empty entry instead.
- */
-svn_error_t *
-svn_fs_fs__load_index(svn_fs_t *fs,
-  svn_revnum_t revision,
-  apr_array_header_t *entries,
-  apr_pool_t *scratch_pool);
+SVN_FS_DECLARE_IOCTL_CODE(SVN_FS_FS__IOCTL_GET_STATS, SVN_FS_TYPE_FSFS

Re: --normalize-probs doesn't do its thing

2019-03-26 Thread Evgeny Kotkov
Thorsten Goetzke  writes:

> uls/trunk/uls/TestAutomation/UMS_Testmanagement/02_Implementation/java/TestManagementUI
> ...svnadmin: E125005: A property with invalid line ending found in
> dumpstream; consider using --normalize-props while loading.
>
> Are there some catches on how to use --normalize-probs?

Apparently, there is a bug in the implementation of --normalize-props
that only makes it automatically fix line endings in the values of the
revision properties, but not node properties (as per the code around
load-fs-vtable.c: set_node_property()).

I'll put fixing that on my TODO list.


Regards,
Evgeny Kotkov


Re: svn commit: r1854072 - in /subversion/trunk/subversion: libsvn_subr/io.c tests/libsvn_subr/io-test.c

2019-02-21 Thread Evgeny Kotkov
Branko Čibej  writes:

> @@ -4493,6 +4571,12 @@ svn_io_dir_remove_nonrecursive(const cha
>
>SVN_ERR(cstring_from_utf8(&dirname_apr, dirname, pool));
>
> +  /* On Windows, a read-only directory cannot be removed. */
> +#if defined(WIN32) || defined(__OS2__)
> +  SVN_ERR(io_set_readonly_flag(dirname_apr, dirname,
> +   FALSE, FALSE, FALSE, pool));
> +#endif
> +
>status = apr_dir_remove(dirname_apr, pool);

Would it be feasible for us to only attempt to remove the read-only
attribute after receiving an error? (along the lines of the attached patch)

The reason is that getting and setting file attributes usually results in
CreateFile() syscalls, and opening files and other I/O-related syscalls are
not cheap on Win32 in general.  So changing the directory attributes on every
remove could increase the amount of I/O operations and maybe slow down the
cases where we have to remove multiple directories, with the post-commit
transaction directory cleanup being an example of such a case.


Thanks,
Evgeny Kotkov
Index: subversion/libsvn_subr/io.c
===
--- subversion/libsvn_subr/io.c (revision 1854090)
+++ subversion/libsvn_subr/io.c (working copy)
@@ -4571,12 +4571,6 @@ svn_io_dir_remove_nonrecursive(const char *dirname
 
  SVN_ERR(cstring_from_utf8(&dirname_apr, dirname, pool));
 
-  /* On Windows, a read-only directory cannot be removed. */
-#if defined(WIN32) || defined(__OS2__)
-  SVN_ERR(io_set_readonly_flag(dirname_apr, dirname,
-   FALSE, FALSE, FALSE, pool));
-#endif
-
   status = apr_dir_remove(dirname_apr, pool);
 
 #ifdef WIN32
@@ -4583,6 +4577,16 @@ svn_io_dir_remove_nonrecursive(const char *dirname
   {
 svn_boolean_t retry = TRUE;
 
+if (APR_STATUS_IS_EACCES(status) || APR_STATUS_IS_EEXIST(status))
+  {
+/* Set the destination directory writable because Windows will not
+   allow us to delete when path is read-only. */
+SVN_ERR(io_set_readonly_flag(dirname_apr, dirname,
+ FALSE, FALSE, TRUE, pool));
+
+status = apr_dir_remove(dirname_apr, pool);
+  }
+
 if (status == APR_FROM_OS_ERROR(ERROR_DIR_NOT_EMPTY))
 {
 apr_status_t empty_status = dir_is_empty(dirname_apr, pool);


Re: [PATCH] A test for "Can't get entries" error

2018-11-30 Thread Evgeny Kotkov
Daniel Shahaf  writes:

> > I'm glad you included a FS-level test for it. In case anyone is
> > interested, I attach a patch that backports the initial RA-level test to
> > 1.8. 1.9 and 1.10.
>
> Any reason not to commit that one, too?

Personally, I think that having another test on the RA layer for this case
is unnecessary and would just increase the maintenance costs — since this
is an FS layer bug and given that we already have the (more or less) minimal
regression test in terms of the amount of the FS API calls.


Regards,
Evgeny Kotkov


Re: svn commit: r1847572 - in /subversion/trunk/subversion: libsvn_fs_fs/tree.c tests/libsvn_fs/fs-test.c

2018-11-27 Thread Evgeny Kotkov
Branko Čibej  writes:

> > +  else
> > +return err_not_directory(root, directory, pool);
> > +}
> >  }
> >  }
> >
>
> Would be nice to wrap this return value in svn_error_trace(), as that
> makes it easier to find where the error came from.

Makes sense, committed in https://svn.apache.org/r1847596


Thanks,
Evgeny Kotkov


Re: [PATCH] A test for "Can't get entries" error

2018-11-27 Thread Evgeny Kotkov
Dmitry Pavlenko  writes:

> I've run into an error: when performing 2 specially constructed updates one
> after another within the same session, SVN fails with
>
>   $ ./ra-test 15
>   svn_tests: E160016: Can't get entries of non-directory
>   XFAIL: ra-test 15: check that there's no "Can't get entries" error
>
> error. I believe these updates constructed that way are valid, so the problem
> is somewhere in FSFS code. It's also interesting that if these updates are
> run separately (e.g. by adding "if (FALSE)" to one or another), they succeed.

Apparently, this behavior is caused by a problem in the specific optimization
within the FSFS open_path() routine.

I constructed an FS regression test and committed it and the fix in:
   https://svn.apache.org/r1847572

(I'll try to nominate it for backport once I get some time.)


Thanks,
Evgeny Kotkov


[PATCH] Proof of concept of the better-pristines (LZ4 + storing small pristines as BLOBs) (Was: Re: svn commit: r1843076)

2018-10-22 Thread Evgeny Kotkov
Branko Čibej  writes:

> Still missing is a mechanism for the libsvn_wc (and possibly
> libsvn_client) to determine the capabilities of the working copy at
> runtime (this will be needed for deciding whether to use compressed
> pristines).

FWIW, I tried the idea of using LZ4 to compress the pristines and storing small
pristines as blobs in the `PRISTINE` table.  I was particularly interested in
how such change would affect the performance and what kind of obstacles
would have to be dealt with.

In the attachment you will find a more or less functional implementation of
this idea that might be useful to some extent.  The patch is a proof of
concept: it doesn't include the WC compatibility bits and most certainly
doesn't have everything necessary in place.  But in the meanwhile, I think
that is might give a good approximation of what can be expected from the
approach.

The patch applies to the `better-pristines` branch.

A couple of observations:

 - As expected, the combined size of the pristines is halved when the data
   itself is compressible, thus making the working copy 25% smaller.

 - A variety of the callers currently access the pristine contents by reading
   the corresponding files.  That doesn't work in case of compressed pristines
   or pristines stored as BLOBs.

   I think that ideally we would want to use streams as much as possible, and
   only spill the uncompressed pristine contents to temporary files when we
   need to pass them to external tools, etc.; and those temporary files need
   to be backed by a work queue to avoid leaving them in place in case of an
   application crash.

   The patch does that kind of plumbing to some extent, but that part of the
   work is not complete.  The starting point is around wc_db_pristine.c:
   svn_wc__db_pristine_get_path().

 - Using BLOBs to store the pristine contents didn't have a measurable impact
   on the speed of the WC operations such as checkout in my experiments on
   Windows.  These experiments were not comprehensive, and also I didn't run
   the tests on *nix.

 - There's also the deprecated svn_wc_get_pristine_copy_path() public API that
   would require plumbing to maintain compatibility; the patch performs it by
   spilling the pristine contents result into a temporary file whose lifetime
   is attached to the `result_pool`.
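To make the BLOB idea above concrete, here is a rough, hypothetical sketch of the storage logic in Python, with zlib standing in for LZ4 (Python's standard library has no LZ4 binding) and with a made-up schema and size threshold rather than the ones on the branch:

```python
import sqlite3
import zlib

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE pristine (
        checksum TEXT PRIMARY KEY,  -- e.g. a SHA-1 digest
        size     INTEGER,           -- uncompressed size
        content  BLOB               -- NULL: content lives in a file on disk
    )""")

SMALL_LIMIT = 16 * 1024  # arbitrary threshold for inlining as a BLOB

def store_pristine(checksum, data):
    if len(data) <= SMALL_LIMIT:
        # Small pristine: compress and inline it into the table.
        db.execute("INSERT INTO pristine VALUES (?, ?, ?)",
                   (checksum, len(data), zlib.compress(data)))
    else:
        # Large pristine: record metadata only; a real implementation
        # would stream the compressed contents to the on-disk store here.
        db.execute("INSERT INTO pristine VALUES (?, ?, NULL)",
                   (checksum, len(data)))

def read_pristine(checksum):
    size, blob = db.execute(
        "SELECT size, content FROM pristine WHERE checksum = ?",
        (checksum,)).fetchone()
    return None if blob is None else zlib.decompress(blob)

data = b"some compressible pristine text\n" * 200
store_pristine("$sha1$dummy", data)
assert read_pristine("$sha1$dummy") == data
```

The point of the sketch is the shape of the API: callers get contents back through a read call (a stream, in the real C code) rather than by opening a pristine file directly, which is exactly the plumbing change described above.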

 (I probably won't be able to continue the work on this patch in the near
 future; posting this in case it might be useful.)


Thanks,
Evgeny Kotkov
Index: notes/wc-ng/pristine-store
===
--- notes/wc-ng/pristine-store  (revision 1844566)
+++ notes/wc-ng/pristine-store  (working copy)
@@ -1,6 +1,7 @@
 This spec defines how the Pristine Store works (part A) and how the WC
 uses it (part B).
 
+TODO: Update
 
 A. THE PRISTINE STORE
 =
Index: subversion/include/private/svn_io_private.h
===
--- subversion/include/private/svn_io_private.h (revision 1844566)
+++ subversion/include/private/svn_io_private.h (working copy)
@@ -119,14 +119,6 @@ svn_error_t *
 svn_stream__install_delete(svn_stream_t *install_stream,
apr_pool_t *scratch_pool);
 
-/* Optimized apr_file_stat / apr_file_info_get operating on a closed
-   install stream */
-svn_error_t *
-svn_stream__install_get_info(apr_finfo_t *finfo,
- svn_stream_t *install_stream,
- apr_int32_t wanted,
- apr_pool_t *scratch_pool);
-
 /* Internal version of svn_stream_from_aprfile2() supporting the
additional TRUNCATE_ON_SEEK argument. */
 svn_stream_t *
Index: subversion/include/private/svn_subr_private.h
===
--- subversion/include/private/svn_subr_private.h   (revision 1844566)
+++ subversion/include/private/svn_subr_private.h   (working copy)
@@ -591,6 +591,12 @@ svn__decompress_lz4(const void *data, apr_size_t l
 svn_stringbuf_t *out,
 apr_size_t limit);
 
+svn_stream_t *
+svn_stream__lz4_compress(svn_stream_t *stream, apr_pool_t *pool);
+
+svn_stream_t *
+svn_stream__lz4_decompress(svn_stream_t *stream, apr_pool_t *pool);
+
 /** @} */
 
 /**
Index: subversion/libsvn_subr/compress_lz4.c
===
--- subversion/libsvn_subr/compress_lz4.c   (revision 1844566)
+++ subversion/libsvn_subr/compress_lz4.c   (working copy)
@@ -23,9 +23,14 @@
 
 #include 
 
+#define APR_WANT_BYTEFUNC
+#include <apr_want.h>
+
+#include "private/svn_string_private.h"
 #include "private/svn_subr_private.h"
 
 #include "svn_private_config.h"
+#include "svn_pools.h"
 
 #ifdef SVN_INTERNAL_LZ4
 #include "lz4/lz4internal.h"
@@ -142,3 +147,322 @@ svn_lz4__runtime_version(void)
 {
   return LZ4_versionNumber();
 }
+

Re: Possible bug in the tree conflict resolver: "Accept incoming deletion" option doing nothing for a locally deleted file

2018-04-25 Thread Evgeny Kotkov
Stefan Sperling <s...@apache.org> writes:

>> Should I file an issue or add a regression test for this problem?
>
> Yes please.
>
> Thanks!

Issue: https://issues.apache.org/jira/browse/SVN-4739

Regression test committed in https://svn.apache.org/r1830083


Thanks,
Evgeny Kotkov


Possible bug in the tree conflict resolver: "Accept incoming deletion" option doing nothing for a locally deleted file

2018-04-24 Thread Evgeny Kotkov
Hi everyone,

I think that I might have stumbled across an issue with the new tree conflict
resolver in case of a delete-vs-delete conflict.  Below is a simple PowerShell-
based reproduction script that adds a file (r1), removes it (r2), updates to
the previous revision, removes the file in the working copy (svn rm), and
attempts an update to r2:

rm test -Recurse -Force -ErrorAction Ignore
mkdir test
svnadmin create test/repo
$url = 'file:///{0}/test/repo' -f (pwd) -replace '\\', '/'
svn co $url test/wc
echo content > test/wc/file.txt
svn add test/wc/file.txt
svn ci test/wc -m"r1"
svn rm test/wc/file.txt
svn ci test/wc -m"r2"
svn up test/wc -r1
svn rm test/wc/file.txt
svn up test/wc
svn st test/wc

In this case, the tree conflict resolver offers the "Accept incoming deletion"
option for file.txt, but choosing it does nothing, and leaves the tree conflict
unresolved, which I think is unexpected:

> svn up

Updating 'test\wc':
   C test\wc\file.txt
At revision 2.
Summary of conflicts:
  Tree conflicts: 1
Searching tree conflict details for 'test\wc\file.txt' in repository:
Checking r2... done
Tree conflict on 'test\wc\file.txt':
File updated from r1 to r2 was deleted by Evgeny.Kotkov in r2.
A deleted file was found in the working copy.
Select: (p) Postpone, (r) Mark as resolved, (a) Accept incoming deletion,
(h) Help, (q) Quit resolution: a
Summary of conflicts:
  Tree conflicts: 1

> svn st

! C test\wc\file.txt
  >   local file delete, incoming file delete or move upon update
Summary of conflicts:
  Tree conflicts: 1

Should I file an issue or add a regression test for this problem?


Thanks,
Evgeny Kotkov


Re: svnadmin load error with 1.10 that seems to be fsfs cache related

2018-03-15 Thread Evgeny Kotkov
Philip Martin <phi...@codematters.co.uk> writes:

> That works as expected, but vary the cache size of the load process and
> it fails.  The load succeeds with -M64 and smaller but fails with -M65
> and larger:

[...]

Maybe this behavior could be related to the cache size threshold in svnadmin
that enables the block read feature in fsfs (currently set to 64 MB, as per
svnadmin.c:BLOCK_READ_CACHE_THRESHOLD).


Regards,
Evgeny Kotkov


Re: Rolling 1.10.0-rc2

2018-03-14 Thread Evgeny Kotkov
Julian Foad <julianf...@apache.org> writes:

> Please review the 'STATUS' file and report here if there's anything else I
> should take into account.

I have found a bug (my fault) in the new svn_txdelta_to_svndiff_stream()
API that can result in failing commits over http://.

Should be fixed by https://svn.apache.org/r1826747 and proposed
for backport to 1.10.x.


Thanks,
Evgeny Kotkov


Re: 1.10 release notes: protocol changes, LZ4, etc.

2018-03-05 Thread Evgeny Kotkov
Julian Foad <julianf...@apache.org> writes:

>> I incorporated this tweak and staged the changes in r1825865.
>
> Lovely. I merged it all (includes some other people's changes) to 'publish'
> in r1825869.

Thanks!

> To merge from staging to publish I used "svn merge
> ^/subversion/site/staging/docs publish/docs" to keep the "root" of the merge
> as high as possible while omitting the longer-term staged changes in
> 'staging/quick-start.html.
>
> I encourage you to commit follow-ups straight to 'publish' rather than
> 'staging' unless you want (yourself or others) to review it. Up to you, just
> trying to remove friction.

Personally, I prefer always committing to staging first, as I think that
it helps to avoid unwanted mistakes which can otherwise slip through to
the live site.  Assuming no objections or tweak suggestions, I'll merge
the most recent relnotes follow-ups to 'publish' somewhere later today or
tomorrow.


Thanks,
Evgeny Kotkov


Re: 1.10 release notes: protocol changes, LZ4, etc.

2018-03-05 Thread Evgeny Kotkov
Julian Foad <julianf...@apache.org> writes:

> +1.
>
> (Nit: English idiom is "[is still] subject to".)

I incorporated this tweak and staged the changes in r1825865.


Thanks,
Evgeny Kotkov


Re: Potential regression: high server-side memory consumption during import (was: Subversion 1.10 RC1?)

2018-03-02 Thread Evgeny Kotkov
Stefan Sperling <s...@elego.de> writes:

> I'd rather ship 1.10.0 at the prospected release date followed closely
> by 1.10.1 to fix bugs such as these, than delay general access to 1.10.0
> even further.

While I do not have significant objections against such plan, I find the
idea of shipping a performance feature that causes a massive slowdown
instead of an improvement somewhat controversial.  (In other words,
I am -0 for that.)

> You may not have realized this, but I have been waiting for 1.10.0 to
> happen for *over a year* https://svn.haxx.se/dev/archive-2017-01/0043.shtml
> For all this time, I have wanted the conflict resolver to get real world
> exposure because I need feedback from users out there to improve it.
> I kept mostly quiet because I didn't want to push too hard for this
> release all by myself because of the relatively high share of burden
> this would imply. So I waited for activity from the community to make
> it happen as a true collective effort.

Not too sure about how this is connected to the soak period and to the
release process — speaking of which, I would say that your e-mail may
discourage people from reporting issues during the soak period.

> If this one bug really bothers you enough to hold the planned release back
> it makes me wonder why you didn't push for a fix much earlier. We have had
> plenty of time.

I haven't been and am not pushing for a fix.  Rather than that, I have just
included the additional information about the problem with a comment that
it might be viable to look into before the GA release.

Moreover, I reported the issue at the very moment I found it with an edge-case
reproduction.  Once I was asked to bisect for a specific revision, I should
have probably stated that I won't have time to do that.  But I have been
thinking that I would be able to find some.  When I stumbled across it again,
I found the revision and the simple reproduction — but as it seems, this
hasn't been the most appropriate time for including these new details.

Putting all that aside, I wouldn't say that it is productive to discuss issues
in such way.  In my opinion, we should be doing it the other way around by
actually encouraging reports of various problems during the soak period.


Regards,
Evgeny Kotkov


Re: Potential regression: high server-side memory consumption during import (was: Subversion 1.10 RC1?)

2018-03-02 Thread Evgeny Kotkov
Stefan Fuhrmann <stef...@apache.org> writes:

> Would it be possible for you to bisect this to find the offending revision?
> My random guess would be that in the context of mod_dav_svn, we might use
> an unsuitable pool for authz caching.

While looking through the various 1.10-related topics, I remembered about
this issue.  I have been able to narrow it down in my environment to
https://svn.apache.org/r1778923 (Ensure that even long-lived DAV
connections use up-to-date authz.)

Perhaps, a simpler reproduction script would be to issue an 'svn log' for
a medium-sized repository.  In my environment, doing so for the trunk of
TortoiseSVN's repository with 25,000 revisions causes the httpd process
to consume up to a 1 GB of RAM while processing the request.  Overall,
the log takes around 11 seconds instead of 2, compared to 1.9.7.

Unless I am missing something, this might be worth considering before the
1.10 GA release.


Thanks,
Evgeny Kotkov


Re: 1.10 release notes: protocol changes, LZ4, etc.

2018-03-01 Thread Evgeny Kotkov
Julian Foad <julianf...@apache.org> writes:

>> - On the server-side, SVNCompressionLevel 0 can be used to disable
>> compression altogether.  The special value of SVNCompressionLevel 1 forces
>> the use of LZ4 compression for clients that support it.  All other values
>> result in negotiating the use of zlib compression with the respective
>> compression level, unless the compression is disabled on the client.
>
> There's a small disagreement with the table there: all other values result
> in negotiating either zlib or LZ4.

Indeed.  Perhaps, we could rephrase it like this?

  [...] All other values result in preferring zlib compression with the
  respective compression level.  Note that the negotiated algorithm is still
  a subject to the client's configuration.  For example, even if the server
  is configured to prefer zlib compression over LZ4, a client may still
  negotiate the use of LZ4 compression when its http-compression option
  is set to auto.

>> - On the client-side, setting http-compression to either yes or no will
>> disable or enable compression that is then negotiated based on the
>> server's
>> configuration.  The default value of auto will result in preferring LZ4
>> compression over low latency networks and zlib compression otherwise.
>
> Can we link to the reference documentation for the client and server
> options, so the reader (me) can cross-check the definition and look for
> related settings?

At the time being, the documentation for these options is quite scarce and
doesn't dive into the details of which compression algorithm is negotiated
behind the scenes.  The reasoning behind this was to avoid overwhelming
the users with technical and implementation details and to have a ground
for future extension by not promising too much.  This makes me think that
linking to these docs from the release notes probably wouldn't be too useful:

  
https://svn.apache.org/viewvc/subversion/trunk/subversion/mod_dav_svn/mod_dav_svn.c?view=markup=1820778#l1392
  
https://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/config_file.c?view=markup=1820778#l1151

For the purpose of cross-checking the description, the following locations in
mod_dav_svn and ra_serf may prove useful:

  mod_dav_svn/repos.c:negotiate_encoding_prefs()
  libsvn_ra_serf/util.c:svn_ra_serf__setup_svndiff_accept_encoding()
  libsvn_ra_serf/commit.c:negotiate_put_encoding()


Thanks,
Evgeny Kotkov


Re: 1.10 release notes: protocol changes, LZ4, etc.

2018-03-01 Thread Evgeny Kotkov
Julian Foad <julianf...@apache.org> writes:

> I added a skeleton "protocol changes" section to the release notes:
>
>   http://subversion.a.o.local/docs/release-notes/1.10.html#protocol-changes
>
> Here is a starting point. Can anyone help fill in the details?

[...]

> Thoughts? Help?

Maybe we could start with adding a subsection that describes using
LZ4 for http:// and svn:// to the users.  If needed, we could then include
the technical details about the negotiation into the "Client-Server Protocol
Changes" and cross-link these two sections.

For the http:// part, we could probably do something along the lines of
the attached patch.  Here is also the same content as plain text for
convenience:

[[[
LZ4 compression over the wire in http:// and svn:// connections

Deltas transferred between Subversion 1.10 clients and servers may be
compressed with LZ4.  The actual choice of the compression algorithm depends
on the used protocol, environment and its configuration — see below.

For http:// protocol, use of LZ4 compression depends on the values of
the server-side SVNCompressionLevel directive, client-side http-compression
configuration option and on the network capabilities.  LZ4 compression
generally offers much faster compression and decompression speeds, but
slightly worse compression ratio than zlib.  By default, it is only
preferred over low latency networks where the overhead associated with
transferring the additional amount of data is assumed to be negligible.

- On the server-side, SVNCompressionLevel 0 can be used to disable
compression altogether.  The special value of SVNCompressionLevel 1 forces
the use of LZ4 compression for clients that support it.  All other values
result in negotiating the use of zlib compression with the respective
compression level, unless the compression is disabled on the client.

- On the client-side, setting http-compression to either yes or no will
disable or enable compression that is then negotiated based on the server's
configuration.  The default value of auto will result in preferring LZ4
compression over low latency networks and zlib compression otherwise.

Below is the table explaining the used compression algorithm in each
combination of the client- and server-side configuration options:

[table]


]]]
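A small, purely illustrative model of that decision table (my reading of the rules above, not the actual ra_serf/mod_dav_svn code; both endpoints are assumed to be 1.10) could look like:

```python
def negotiated_compression(server_level, client_setting, low_latency=True):
    """server_level: the SVNCompressionLevel value (0..9);
    client_setting: the client's http-compression option
    ('auto', 'yes' or 'no')."""
    if server_level == 0 or client_setting == "no":
        return "none"  # either side may disable compression outright
    if server_level == 1:
        return "lz4"   # forced for clients that support it
    if client_setting == "auto":
        # Other levels prefer zlib, but an 'auto' client may still
        # negotiate LZ4 over a low-latency network.
        return "lz4" if low_latency else "zlib"
    return "zlib"      # client_setting == 'yes'

print(negotiated_compression(5, "auto", low_latency=False))  # zlib
print(negotiated_compression(1, "yes"))                      # lz4
```

This also captures the correction discussed earlier in the thread: for SVNCompressionLevel values other than 0 and 1, the result can be either zlib or LZ4, depending on the client's setting and the network.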


Thanks,
Evgeny Kotkov
Index: docs/release-notes/1.10.html
===
--- docs/release-notes/1.10.html(revision 1825639)
+++ docs/release-notes/1.10.html(working copy)
@@ -136,7 +136,7 @@ and what impact these changes may have.
 Use SVN 1.8 and above clients only for best results.
   
 
-  LZ4 compression over the wire in http:// connections
+  LZ4 compression over the wire in http:// 
and svn:// connections
 
 1.10
 1.10
@@ -526,7 +526,9 @@ containing large files.
 
 
 LZ4 compression is now used by default for the on-disk data in repositories
-with filesystem format 8 (see below).
+with filesystem format 8.  Also, Subversion 1.10 adds support for automatic
+negotiation and use of LZ4 compression over the wire for http:// and svn://
+connections when it is supported by both endpoints.
 
 Note: this does not apply to the pre-release
 https://lists.apache.org/thread.html/dd78432b301f3f95567930c242c46b308c8a017fd754c0f388245915@%3Cannounce.subversion.apache.org%3E;>Subversion
 1.10.0-alpha3
@@ -559,7 +561,7 @@ format number of a repository.)
 
 
 Configuring the repository to use LZ4 compression
-  
 
 
@@ -583,6 +585,116 @@ cycle into a new format 8 repository created with
 
   
 
+
+LZ4 compression over the wire in http:// and svn:// connections
+  
+
+
+Deltas transferred between Subversion 1.10 clients and servers may be
+compressed with LZ4. The actual choice of the compression algorithm depends
+on the used protocol, environment and its configuration  see below.
+
+For http:// protocol, use of LZ4 compression depends on the values
+of the server-side SVNCompressionLevel directive, client-side
+http-compression configuration option and on the network
+capabilities.  LZ4 compression generally offers much faster compression
+and decompression speeds, but slightly worse compression ratio than zlib.
+By default, it is only preferred for low latency networks where the
+overhead associated with transferring the additional amount of data is
+assumed to be negligible.
+
+
+  
+On the server-side, SVNCompressionLevel 0
+can be used to disable compression altogether.  The special value of
+SVNCompressionLevel 1 forces the use of LZ4 compression for
+clients that support it.  All other values result in negotiating the
+use of zlib compression with the respective compression level, unless
+the compression is disabled on the client.
+  
+  
+On the client-side, setting http-compression to
+either yes or no will disable or enable compression
+that is then negotiated based on the

Re: Problems with commit feature negotiation and write-through proxies

2018-01-10 Thread Evgeny Kotkov
Evgeny Kotkov <evgeny.kot...@visualsvn.com> writes:

>> Index: subversion/mod_dav_svn/version.c
>> ===
>> --- subversion/mod_dav_svn/version.c(revision 1820704)
>> +++ subversion/mod_dav_svn/version.c(working copy)
>> @@ -152,9 +152,6 @@ get_vsn_options(apr_pool_t *p, apr_text_header *ph
>>apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_INHERITED_PROPS);
>>apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_INLINE_PROPS);
>>apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_REVERSE_FILE_REVS);
>> -  apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_SVNDIFF1);
>> -  apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_SVNDIFF2);
>> -  apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_PUT_RESULT_CHECKSUM);
>>apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_LIST);
>>/* Mergeinfo is a special case: here we merely say that the server
>> * knows how to handle mergeinfo -- whether the repository does too
>> @@ -297,6 +294,19 @@ get_option(const dav_resource *resource,
>>  { "create-txn-with-props",  { 1, 8, 0, "" } },
>>};
>>
>> +  /* These capabilities are used during commit and when acting as
>> + a WebDAV slave (SVNMasterURI) their availablity depends on
>> + the master version (SVNMasterVersion) rather than our own
>> + (slave) version. */
>> +  struct capability_versions_t {
>> +const char *capability_name;
>> +svn_version_t min_version;
>> +  } capabilities[] = {
>> +{ SVN_DAV_NS_DAV_SVN_SVNDIFF1, { 1, 7, 0, ""} },
>> +{ SVN_DAV_NS_DAV_SVN_SVNDIFF2, { 1, 10, 0, ""} },
>> +{ SVN_DAV_NS_DAV_SVN_PUT_RESULT_CHECKSUM, { 1, 10, 0, ""} },
>> +  };
>
> I would be fine with this approach as well, but perhaps with a few tweaks:

I also notice that this patch only advertises the new capabilities if the
server is configured to enable HTTPv2.  This is probably unintended, as
I think that there is no reason to disable these PUT-related capabilities
(such as being able to parse svndiff1/svndiff2) if the server is configured
to prefer HTTPv1 protocol with SVNAdvertiseV2Protocol off.


Thanks,
Evgeny Kotkov


Re: Problems with commit feature negotiation and write-through proxies

2018-01-10 Thread Evgeny Kotkov
Philip Martin <phi...@codematters.co.uk> writes:

> We already have system for handling create-txn and create-txn-with-props
> so I was thinking of extending that code:
>
> Index: subversion/mod_dav_svn/version.c
> ===
> --- subversion/mod_dav_svn/version.c(revision 1820704)
> +++ subversion/mod_dav_svn/version.c(working copy)
> @@ -152,9 +152,6 @@ get_vsn_options(apr_pool_t *p, apr_text_header *ph
>apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_INHERITED_PROPS);
>apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_INLINE_PROPS);
>apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_REVERSE_FILE_REVS);
> -  apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_SVNDIFF1);
> -  apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_SVNDIFF2);
> -  apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_PUT_RESULT_CHECKSUM);
>apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_LIST);
>/* Mergeinfo is a special case: here we merely say that the server
> * knows how to handle mergeinfo -- whether the repository does too
> @@ -297,6 +294,19 @@ get_option(const dav_resource *resource,
>  { "create-txn-with-props",  { 1, 8, 0, "" } },
>};
>
> +  /* These capabilities are used during commit and when acting as
> + a WebDAV slave (SVNMasterURI) their availablity depends on
> + the master version (SVNMasterVersion) rather than our own
> + (slave) version. */
> +  struct capability_versions_t {
> +const char *capability_name;
> +svn_version_t min_version;
> +  } capabilities[] = {
> +{ SVN_DAV_NS_DAV_SVN_SVNDIFF1, { 1, 7, 0, ""} },
> +{ SVN_DAV_NS_DAV_SVN_SVNDIFF2, { 1, 10, 0, ""} },
> +{ SVN_DAV_NS_DAV_SVN_PUT_RESULT_CHECKSUM, { 1, 10, 0, ""} },
> +  };

I would be fine with this approach as well, but perhaps with a few tweaks:

 - I think that it would be better to handle all four capabilities that
   relate to proxied writes in the same place:

 SVN_DAV_NS_DAV_SVN_EPHEMERAL_TXNPROPS
 SVN_DAV_NS_DAV_SVN_SVNDIFF1
 SVN_DAV_NS_DAV_SVN_SVNDIFF2
 SVN_DAV_NS_DAV_SVN_PUT_RESULT_CHECKSUM

   Otherwise, the handling becomes asymmetric, as the first of them is still
   checked using the separate dav_svn__check_ephemeral_txnprops_support()
   function, but the three new capabilities are checked using the new array-
   based approach.

 - The SVN_DAV_NS_DAV_SVN_SVNDIFF1 capability should probably be
   bound to version 1.10 instead of 1.7.

   The reason for that is that mod_dav_svn didn't properly advertise
   svndiff1 support with this specific header until 1.10, and I think
   that we should have the same behavior for the setups with a write-
   through proxy.

   (Even though mod_dav_svn has been able to parse it for a long time,
currently ra_serf avoids sending svndiff1 to servers that don't
advertise it with this specific header, as there might be third-party
implementations of the Subversion's HTTP protocol that can only
parse svndiff0).
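The table-driven check being discussed boils down to comparing each capability's minimum version against the declared master version. A hypothetical Python model, with the version numbers following the tweaks suggested above:

```python
# Minimum master version required before each commit-related capability
# may be advertised (illustrative names and values, per the tweaks above).
CAPABILITY_MIN_VERSION = {
    "ephemeral-txnprops":  (1, 8, 0),
    "svndiff1":            (1, 10, 0),  # 1.10, not 1.7: older servers
                                        # never advertised this header
    "svndiff2":            (1, 10, 0),
    "put-result-checksum": (1, 10, 0),
}

def advertised_capabilities(master_version=None):
    """With no SVNMasterVersion declared there is no master to dumb
    ourselves down for, so advertise everything; otherwise advertise
    only what the declared master version supports."""
    if master_version is None:
        return set(CAPABILITY_MIN_VERSION)
    # Python tuple comparison orders (1, 9, 0) < (1, 10, 0) as intended.
    return {name for name, minimum in CAPABILITY_MIN_VERSION.items()
            if master_version >= minimum}

print(sorted(advertised_capabilities((1, 9, 0))))  # ['ephemeral-txnprops']
```

Keeping all four write-path capabilities in one table, as suggested, means a future capability is one dictionary entry rather than another bespoke check function.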


Thanks,
Evgeny Kotkov


Re: Problems with commit feature negotiation and write-through proxies

2018-01-10 Thread Evgeny Kotkov
Philip Martin <phi...@codematters.co.uk> writes:

> In the past we supported different versions on the master and slave and
> we introduced the SVNMasterVersion directive to configure it.
> Unfortunately that information is not immediately available in
> get_vsn_options() where the server advertises some commit features.  If
> I simply remove the SVN_DAV_NS_DAV_SVN_SVNDIFF2 header from
> get_vsn_options() then I can commit.
>
> I'm not sure how we fix this.  Do we delay the svndiff2 negotiation
> until later in the commit?  Do we abandon mixed master/slave versions?

I think that we might be able to selectively stop advertising the new-
in-1.10 capabilities based on the SVNMasterVersion value, similarly to
how we handle SVN_DAV_NS_DAV_SVN_EPHEMERAL_TXNPROPS.

Perhaps, we could do something along the lines of the attached patch?


Thanks,
Evgeny Kotkov
Index: subversion/mod_dav_svn/dav_svn.h
===
--- subversion/mod_dav_svn/dav_svn.h(revision 1820731)
+++ subversion/mod_dav_svn/dav_svn.h(working copy)
@@ -359,12 +359,24 @@ svn_boolean_t dav_svn__get_list_parentpath_flag(re
master server version (if provided via SVNMasterVersion).  */
 svn_boolean_t dav_svn__check_httpv2_support(request_rec *r);
 
-/* For the repository referred to by this request, should ephemeral
-   txnprop support be advertised?  */
-svn_boolean_t dav_svn__check_ephemeral_txnprops_support(request_rec *r);
+typedef enum dav_svn__commit_capability_t
+{
+  dav_svn__capability_ephemeral_txnprops,
+  dav_svn__capability_svndiff1,
+  dav_svn__capability_svndiff2,
+  dav_svn__capability_put_result_checksum
+} dav_svn__commit_capability_t;
 
+/* For the repository referred to by this request, is the specified
+   commit-related capability supported?  Note that this also takes into
+   account the support level expected based on the specified
+   master server version (if provided via SVNMasterVersion).  */
+svn_boolean_t
+dav_svn__check_commit_capability(request_rec *r,
+ dav_svn__commit_capability_t capability);
 
 
+
 /* SPECIAL URI
 
SVN needs to create many types of "pseudo resources" -- resources
Index: subversion/mod_dav_svn/mod_dav_svn.c
===
--- subversion/mod_dav_svn/mod_dav_svn.c(revision 1820731)
+++ subversion/mod_dav_svn/mod_dav_svn.c(working copy)
@@ -924,17 +924,30 @@ dav_svn__check_httpv2_support(request_rec *r)
 
 
 svn_boolean_t
-dav_svn__check_ephemeral_txnprops_support(request_rec *r)
+dav_svn__check_commit_capability(request_rec *r,
+ dav_svn__commit_capability_t capability)
 {
   svn_version_t *version = dav_svn__get_master_version(r);
 
-  /* We know this server supports ephemeral txnprops.  But if we're
- proxying requests to a master server, we need to see if it
- supports them, too.  */
-  if (version && (! svn_version__at_least(version, 1, 8, 0)))
-return FALSE;
+  /* Unless we are proxying requests to a master server with a declared
+ version number, there is no need to dumb ourselves down. */
+  if (!version)
+return TRUE;
 
-  return TRUE;
+  /* We _are_ proxying requests to another master server with a declared
+ version number.  See if we need to selectively disable some of the
+ capabilities that it doesn't support. */
+  switch (capability)
+{
+  case dav_svn__capability_ephemeral_txnprops:
+return svn_version__at_least(version, 1, 8, 0);
+  case dav_svn__capability_svndiff1:
+  case dav_svn__capability_svndiff2:
+  case dav_svn__capability_put_result_checksum:
+return svn_version__at_least(version, 1, 10, 0);
+  default:
+SVN_ERR_MALFUNCTION_NO_RETURN();
+}
 }
 
 
Index: subversion/mod_dav_svn/version.c
===
--- subversion/mod_dav_svn/version.c(revision 1820731)
+++ subversion/mod_dav_svn/version.c(working copy)
@@ -152,9 +152,6 @@ get_vsn_options(apr_pool_t *p, apr_text_header *ph
   apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_INHERITED_PROPS);
   apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_INLINE_PROPS);
   apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_REVERSE_FILE_REVS);
-  apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_SVNDIFF1);
-  apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_SVNDIFF2);
-  apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_PUT_RESULT_CHECKSUM);
   apr_text_append(p, phdr, SVN_DAV_NS_DAV_SVN_LIST);
   /* Mergeinfo is a special case: here we merely say that the server
* knows how to handle mergeinfo -- whether the repository does too
@@ -210,12 +207,29 @@ get_option(const dav_resource *resource,
   "");
 
   /* If we're allowed (by configuration) to do so, advertise support
- for ephemeral transaction properties. */
-  if (dav_svn__check_ep

Re: svn commit: r1818496 - /subversion/site/staging/docs/release-notes/1.10.html

2017-12-17 Thread Evgeny Kotkov
Daniel Shahaf <d...@daniel.shahaf.name> writes:

> This language implies that we made a backwards incompatible change in 1.6,
> which is not precisely the case; 1.6 simply started enforcing an API
> requirement that 1.5 did not.
>
> Could we perhaps rephrase this accordingly?  I.e., point out that we weren't
> inventing a backwards-incompatible requirement, but starting to enforce one
> that had always been there?

I would think that from the standpoint of a user reading the release notes,
both versions look more or less the same (maybe, with the second one being
slightly harder to parse due to its verbosity):

  Such property line endings were accepted by older servers and can be found
  in repository dumps, but are considered invalid starting with Subversion 1.6.

  Such invalid line endings were accepted by older servers and can be found
  in repository dumps of older repositories, but are rejected by Subversion 1.6
  and later.

But, presumably, both variants are fine, as both of them explain the reason
behind the "Cannot accept non-LF line endings" error — so please feel free
to commit this tweak if you think it would be more appropriate.


Thanks,
Evgeny Kotkov


Re: svn commit: r1807836 - in /subversion/trunk: subversion/include/ subversion/include/private/ subversion/libsvn_repos/ subversion/svnadmin/ subversion/svnrdump/ subversion/tests/cmdline/ subversion

2017-12-17 Thread Evgeny Kotkov
Johan Corveleyn <jcor...@gmail.com> writes:

> Hi Evgeny,
>
> I just mentioned this upcoming --normalize-props feature in a users
> thread (and included a link to your commit). I noticed that it's not
> mentioned in the release notes of 1.10 yet. Do you have time to write
> a small release notes entry for it?

Thanks for the reminder, committed in r1818496:

  https://svn.apache.org/r1818496
  
https://subversion-staging.apache.org/docs/release-notes/1.10.html#normalize-props


Regards,
Evgeny Kotkov


Re: [PATCH] Use the `WITHOUT ROWID` SQLite optimization for rep-cache.db

2017-12-08 Thread Evgeny Kotkov
Bert Huijben <b...@qqmail.nl> writes:

>> The recent SQLite versions (starting from 3.8.2, released in December 2013)
>> feature a `WITHOUT ROWID` optimization [1] that can be enabled when
>> creating a table.  In short, it works well for tables that have non-integer
>> primary keys, such as
>>
>> name TEXT PRIMARY KEY
>>
>> by not maintaining the hidden rowid values and another B-Tree to match
>> between a primary key value and its rowid.  This reduces the on-disk size
>> and makes the lookups faster (a key → rowid → data lookup is replaced with
>> a key → data lookup).
>
> It doesn't add another B-tree for the primary key and its rowids. For the
> primary key the main table is used as the index.
>
> The case where things differ is when there are multiple indexes. In this
> case normal table will always refer to the primary key using the rowed,
> while for 'WITHOUT ROWID' there will be referred to the primary key,
> which in general is larger.

For the sake of finding out the truth :), I think that this contradicts the
explanation in https://sqlite.org/withoutrowid.html :

CREATE TABLE IF NOT EXISTS wordcount(
  word TEXT PRIMARY KEY,
  cnt INTEGER
);

As an ordinary SQLite table, "wordcount" is implemented as two separate
B-Trees. The main table uses the hidden rowid value as the key and stores
the "word" and "cnt" columns as data. The "TEXT PRIMARY KEY" phrase of
the CREATE TABLE statement causes the creation of an unique index on
the "word" column. This index is a separate B-Tree that uses "word" and
the "rowid" as the key and stores no data at all. Note that the
complete text of every "word" is stored twice: once in the main
table and again in the index.

Although I didn't check if the most recent SQLite version still behaves in
the described way internally, I have witnessed the described close-to-2x
reduction in the size of rep-cache.db — which, unless I am missing
something, follows the described idea of this optimization.


Thanks,
Evgeny Kotkov


Re: Subversion 1.10 RC1?

2017-12-05 Thread Evgeny Kotkov
Stefan Fuhrmann <stef...@apache.org> writes:

> There seems to be little that could be done here (suggestions welcome).
> The problem is that the asterisk is being expanded by the shell itself.
> I made SVN print its command line parameters and this is the result:
>
> $ ./subversion/svn/svn ls svn://localhost/kde --search M*
> 0: ./subversion/svn/svn
> 1: ls
> 2: svn://localhost/kde
> 3: --search
> 4: Makefile
> 5: Makefile.in
>
> That can be prevented by adding quotation marks:
>
> $ ./subversion/svn/svn ls svn://localhost/kde --search "M*"
> 0: ./subversion/svn/svn
> 1: ls
> 2: svn://localhost/kde
> 3: --search
> 4: M*

Unfortunately, on Windows both `--search M*` and (quoted) `--search "M*"`
would expand the asterisk.  While this is not the default shell behavior,
currently it's enabled for svn and a couple of other binaries by linking
to setargv.obj.  In turn, this probably means the command-line client
users on Windows could get quite unexpected results when using the
`--search ARG` syntax.

A potential cheap solution for this issue, I think, would be to make the
--search argument work as a substring to search for in filenames, instead
of using it as a pattern that the (whole) filename should match.  While
there are some cases where the latter approach gives more flexibility,
my guess would be that a substring search would work well in the majority
of scenarios.

(Also, as far as I recall, `log --search` currently searches for a substring,
 so that would be consistent with it, and would probably avoid surprising
 the users by having a switch with the same name, but behaving differently.)
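The behavioral difference between the two interpretations can be sketched as
follows (a hypothetical illustration using fnmatch-style globbing, not the
actual client code):

```python
import fnmatch

filenames = ["Makefile", "Makefile.in", "README", "config.sub"]
arg = "Make"

# Pattern interpretation: the *whole* filename must match the pattern,
# so a bare substring like "Make" (without wildcards) matches nothing.
pattern_hits = [f for f in filenames if fnmatch.fnmatch(f, arg)]
print(pattern_hits)    # []

# Substring interpretation (as in `log --search`): any filename that
# contains the argument matches, and no quoting of wildcards is needed.
substring_hits = [f for f in filenames if arg in f]
print(substring_hits)  # ['Makefile', 'Makefile.in']
```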


Thanks,
Evgeny Kotkov


Potential regression: high server-side memory consumption during import (was: Subversion 1.10 RC1?)

2017-12-05 Thread Evgeny Kotkov
Julian Foad <julianf...@apache.org> writes:

> After any issues raised in this discussion are resolved, we feel we should
> go ahead and produce RC1 as soon as possible.

I think that I am seeing a 1.10 regression in terms of httpd's memory
usage during large imports.

In my environment, when I `svn import` an unpacked version of Boost
(https://www.boost.org/) that consists of around 60,000 files, I see that
the memory consumption by the server process rapidly grows up to 1.5 GB.
The environment is a Windows 8.1 x64 machine with httpd 2.4.29 configured
to use short-circuit authz and HTTPS.  Apparently, this behavior is specific
to 1.10, as I cannot reproduce it with 1.9 binaries.

 (I haven't investigated the issue any further at this time, and it might as
  well be reproducible in other configurations.)


Thanks,
Evgeny Kotkov


Re: [PATCH] Use the `WITHOUT ROWID` SQLite optimization for rep-cache.db

2017-12-03 Thread Evgeny Kotkov
Evgeny Kotkov <evgeny.kot...@visualsvn.com> writes:

> I think that it would be nice to have this optimization in rep-cache.db,
> and that we can start using it in a compatible way:
>
>   - All existing rep-cache.db statements are compatible with it.
>
>   - Since SQLite versions prior to 3.8.2 don't support it, we would
> only create the new tables with this optimization in fsfs format 8,
> and simultaneously bump the minimal required SQLite version from
> 3.7.12 (May 2012) to 3.8.2 (December 2013).  This would ensure that
> all binaries supporting format 8 can work with the tables with this
> optimization.
>
> Would there be any objections to a change like this (see the attached patch)?

I committed the patch (with a couple of additional tweaks for the
recommended SQLite version and the build scripts) in

    https://svn.apache.org/r1817042


Thanks,
Evgeny Kotkov


[PATCH] Use the `WITHOUT ROWID` SQLite optimization for rep-cache.db

2017-11-30 Thread Evgeny Kotkov
Hi all,

The recent SQLite versions (starting from 3.8.2, released in December 2013)
feature a `WITHOUT ROWID` optimization [1] that can be enabled when creating
a table.  In short, it works well for tables that have non-integer primary
keys, such as

name TEXT PRIMARY KEY

by not maintaining the hidden rowid values and another B-Tree to match
between a primary key value and its rowid.  This reduces the on-disk size
and makes the lookups faster (a key → rowid → data lookup is replaced with
a key → data lookup).

Currently, the rep-cache.db schema uses a non-integer primary key:

hash TEXT NOT NULL PRIMARY KEY

and can benefit from this optimization.  A quick experiment showed a
reduction of the on-disk size of the database by ~1.75x.  The lookups
should also be faster, both due to the reduced database size and due to
the smaller number of internal bsearches.  This should improve the times
of new commits and `svnadmin load`, especially for large repositories
that also have large rep-cache.db files.
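As a rough illustration (using Python's built-in sqlite3 module; the columns
beyond `hash` are assumptions for this example, since only the primary key
column is quoted above), the change amounts to appending `WITHOUT ROWID` to
the CREATE TABLE statement, while the usual lookup statements keep working
unchanged:

```python
import sqlite3

# Requires SQLite >= 3.8.2, the version that introduced WITHOUT ROWID.
conn = sqlite3.connect(":memory:")

# V2-style schema: same columns, but without the hidden rowid and without
# the separate index B-Tree for the TEXT primary key.
conn.execute("""
    CREATE TABLE rep_cache (
        hash TEXT NOT NULL PRIMARY KEY,
        revision INTEGER NOT NULL,
        offset INTEGER NOT NULL,
        size INTEGER NOT NULL,
        expanded_size INTEGER NOT NULL
    ) WITHOUT ROWID
""")

# Statements written against the rowid-based schema remain valid.
conn.execute("INSERT INTO rep_cache VALUES (?, ?, ?, ?, ?)",
             ("28ef320a82e7bd11eebdf3502d69e608", 15, 563, 7809, 7809))
row = conn.execute("SELECT revision, offset, size, expanded_size "
                   "FROM rep_cache WHERE hash = ?",
                   ("28ef320a82e7bd11eebdf3502d69e608",)).fetchone()
print(row)  # (15, 563, 7809, 7809)
```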

I think that it would be nice to have this optimization in rep-cache.db,
and that we can start using it in a compatible way:

  - All existing rep-cache.db statements are compatible with it.

  - Since SQLite versions prior to 3.8.2 don't support it, we would
only create the new tables with this optimization in fsfs format 8,
and simultaneously bump the minimal required SQLite version from
3.7.12 (May 2012) to 3.8.2 (December 2013).  This would ensure that
all binaries supporting format 8 can work with the tables with this
optimization.

Would there be any objections to a change like this (see the attached patch)?

[1] https://sqlite.org/withoutrowid.html


Thanks,
Evgeny Kotkov
fsfs: Use the `WITHOUT ROWID` optimization for rep-cache.db in format 8.

This optimization, introduced in SQLite 3.8.2, works well for tables that
have non-integer primary keys, such as

hash TEXT NOT NULL PRIMARY KEY

in the rep-cache.db.  (See the https://sqlite.org/withoutrowid.html article
for additional details.)

A quick experiment showed a reduction of the on-disk size of the database
by ~1.75x.  The lookups should also be faster, both due to the reduced
database size and due to the smaller number of internal bsearches.  This
should improve the times of new commits and `svnadmin load`, especially
for large repositories that also have large rep-cache.db files.

In order to maintain compatibility, since SQLite versions prior to 3.8.2
do not support this statement, we only start using it for fsfs format 8
repositories and simultaneously bump the minimal required SQLite version
from 3.7.12 (May 2012) to 3.8.2 (December 2013).  The last step ensures that
all binaries compiled to support format 8 can work with the tables with
this optimization.

* subversion/libsvn_fs_fs/rep-cache-db.sql
  (STMT_CREATE_SCHEMA): Rename this ...
  (STMT_CREATE_SCHEMA_V1): ...to this.
  (STMT_CREATE_SCHEMA_V2): New, enables `WITHOUT ROWID` optimization.
  (STMT_GET_REP, STMT_SET_REP, STMT_GET_REPS_FOR_RANGE,
   STMT_GET_MAX_REV, STMT_DEL_REPS_YOUNGER_THAN_REV,
   STMT_LOCK_REP, STMT_UNLOCK_REP):
   Note that these statements work for both V1 and V2 schemas.

* subversion/libsvn_fs_fs/fs.h
  (SVN_FS_FS__MIN_REP_CACHE_SCHEMA_V2_FORMAT): New.

* subversion/libsvn_fs_fs/rep-cache.c
  (REP_CACHE_SCHEMA_FORMAT): Remove.
  (open_rep_cache): Select between creating a V1 or V2 schemas based
   on the format of the filesystem.

* subversion/libsvn_subr/sqlite.c
  (): Bump minimum required SQLite version to 3.8.2.

* subversion/tests/cmdline/svnadmin_tests.py
  (check_hotcopy_fsfs_fsx): Check if Python's built-in SQLite version
   is enough to interpret the schema of rep-cache.db, and skip the check
   if it's not.

Index: subversion/libsvn_fs_fs/fs.h
===
--- subversion/libsvn_fs_fs/fs.h(revision 1816730)
+++ subversion/libsvn_fs_fs/fs.h(working copy)
@@ -196,6 +196,10 @@ extern "C" {
  */
 #define SVN_FS_FS__MIN_REP_STRING_OPTIONAL_VALUES_FORMAT 8
 
+ /* The minimum format number that supports V2 schema of the rep-cache.db
+database. */
+#define SVN_FS_FS__MIN_REP_CACHE_SCHEMA_V2_FORMAT 8
+
 /* On most operating systems apr implements file locks per process, not
per file.  On Windows apr implements the locking as per file handle
locks, so we don't have to add our own mutex for just in-process
Index: subversion/libsvn_fs_fs/rep-cache-db.sql
===
--- subversion/libsvn_fs_fs/rep-cache-db.sql(revision 1816730)
+++ subversion/libsvn_fs_fs/rep-cache-db.sql(working copy)
@@ -21,7 +21,7 @@
  * 
  */
 
--- STMT_CREATE_SCHEMA
+-- STMT_CREATE_SCHEMA_V1
 /* A table mapping representation hashes to locations in a rev file. */
 CREATE TABLE rep_cache (
   hash TEXT NOT NULL PRIMARY KEY,
@@ -33,26 +33,50

Re: Let's switch to time-based releases

2017-11-28 Thread Evgeny Kotkov
Branko Čibej <br...@apache.org> writes:

>> I know this has been discussed before, but it was never made reality.
>> On last week's hackathon we also didn't come to a full consensus on
>> various details. But I think we should simply work out those details,
>> hopefully come to some compromise, and if so, JFD it.
>
> It has been discussed before and for a while we even did it this way.
> The problem is that we're again at the point where we have to find a
> reluctant volunteer to RM any release we want to make. So ...
>
> "There's no 'we' in Team" :p  Some individual has to step up and do
> this. The sort of release train you describe needs someone to drive it,
> to whit, a "permanent" release manager. No amount of documentation and
> schedules and good intentions will make this happen all by itself.

I concur that the time-based releases most likely won't happen just based
on the agreement, and that there has to be a person driving them.

Perhaps, we could try this approach by finding a volunteer responsible for
RM-ing the 1.11 release — for example, right after 1.10 is released.


Regards,
Evgeny Kotkov


Supporting precooked fsfs v7 and v1-v3 in the test suite (was: Re: svn commit: r1813898 [...])

2017-11-27 Thread Evgeny Kotkov
[Changing subject, as apparently the test failures are not connected to
 the discussed fix for the rep sharing and issues #4623, #4700]

Stefan Fuhrmann <stef...@apache.org> writes:

> Test expectations may need to be adapted.
> With v7 repos (see r1816402), I get two
> test failures while v4 and v6 pass:

[...]

> FAIL:  svnadmin_tests.py 12: svnadmin verify detects corruption dump can't

[...]

> FAIL:  svnadmin_tests.py 24: svnadmin verify with non-UTF-8 paths

I committed two follow-ups to the addition of precooked v7 repositories
in r1816402:

https://svn.apache.org/r1816424
https://svn.apache.org/r1816425

The first of these follow-ups limits the mentioned svnadmin tests to only
run with precooked repositories of formats 4 and 6 — as these tests perform
a set of hardcoded manipulations, assuming that the repository does not
use logical addressing.

Speaking of the added precooked repositories of formats 1-3, I am not too
sure if they should be created with the 1.9 binaries, as stated in the log
message for r1816402.  I think that the actual repositories created with
older SVN versions may be different from the ones created with the
`--compatible=version=` option, and that using the latter approach may
just create a false sense of security.

Perhaps, an alternative way would be to find the 1.1 binaries, prepare the
precooked format 1 repositories using it and limit the set of precooked
repositories to v1, v4, v6 and v7.

(Also, I think that the main.py:parse_options() function may need an update
 to properly handle the added v1-v3 precooked repositories by downgrading
 the server_minor_version — as otherwise, the various test suite checks such
 as server_has_mergeinfo() won't kick in.)


Thanks,
Evgeny Kotkov


Re: svn commit: r1813898 - in /subversion/trunk/subversion: libsvn_fs_fs/transaction.c libsvn_repos/reporter.c tests/cmdline/basic_tests.py tests/cmdline/svnadmin_tests.py

2017-11-25 Thread Evgeny Kotkov
Stefan Fuhrmann <stef...@apache.org> writes:

>> An alternative approach that might be worth considering here would be:
>>
>>   (1) Extend the on-disk format and allow representation strings without
>>   SHA1, but with the uniquifier, something like this (where "-" stands
>>   for "no SHA1"):
>>
>>   15 0 563 7809 28ef320a82e7bd11eebdf3502d69e608 - 14-g/_5
>>
>>   (The new format would be allowed starting from FSFS 8.)
>
> Yes, that would be one option. If you are willing to provide a patch,
> I would probably +1 it. Not bothering with the space savings would
> be a valid choice as well, IMO.
>>
>>   (2) Use the new format to allow rep sharing for properties that writes
>>   the uniquifier so that svn_fs_props_changed() would work correctly,
>>   and doesn't introduce the overhead of writing SHA1 in the
>>   representation string for every property.

[...]

>> Barring objections and alternative suggestions, I could give a shot at
>> implementing this.
>
> Go for it. Maybe notify me once you are done b/c currently, I don't
> monitor the commit activity closely.

I committed the implementations of (1) and (2) in

https://svn.apache.org/r1816347 and
https://svn.apache.org/r1816348

respectively.


Thanks,
Evgeny Kotkov


Re: svn commit: r1813898 - in /subversion/trunk/subversion: libsvn_fs_fs/transaction.c libsvn_repos/reporter.c tests/cmdline/basic_tests.py tests/cmdline/svnadmin_tests.py

2017-11-22 Thread Evgeny Kotkov
Stefan Sperling <s...@apache.org> writes:

> However, if rep-sharing is enabled, svn_fs_props_changed() does not work
> as advertised because properties do not carry a SHA1 checksum with a
> "uniquifier" which identifies the transaction they were created in.
> The uniquifier is used by svn_fs_props_changed() to tell apart property
> representations which share content but were created in different revisions.
>
> To fix that problem, make FSFS write SHA1 checksums along with uniquifiers
> for file properties, just as it is already done for file content.
>
> A source code comment indicates that SHA1 checksums for property reps
> were not written due to concerns over disk space. In hindsight, this was
> a bad trade-off because it affected correctness of svn_fs_props_changed().

Thinking about this change, it could be that writing the additional 40-byte
SHA1 for the property representations is going to eliminate the benefit of
sharing them in the first place.

If I recall correctly, rep sharing for properties is mostly there to gain
from deduplicating small properties, such as svn:eol-style or svn:keywords.
But with the additional SHA1 written in every representation string, this
overhead is likely to take more space than the property itself.

An alternative approach that might be worth considering here would be:

 (1) Extend the on-disk format and allow representation strings without
 SHA1, but with the uniquifier, something like this (where "-" stands
 for "no SHA1"):

 15 0 563 7809 28ef320a82e7bd11eebdf3502d69e608 - 14-g/_5

 (The new format would be allowed starting from FSFS 8.)

 (2) Use the new format to allow rep sharing for properties that writes
 the uniquifier so that svn_fs_props_changed() would work correctly,
 and doesn't introduce the overhead of writing SHA1 in the representation
 string for every property.

 (3) Disable rep sharing for properties in FSFS formats < 8 that cannot
 read and write such representation strings without SHA1, but with an
 uniquifier.

Barring objections and alternative suggestions, I could give a shot at
implementing this.
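For illustration, a hypothetical sketch of parsing such an extended
representation string, where "-" stands for "no SHA1" (the field names are
assumptions based on the example in (1), not the actual FSFS parser):

```python
def parse_rep_string(line):
    # Assumed field layout, based on the example string above:
    # revision item size expanded-size md5 [sha1|-] uniquifier
    rev, item, size, expanded, md5, sha1, uniquifier = line.split()
    return {
        "revision": int(rev),
        "item": int(item),
        "size": int(size),
        "expanded_size": int(expanded),
        "md5": md5,
        "sha1": None if sha1 == "-" else sha1,  # "-" means "no SHA1"
        "uniquifier": uniquifier,
    }

rep = parse_rep_string("15 0 563 7809 28ef320a82e7bd11eebdf3502d69e608 - 14-g/_5")
print(rep["sha1"])        # None
print(rep["uniquifier"])  # 14-g/_5
```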


Regards,
Evgeny Kotkov


Re: svn commit: r1816069 - /subversion/branches/1.8.x-issue4707/subversion/svnrdump/dump_editor.c

2017-11-22 Thread Evgeny Kotkov
Julian Foad <julianf...@apache.org> writes:

> -": %lu\n",
> -(unsigned long)info->size));
> +": %" APR_SIZE_T_FMT "\n",
> +info->size));

I think that using APR_SIZE_T_FMT would still lead to the same issue with
large file sizes in the 32-bit environment (where size_t is also 32 bit).

Perhaps, the code should be using APR_OFF_T_FMT as the format specifier?

(The apr_file_info_t.size value is an apr_off_t)


Regards,
Evgeny Kotkov


Re: svn commit: r1803143 - in /subversion/trunk/subversion: include/ libsvn_delta/ libsvn_ra_serf/ mod_dav_svn/

2017-11-22 Thread Evgeny Kotkov
Evgeny Kotkov <evgeny.kot...@visualsvn.com> writes:

> Daniel Shahaf <d...@daniel.shahaf.name> writes:
>
>>> +   *
>>> +   * Any temporary allocations may be performed in @a scratch_pool.
>>
>> Need to add an @since tag here.
>
> [...]
>
>>> +   */
>>> +  svn_error_t *(*apply_textdelta_stream)(
>>
>> Could you update the docstring of svn_delta_editor_t itself to mention this
>> new callback?  All other callbacks are discussed there.
>
> [...]
>
>>> +const svn_delta_editor_t *editor,
>>
>> This parameter is undocumented.
>
> Thank you for the review.  I agree with these three points, and will try
> to make the suggested tweaks to the documentation.

Committed in r1816063.


Daniel Shahaf <d...@daniel.shahaf.name> writes:

> +1, but note that the "shouldn't" language in current HEAD, which this
> patch [once rebased to HEAD] will remove, was copied from some other
> docstring.  I made no note of which one specifically, because I think
> that "shouldn't" language is our standard formula for docstrings of
> non-opaque struct types.

Committed in r1816066.


Regards,
Evgeny Kotkov


Re: svn commit: r1815799 - in /subversion/trunk/subversion: libsvn_ra_serf/commit.c tests/libsvn_ra/ra-test.c

2017-11-22 Thread Evgeny Kotkov
Evgeny Kotkov <evgeny.kot...@visualsvn.com> writes:

> (2) I am going to tweak the new test so that it would properly open the
> parent directory and commit to a locked file, to have this case
> covered with a native test.

Committed in r1816060.

Julian Foad <julianf...@apache.org> writes:

>> (1) Keep the new simpler check in maybe_set_lock_token_header() — as,
>>  unless I'm missing something, there should be no reason to explicitly
>>  filter empty relpaths for the lock tokens since they are invalid.
>
> I agree it is good to remove the '*relpath &&' condition if there is no
> reason why it needs to be there.
>
> Don't replace it with 'relpath &&' instead, however. If relpath is null then
> I think the next line (svn_hash_gets(..., relpath)) would immediately crash
> anyway, so allowing it here is useless and therefore confusing. Remove that
> condition entirely. That's my suggestion.

The condition is inverted in the sense that it checks for a null relpath
and returns if so — in other words, the svn_hash_gets() won't get called
if the relpath is null.

However, as it turns out, this condition can indeed be simplified by not
checking for null, as the only calling site where the relpath might be
null, setup_proppatch_headers(), already checks for it.  (Among the other
two calling sites, the relpath cannot be null as it would have segfaulted
earlier).

I committed this simplification in r1816061, thanks!


Regards,
Evgeny Kotkov


Re: Subversion 1.10 RC1?

2017-11-22 Thread Evgeny Kotkov
Julian Foad <julianf...@apache.org> writes:

> At the hackathon today we (me, Stefan Hett, Bert, Johan) have been talking
> about how to progress 1.10.
>
> We think all the features and changes are safe to release and are not going
> to get more testing until we produce a "release candidate". (For example, at
> that point Stefan will be able to justify taking time at his work to test
> the client in production scenarios.)

I did a couple of quick tests for these new features; some of my findings
that we might want to address before 1.10 GA and additional comments are
below:

>   * conflict resolution: we understand it is already much better than 1.9;
> low risk if parts of it don't work quite right; designed so that
> improvements can be made in patch releases.

I noticed that if the resolution of a tree conflict leads to a text conflict,
the temporary files (base, mine, theirs) are not cleaned up upon revert
and remain as unversioned files in the working copy.

C  +NewFile.txt
> moved from OldFile.txt
?   NewFile.txt.2.tmp
?   NewFile.txt.3.tmp
?   NewFile.txt.tmp

>   * shelving v1: is isolated -- doesn't affect anything else; is limited but
> already useful; will be changed in the next release so APIs are marked
> "SVN_EXPERIMENTAL"; changes shelved by this release could be detected and
> 'upgraded' by a future release; should the CLI commands be marked
> "experimental" in the help, too (Johan thinks yes)?

One comment that I have is that marking the new commands experimental
might not prevent the users from using them on a regular basis (e.g., for
those who don't read help or use a GUI client like TSVN that might not
have the "experimental" labels).

Which, in turn, could mean that if the future format changes, we would
have to convert the data stored in the patch files to avoid a data loss
for such users.

Also, out of curiosity, are there any current plans to support binary changes
for the shelves?

As far as I recall, there has been a mention of using diff --git for these
purposes, and I also saw a recent commit where you added a test for diff
--git, which might be related to this topic :)


The other two features that I remember, are:

* improved authz with support for wildcards

* server-side search with `svn ls --search`

Speaking of the `ls --search`, I think that there is an issue with the
command-line parsing.  Based on what I see, the --search argument may
be interpreted as the target for listing, but not as the search pattern:

svn ls --search *

svn: warning: apr_err=SVN_ERR_WC_PATH_NOT_FOUND
svn: warning: W155010: The node 'C:\Project\unversioned' was not found.
..\..\..\subversion\svn\list-cmd.c:453: (apr_err=SVN_ERR_ILLEGAL_TARGET)
svn: E29: Could not list all targets because some targets don't exist

svn ls http://spbvo-ws09.ostyserver.net:8080/svn/master2 --search *
svn: E155007: 'C:\AnotherProject' is not a working copy


Thanks,
Evgeny Kotkov


Re: svn commit: r1815799 - in /subversion/trunk/subversion: libsvn_ra_serf/commit.c tests/libsvn_ra/ra-test.c

2017-11-21 Thread Evgeny Kotkov
Bert Huijben <b...@qqmail.nl> writes:

> I'm not sure if we should really allow this.
>
> The delta editor explicitly describes that you are opening a directory and
> then edit the nodes inside. Only changing properties on the root is allowed
> and other operations are all on nodes within. Allowing to open the node
> itself again may cause all kinds of problems as there are now multiple
> handles pointing to the same thing. How will this be expressed in the
> filesystem/transaction?
>
> I'm surprised that all the other filesystems allow this, so perhaps this
> is a safe change... but the documentation in svn_delta.h doesn't describe
> this as a safe extension. (Which would theoretically allow this as a safe
> extension in later versions... but we must make sure that we are not
> opening new issues this way)
>
> Currently I would guess that making the ra layers provide a proper error
> for this case would not be a bad thing... All our drivers explicitly open
> an existing directory when they want to edit a file...

I think that I have missed that open_root() must open a directory and so
it cannot be used when working with a file URL.  This, in turn, means
that the test example is indeed an undocumented/invalid usage of the
delta editor API and that it works by coincidence.

I can propose doing the following:

(1) Keep the new simpler check in maybe_set_lock_token_header() — as,
unless I'm missing something, there should be no reason to explicitly
filter empty relpaths for the lock tokens since they are invalid.

(2) I am going to tweak the new test so that it would properly open the
parent directory and commit to a locked file, to have this case
covered with a native test.


Thanks,
Evgeny Kotkov


Upcoming Subversion hackathon 2017

2017-09-21 Thread Evgeny Kotkov
Dear Members of the Subversion community,

The Apache Subversion project organizes a yearly 1-week-long hackathon
for its project members where we gather at one location and push
forward the development of the project.  This gives us a time frame where
we can fully focus on development rather than investing only our
scattered spare time.  This event also serves as a chance to start
and progress more complex ideas/features which require more coordination
between multiple developers.

The event is planned to be held in Aachen, Germany, in late November.
The approximate dates for the main part of this event are either November
13-17th or 20-24th.  One of the Subversion PMC members, Stefan Hett, who
currently resides in Aachen, is going to take on most of the organizational
work for this hackathon.

We will post additional information on the event in the nearby future.
If you would be interested in participating in it, please don't hesitate to
contact us at priv...@subversion.apache.org

Sponsorship:

This is a great opportunity for companies involved in the Apache Subversion
project, or who benefit from its code - your employers - to get their name
in front of the community.  If you are interested in becoming a sponsor of
this event and supporting the further development of the project by such
means, please contact us at priv...@subversion.apache.org for details.


Thanks,
Evgeny Kotkov (VP Apache Subversion)


Re: FAQ entries #dumpload and (new) #fix-nonLF-log

2017-09-08 Thread Evgeny Kotkov
Johan Corveleyn <jcor...@gmail.com> writes:

> Also: this reminds me of those other whiteboard items from 2015:
> "svnadmin load --normalize-revprops", "- normalize svn:* props", "-
> normalize mergeinfo", to get 'svnadmin load' on par with svnsync
> regarding normalization. Having those options (at least the first two)
> would make those faq answers a lot shorter ... If anyone has some
> cycles to spare, 1.10 might be a good target to try and get these in
> ...

I committed the first implementation of the --normalize-props option for
svnadmin load in https://svn.apache.org/r1807836

This option currently translates non-LF line endings found in the svn:
property values.  I think that this should improve the situation for
users who attempt to load dump files with such property values.
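The normalization itself amounts to translating CRLF and bare CR line
endings to LF. A minimal sketch of that translation (a hypothetical
illustration, not the actual svnadmin implementation, which operates on
the dump stream):

```python
def normalize_line_endings(value):
    # Translate CRLF first, then any remaining bare CR, to LF;
    # doing it in this order keeps "\r\n" from becoming "\n\n".
    return value.replace("\r\n", "\n").replace("\r", "\n")

print(repr(normalize_line_endings("trunk:1-10\r\nbranches/foo:12\r")))
# -> 'trunk:1-10\nbranches/foo:12\n'
```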


Regards,
Evgeny Kotkov


Re: svn commit: r1803143 - in /subversion/trunk/subversion: include/ libsvn_delta/ libsvn_ra_serf/ mod_dav_svn/

2017-09-05 Thread Evgeny Kotkov
Daniel Shahaf <d...@daniel.shahaf.name> writes:

>> Also, I am not against the idea of tweaking the note by saying something
>> like "...because we may add new fields in the future", but I don't think
>> that it is required either.  (In other words, I am +0 to that.)
>
> Done in r1807028.

I think that although r1807028 provides the additional information to the
users, it simultaneously makes the API requirements less strict:

> - * @note Don't try to allocate one of these yourself.  Instead, always
> - * use svn_delta_default_editor() or some other constructor, to ensure
> - * that unused slots are filled in with no-op functions.
> + * @note Fields may be added to the end of this structure in future
> + * versions.  Therefore, users shouldn't allocate structures of this
> + * type, to preserve binary compatibility.
> + *
> + * @note It is recommended to use svn_delta_default_editor() or some other
> + * constructor, to ensure that unused slots are filled in with no-op 
> functions.

While it does clarify that new fields can be added to the struct, this
changeset also replaces words like "don't try" and "always" with "shouldn't"
and "is recommended" thus making the requirements a recommendation:

 - "Don't try to allocate one of these yourself" became "users shouldn't
   allocate structures"

 - "Instead, always use svn_delta_default_editor()" is now "It is
   recommended to use svn_delta_default_editor()"

Perhaps, a better way would be to keep the original wording, but mention
that the structure may be extended, as in this alternative patch:

[[[
--- subversion/include/svn_delta.h  (revision 1806549)
+++ subversion/include/svn_delta.h  (working copy)
@@ -694,8 +694,10 @@ svn_txdelta_skip_svndiff_window(apr_file_t *file,
  * as it produces the delta.
  *
  * @note Don't try to allocate one of these yourself.  Instead, always
- * use svn_delta_default_editor() or some other constructor, to ensure
- * that unused slots are filled in with no-op functions.
+ * use svn_delta_default_editor() or some other constructor, to avoid
+ * backwards compatibility problems if the structure is extended in
+ * future releases and to ensure that unused slots are filled in with
+ * no-op functions.
  *
  * Function Usage
  *
]]]


Regards,
Evgeny Kotkov


Re: svn commit: r1803143 - in /subversion/trunk/subversion: include/ libsvn_delta/ libsvn_ra_serf/ mod_dav_svn/

2017-09-01 Thread Evgeny Kotkov
Daniel Shahaf <d...@daniel.shahaf.name> writes:

> I'm not sure I'm getting through here.
>
> The note does say "Don't allocate one of these yourself" but in the
> second sentence implies that the reason for this is that if you allocate
> it yourself and don't initialize all pointer-to-function members,
> Trouble will ensue.  Therefore, the note could be construed as meaning
> that it is okay to allocate one of these yourself if you take care to
> initialize all pointer members.
>
> In other words: the comment could have been interpreted as a piece
> of advice --- "it is more robust to use _default_editor() than to allocate
> with malloc" --- as opposed to a blanket prohibition on allocating this
> struct type with malloc.
>
> (If one uses malloc and doesn't initialize all pointers, the result
> would be that an uninitialized pointer-to-function struct member is
> dereferenced.)
>
> In contrast, most of our other structs explicitly say that "Don't
> allocate yourself because we may add new fields in the future".  This
> struct does not give that reason.
>
> Makes sense?
>
> I suppose can just add an API erratum and/or release notes entry about this.

I think that the important thing about this documentation is that it
states that:

 (1) The API user shouldn't try to allocate the struct himself.

 (2) A constructor such as svn_delta_default_editor() should *always*
 be used.

To my mind, the current statement is clear, and it cannot be interpreted
as permitting the API user to malloc() the struct and initialize it
manually.

My opinion here is that neither an API erratum nor a release notes entry
is required, and adding one would just unnecessarily restate information
that's already available.

Also, I am not against the idea of tweaking the note by saying something
like "...because we may add new fields in the future", but I don't think
that it is required either.  (In other words, I am +0 to that.)


Regards,
Evgeny Kotkov


Re: svn commit: r1803143 - in /subversion/trunk/subversion: include/ libsvn_delta/ libsvn_ra_serf/ mod_dav_svn/

2017-08-31 Thread Evgeny Kotkov
Daniel Shahaf <d...@daniel.shahaf.name> writes:

>> +   *
>> +   * Any temporary allocations may be performed in @a scratch_pool.
>
> Need to add an @since tag here.

[...]

>> +   */
>> +  svn_error_t *(*apply_textdelta_stream)(
>
> Could you update the docstring of svn_delta_editor_t itself to mention this
> new callback?  All other callbacks are discussed there.

[...]

>> +const svn_delta_editor_t *editor,
>
> This parameter is undocumented.

Thank you for the review.  I agree with these three points, and will try
to make the suggested tweaks to the documentation.

>> +  /** Apply a text delta stream, yielding the new revision of a file.
>> +   *
>> +   * @a file_baton indicates the file we're creating or updating, and the
>> +   * ancestor file on which it is based; it is the baton set by some
>> +   * prior @c add_file or @c open_file callback.
>> +   *
>> +   * @a open_func is a function that opens a #svn_txdelta_stream_t object.
>> +   * @a open_baton is provided by the caller.
>> +   *
>> +   * @a base_checksum is the hex MD5 digest for the base text against
>> +   * which the delta is being applied; it is ignored if NULL, and may
>> +   * be ignored even if not NULL.  If it is not ignored, it must match
>
> What's the rationale for allowing drivees to ignore the checksum?
>
> This leeway enables failure modes that wouldn't be possible without it.
> (Drivers that are aware of this leeway will validate checksums even if the
> drivee doesn't, leading to duplicate work; drivers that are unaware of this
> requirement might not get checksum errors they should have.)
>
> I get that you just copied this part of the docstring from apply_textdelta(),
> but I'd like to understand what's the rationale here.  (And to see if this
> leeway should be deprecated)

My interpretation of the documentation is that "the result checksum must
be validated by the implementation, whereas validating the checksum of the
base text may be omitted".

Perhaps, there are cases where the base checksum won't be validated by
our existing implementations (what about BDB, for instance?), but at the
same time, I'm not too sure about the gain from _always_ requiring such
checks.

I think that, from the editor driver's point of view, the important thing
is the result checksum.  If the implementation also validates the base
checksum, that may allow it to skip actually applying the delta in case
of an early mismatch, give better error messages ("I have a base
checksum mismatch" instead of "I can't apply instruction 7 at offset
12345"), and maybe even detect the coding mistakes which cause the
delta to be applied to an unexpected base, but still yielding the
expected resulting fulltext.

Having all these properties is nice, but probably not mandatory.  And I
think that lifting the optionality of this check could shoot us in the foot
in the future, if we find ourselves writing an implementation where it is
particularly tricky to always implement the base text checks — for instance,
due to the used protocol or any other technical reasons.

Hope that I am not missing something subtle here :)

>> +   * the checksum of the base text against which svndiff data is being
>> +   * applied; if it does not, @c apply_textdelta_stream call which detects
>> +   * the mismatch will return the error SVN_ERR_CHECKSUM_MISMATCH
>> +   * (if there is no base text, there may still be an error if
>> +   * @a base_checksum is neither NULL nor the hex MD5 checksum of the
>> +   * empty string).
>
> To the best of my knowledge, we don't special case the empty string's
> checksum, d41d8cd98f00b204e9800998ecf8427e, anywhere.  We do special-case
> , though.  I assume the parenthetical should
> be fixed accordingly (both here and in apply_textdelta())?

I agree that this part of the docstring (same in svn_fs_apply_textdelta())
looks odd, and I don't think that the digest of an empty string is specially
handled somewhere in the implementations.

Perhaps, it would make sense to check on whether this has been true at
some point in the past and also tweak the docstring.

> Is adding this function an ABI-compatible change?  The docstring of
> svn_delta_editor_t does say """
>
>  * @note Don't try to allocate one of these yourself.  Instead, always
>  * use svn_delta_default_editor() or some other constructor, to ensure
>  * that unused slots are filled in with no-op functions.
>
> """, but an API consumer might have interpreted this note as meaning "You may
> use malloc(..., sizeof(svn_delta_editor_t)) if you take care to initialize
> all struct members", in which case, his code will not be ABI compatible
> with 1.10.

I think that adding this callback does not affect the ABI compatibility.
The note says "Don't try to allocate one of these yourself", whereas the
malloc(..., sizeof(svn_delta_editor_t)) example does exactly the opposite.


Thanks,
Evgeny Kotkov


Re: svn commit: r1806017 - /subversion/trunk/subversion/libsvn_ra_serf/merge.c

2017-08-30 Thread Evgeny Kotkov
Bert Huijben <b...@qqmail.nl> writes:

>> ra_serf: Prevent the server from generating and sending the full MERGE
>> response in cases when we don't require it.
>>
>> The full response is not required when working over HTTPv2 protocol.
>> When working over HTTPv1, it's only required when the new working copy
>> properties need to be stored as part of a commit (indicated by a non-null
>> svn_ra_push_wc_prop_func_t callback).
>
> Nice catch!
>
> Does this affect performance enough that we should backport this fix?

Thanks!

I guess that it would be nice to backport this fix, as it prevents the
server from reading the list of the committed changes after the commit
and also reduces the size of the sent response.  I think that the full
MERGE response can be quite large for commits with thousands of
changed paths, although I don't have any real numbers at this time.

With that in mind, I have put nominating this change on my todo list,
unless someone else beats me to it.


Regards,
Evgeny Kotkov


Re: svn commit: r1806548 - in /subversion/trunk/subversion: svn/svn.c svnbench/svnbench.c tests/cmdline/basic_tests.py

2017-08-29 Thread Evgeny Kotkov
Bert Huijben <b...@qqmail.nl> writes:

>> As Julian discovered, '--search' as used with 'svn log' may not be
>> suitable for 'svn ls'.  File name matching should be case-sensitive and
>> requires full patterns, just like e.g. the ordinary Unix command line
>> 'ls' command.
>>
>> Therefore, introduce a separate '--pattern' option for 'svn ls' that
>> works similar to patterns with the Unix command line 'ls'.  Since the
>> actual matching already conforms to that, we only need a different
>> option pre-processing.
>
> Perhaps we could use --glob, to allow other syntax patterns later?
>
> Not sure... perhaps --glob is too technical.

My 2 cents on this would be that having "svn ls --pattern" and "svn log
--search" options that only differ in terms of case sensitivity has a
chance of being confusing for the users.

Perhaps, a slightly better way would be to keep the case-insensitive behavior
by default, but add a "--case-sensitive" switch to handle the cases where
it's required?

That is,

svn ls --search

svn ls --search --case-sensitive


Regards,
Evgeny Kotkov


Re: [PATCH] Reduce the amount of WriteFile() syscalls when writing to buffered files

2017-08-23 Thread Evgeny Kotkov
Evgeny Kotkov <evgeny.kot...@visualsvn.com> writes:

> In the meanwhile, apparently, there is an oversight in the core V1 patch
> (3-reduce-syscalls-for-buffered-writes-v1.patch.txt):
>
> If the buffer is not empty, and the caller issues a write with a chunk
> that slightly exceeds the buffer size, for example, 4100 bytes, it would
> result in flushing the buffer _and_ writing the remaining couple of bytes
> with WriteFile().  An appropriate thing to do here would be to flush the
> buffer, but put the few remaining bytes into the buffer, instead of writing
> them to disk and thus making an unnecessary syscall.
>
> With all this in mind, I will prepare the V2 version of the core patch.

I have attached the V2 version of the core patch.

To solve the aforementioned oversight in the V1 patch, I implemented the
optimization in a slightly different manner, by keeping the existing loop
but also handling a condition where we can write the remaining data with
a single WriteFile() call.  Apart from this, the V2 patch also includes an
additional test, test_small_and_large_writes_buffered().

Speaking of the discussed idea of adjusting the strategy to avoid a memcpy()
with the write pattern like

WriteFile(13)
WriteFile(86267)
...

instead of

WriteFile(4096)
WriteFile(82185)
...

I preferred to keep the latter approach which keeps the minimum size of
the WriteFile() chunk equal to 4096, for the following reasons:

  - My benchmarks do not show that the alternative pattern that also avoids
a memcpy() results in a quantifiable performance improvement.

  - The existing code had a property that all WriteFile() calls, except
for maybe the last one, were performed in chunks with sizes that are
never less than 4096.  Although I can't reproduce this in my environment,
it could be that introducing a possibility of writing in smaller chunks
would result in decreased performance with specific hardware, non-file
handles, or in situations where OS write caching is disabled.
Therefore, currently, I think that it would be better to keep this
existing property.

  - Probably, implementing the first approach would result in a bit more
complex code, as I think that it would require introducing an additional
if-else code path.

The log message is included in the beginning of the patch file.


Thanks,
Evgeny Kotkov
Win32: Improve apr_file_write() performance on buffered files by reducing
the amount of WriteFile() calls for large writes.

Previously, writing has been implemented with a loop that keeps copying the
data to the internal 4KB buffer and writing this buffer to disk by calling
WriteFile(4096).  This patch reduces the amount of syscalls for large writes
by performing them with a single syscall, if possible.  If the buffer is
not empty at the moment when the large write occurs, it is first filled
up to its 4KB capacity, flushed, and the remaining part of the data is
written with a single syscall.

* file_io/win32/readwrite.c
  (write_buffered): Within the write loop, check if we have a situation
   with an empty buffer and a large chunk pending to be written.  In this
   case, bypass the buffering and write the remaining chunk with a single
   syscall.  Return an appropriate number of written bytes to satisfy
   the apr_file_write() function contract.
  (apr_file_write): Adjust call to write_buffered().

* test/testfile.c
  (test_large_write_buffered,
   test_two_large_writes_buffered,
   test_small_and_large_writes_buffered,
   test_write_buffered_spanning_over_bufsize): New tests.
  (testfile): Run the new tests.

Patch by: Evgeny Kotkov 

Index: file_io/win32/readwrite.c
===
--- file_io/win32/readwrite.c   (revision 1805607)
+++ file_io/win32/readwrite.c   (working copy)
@@ -290,9 +290,8 @@ static apr_status_t write_helper(HANDLE filehand,
 }
 
 static apr_status_t write_buffered(apr_file_t *thefile, const char *buf,
-   apr_size_t len)
+   apr_size_t len, apr_size_t *pwritten)
 {
-apr_size_t blocksize;
 apr_status_t rv;
 
 if (thefile->direction == 0) {
@@ -306,20 +305,41 @@ static apr_status_t write_buffered(apr_file_t *the
 thefile->direction = 1;
 }
 
-rv = 0;
-while (rv == 0 && len > 0) {
-if (thefile->bufpos == thefile->bufsize)   /* write buffer is full */
+*pwritten = 0;
+
+while (len > 0) {
+if (thefile->bufpos == thefile->bufsize) { /* write buffer is full */
 rv = apr_file_flush(thefile);
+if (rv) {
+return rv;
+}
+}
+/* If our buffer is empty, and we cannot fit the remaining chunk
+ * into it, write the chunk with a single syscall and return.
+ */
+if (thefile->bufpos == 0 && len > thef

Re: svn commit: r1805897 - /subversion/trunk/subversion/libsvn_fs_fs/fs_fs.c

2017-08-23 Thread Evgeny Kotkov
Stefan Sperling <s...@apache.org> writes:

> \o/
>
> Can you please update the 1.10 release notes accordingly?

Will do, thanks for the reminder.


Regards,
Evgeny Kotkov


Re: [RFC] Using LZ4 compression by default

2017-08-23 Thread Evgeny Kotkov
Stefan Sperling <s...@elego.de> writes:

>> With all that in mind, I propose that we do (2).  Any objections?
>
> I want (2) as well. Thanks for doing this work, and for very clearly
> and transparently describing the tradeoffs and our options.

Johan Corveleyn <jcor...@gmail.com> writes:

>> With all that in mind, I propose that we do (2).  Any objections?
>
> I want (2) as well. Thanks for doing this work, and for very clearly
> and transparently describing the tradeoffs and our options.

+1

Thank you for the comments.

I committed the core change that makes LZ4 the new default for
format 8 repositories in https://svn.apache.org/r1805897


Regards,
Evgeny Kotkov

