Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-16 Thread Frans Pop
On Wednesday 03 June 2009, Frans Pop wrote:
> As you may remember we had a problem before the release of Lenny with
> the l10n-sync script running wild and creating an insanely large Danish
> PO file for sublevel 4.
> This was eventually corrected, but the commits increasing the size of
> that da.po master file to eventually 250MB (and the same again spread
> out over the da.po files for several individual packages) are still there.
>
> These commits waste space on alioth and will also continue to cause
> problems, for example when people create a git-svn checkout [1].

I have done extensive testing and checks and am convinced there are no 
remaining issues with the cleanup method.

Unless there are strong objections I intend to perform the cleanup soon. 
I'll of course announce the date in advance; the repository will be 
unavailable for some time for commits, but I expect that will be less 
than 4 hours.

The result of the final cleanup method will be:
- SVN database will shrink, but only a small part is a result of the
  cleanup itself; mostly it is because the dump/load gets rid of cruft
  from old SVN versions;
- the cleanup will remove only broken l10n-sync commits and one
  incomplete early cleanup commit; no changes by users are lost or
  changed;
- the cause of the l10n-sync failure (broken PO file headers) is not
  removed, only the consequences (file corruption and extreme growth);
- these consequences are removed completely: after the cleanup the
  affected da.po files are all "clean", except for the broken headers;
- tagged versions from uploads of affected packages remain identical
  to what was uploaded to the archive because (as part of the cleanup)
  the corruption at the time of the upload is made part of the tag.

The main advantages of the cleanup are:
- a cleaner and more useful revision history for the affected files and
  packages;
- reduced risk of issues during future uses of the repository, such as
  git-svn checkouts, revision analysis, repository backup, possible
  repository conversion.

Cheers,
FJP


signature.asc
Description: This is a digitally signed message part.


Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-05 Thread Frans Pop
On Friday 05 June 2009, Bastian Blank wrote:
> On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> > As a result of the cleanup the 'svnadmin dump' file shrinks by more
> > than 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.
>
> A direct dump and load gives the following:
> | wa...@alioth:~$ du -s /svn/d-i/db
> | 2448288 /svn/d-i/db
> | wa...@alioth:~$ du -s debian/d-i/test/db
> | 1724144 debian/d-i/test/db

Cleaned version (with tagged releases now identical to existing tags!):
$ du -s repo/db
1716912 repo/db
$ cat repo/db/current
58721

So not a major difference. Fairly logical as the errors are repeating and 
thus compress well.
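
For reference, the three `du -s` figures above work out as follows. This is only a minimal sketch of the arithmetic; it assumes `du -s` reported 1 KiB blocks (the GNU default), and the constant and function names are made up for illustration.

```python
# Sizes as reported by `du -s` in this thread (assumed to be 1 KiB blocks).
ORIGINAL_KIB = 2448288   # /svn/d-i/db, the live repository
RELOADED_KIB = 1724144   # plain dump/load, no cleanup applied
CLEANED_KIB = 1716912    # dump/load with the cleanup applied


def saved_kib(before: int, after: int) -> int:
    """Return how many KiB were saved going from `before` to `after`."""
    return before - after


dump_load_saving = saved_kib(ORIGINAL_KIB, RELOADED_KIB)
cleanup_saving = saved_kib(RELOADED_KIB, CLEANED_KIB)
total_saving = dump_load_saving + cleanup_saving

print(f"dump/load alone saves {dump_load_saving} KiB")
print(f"the cleanup saves a further {cleanup_saving} KiB")
print(f"cleanup share of the total saving: {cleanup_saving / total_saving:.1%}")
```

So nearly all of the reduction comes from the dump/load cycle itself, which matches the observation that the repeated errors compress well.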


-- 
To UNSUBSCRIBE, email to debian-boot-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org



Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-05 Thread Bastian Blank
On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> As a result of the cleanup the 'svnadmin dump' file shrinks by more than 
> 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.

A direct dump and load gives the following:

| wa...@alioth:~$ du -s /svn/d-i/db 
| 2448288 /svn/d-i/db
| wa...@alioth:~$ du -s debian/d-i/test/db 
| 1724144 debian/d-i/test/db
| wa...@alioth:~$ cat /svn/d-i/db/current
| 58721
| wa...@alioth:~$ cat debian/d-i/test/db/current
| 58721

Bastian

-- 
It is a human characteristic to love little animals, especially if
they're attractive in some way.
-- McCoy, "The Trouble with Tribbles", stardate 4525.6




Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-05 Thread Frans Pop
On Thursday 04 June 2009, Bastian Blank wrote:
> > A tag is a copy, but the files are not actually copied. So if I
> > change the file in trunk in a revision before the tag, the tagged
> > version of the file will automatically change as well.
>
> You can change the file along with the copy operation.

Right, I see what you mean now. I've extended my awk script to do that:
- when a da.po file for a package is removed, it's also saved to a tmp
  dir, so I always have the latest version for each package available
- when I encounter a revision that creates a tag for the package, I read
  the last saved file back in (modifying the path to match the tag dir)
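
The bookkeeping described in the two points above could be sketched roughly like this. It is an illustration in Python rather than the awk actually used; the class name, the method names and the example tag path are all hypothetical, and a record is simply the list of dump-file lines for one node.

```python
from typing import Dict, List


class TagRestorer:
    """Remember the last removed da.po record per package, re-emit it for tags."""

    def __init__(self) -> None:
        self.saved: Dict[str, List[str]] = {}

    def save(self, package: str, record: List[str]) -> None:
        # A later save overwrites the earlier one, so only the latest survives.
        self.saved[package] = list(record)

    def restore(self, package: str, tag_dir: str) -> List[str]:
        # Re-emit the saved record with its Node-path pointing into the tag.
        out: List[str] = []
        for line in self.saved.get(package, []):
            if line.startswith("Node-path: "):
                line = f"Node-path: {tag_dir}/debian/po/da.po"
            out.append(line)
        return out


restorer = TagRestorer()
restorer.save("cdebconf", ["Node-path: trunk/packages/cdebconf/debian/po/da.po",
                           "Node-kind: file",
                           "Node-action: change"])
# "tags/cdebconf/0.136" is an assumed tag layout, used only for the example.
restored = restorer.restore("cdebconf", "tags/cdebconf/0.136")
print(restored[0])
```

Keeping only the latest record per package is what makes the tag match the upload: the tag gets the file as it was in trunk at the moment of tagging.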

The cleaned dump file it creates looks good; next is testing if it loads 
and checking resulting revisions.

New version of script attached (for posterity and the curious :-)

> Another problem: does anyone use dumps with deltas? The hotbackup
> script bundled with subversion does this.

No idea, though the subversion-tools package description says it's for bdb 
based repos. If someone does I hope they speak up.

Cheers,
FJP

Output of the awk script

Tag r55973 found for cdebconf
   282 lines restored
Tag r56032 found for partman-efi
   96 lines restored
Tag r56062 found for nobootloader
   227 lines restored
Tag r56074 found for partman-target
   264 lines restored
Tag r56090 found for flash-kernel
   92 lines restored
Tag r56092 found for silo-installer
   224 lines restored
Tag r56094 found for partman-ext2r0
   267 lines restored
Tag r56119 found for partman-palo
   80 lines restored
Tag r56157 found for sibyl-installer
   101 lines restored
Tag r56160 found for arcboot-installer
   168 lines restored
Tag r56399 found for quik-installer
   454 lines restored
Tag r56402 found for prep-installer
   133 lines restored
Tag r56404 found for yaboot-installer
   355 lines restored
Tag r56406 found for partman-prep
   81 lines restored
Tag r56408 found for partman-newworld
   106 lines restored
Tag r56411 found for cdebconf
   282 lines restored
Tag r56825 found for cdebconf
   282 lines restored

BEGIN {
start = 1
clean = 0
infile = 0
save_trans = 0
restore_trans = 0
}

# Set limits of cleaning operation
/^Revision-number: 55934/ {
clean = 1
}
/^Revision-number: 57134/ {
clean = 0
}

# New revision; close previous one
/^Revision-number:/ {
rev = substr($0, 18)
infile = 0
if (save_trans == 1) {
close(pfile)
save_trans=0
}

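# Restore last version of corrupted file for tagged version
if (restore_trans == 1) {
cnt = 0
# Skip first (blank) line
getline line < pfile
# Copy the saved node back in; the path is assumed to map onto the tag dir in npath
while ((getline line < pfile) > 0) {
if (line ~ /^Node-path:/) line = npath "/debian/po/da.po"
print line
cnt++
}
print "   " cnt " lines restored" >"/dev/stderr"
close(pfile)
restore_trans = 0
}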
}
# New file in current revision
/^Node-path:/ {
infile = 0
npath = $0
if (save_trans == 1) {
close(pfile)
save_trans = 0
}
}

# These are the files we want
/^Node-path: trunk.*\/(po\/sublevel4|cdebconf|nobootloader|flash-kernel|partman-(prep|newworld|target|ext2r0|efi|palo)|(silo|prep|quik|yaboot|sibyl|arcboot|vmelilo)-installer)\/.*da\.po/ {
# Save a copy of the last version we encounter
if ($0 !~ /\/sublevel4\//) {
s = match($0, "[^/]+/debian")
package = substr($0, s, RLENGTH - 7)
pfile = "tmp/" package ".sv"
save_trans = 1
}
infile = 1
}

# We're tagging a cleaned package => restore the da.po file to the
# uploaded version (last saved "cleaned" instance from trunk)
/^Node-copyfrom-path: trunk.*\/(cdebconf|nobootloader|flash-kernel|partman-(prep|newworld|target|ext2r0|efi|palo)|(silo|prep|quik|yaboot|sibyl|arcboot|vmelilo)-installer)$/ {
if (clean == 1) {
s = match($0, "[^/]+$")
package = substr($0, s)
pfile = "tmp/" package ".sv"
print "Tag r" rev " found for " package >"/dev/stderr"
restore_trans = 1
}
}

# The prevline construction is needed because if we restore a translation
# that needs to be done before the extra newline that starts a new revision
/.*/ {
if (clean == 0 || infile == 0) {
if (start != 1) {
print prevline
}
} else if (save_trans == 1) {
print prevline >pfile
}
start = 0
prevline = $0
}

END {
print prevline
}


Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread Christian Perrier
Quoting Frans Pop (elen...@planet.nl):
> As you may remember we had a problem before the release of Lenny with the 
> l10n-sync script running wild and creating an insanely large Danish PO 
> file for sublevel 4.


I can't comment deeply on your proposal, but I'd like to thank you for
taking care to repair that damage as much as possible, given that it
occurred mostly because I was not paying enough attention to the commit logs.

I'm highly confident that you'll take all the care needed to avoid
damaging the SVN repository, so the only thing I can really do is wish you
good luck and courage for a task that obviously needs to be done
with great care. Again, thanks.






Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread Bastian Blank
On Thu, Jun 04, 2009 at 11:28:06AM +0200, Frans Pop wrote:
> On Thursday 04 June 2009, Bastian Blank wrote:
> > On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> > > The way my cleanup works is that I remove all changes to the affected
> > > files made between revisions 55934 and 57133 (both inclusive).
> > > As a result of the cleanup the 'svnadmin dump' file shrinks by more
> > > than 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.
> >
> > Which sizes did you compare? The d-i repo still includes plenty of
> Current database versus reloaded cleaned database.
> > vdelta revisions from repository format <= 3. A dump/load cycle should
> > reduce the size anyway.
> Ah, that is possible. The other advantages remain though.

An easy estimate is the file size of the affected revisions in db/revs.

> > Working copies with references to these revisions get invalidated.
> Hmm, yes that could be. Did not consider that.
> But what risk is there that there _are_ (m)any working copies that 
> reference those revisions? The last commit I change was 08-01-2009, so 
> most users should have 'svn updated' by now.

Really low, and the workaround is to remove the "broken" directories.

> > > Because of the way tagging in subversion works, it is not possible to
> > > do the cleanup and still keep the tagged versions exactly as they
> > > were uploaded (see below for affected package versions).
> > Please explain. A tag is just a copy, which can also include
> > modifications.
> A tag is a copy, but the files are not actually copied. So if I change the 
> file in trunk in a revision before the tag, the tagged version of the 
> file will automatically change as well.

You can change the file along with the copy operation.

> > > If we are agreed, I will pick a day to do the actual cleanup. During
> > > part of that day the repository will be blocked for commits.
> > There is no need to block anything. You can only change intermediate
> > revisions, so the top is not affected.
> I don't see how I could manipulate intermediate revs without rebuilding 
> the database from the bottom up. What exact procedure are you referring 
> to?

I thought again and realized that the internal ids will not permit this.

Another problem: does anyone use dumps with deltas? The hotbackup script
bundled with subversion does this.

Bastian

-- 
Where there's no emotion, there's no motive for violence.
-- Spock, "Dagger of the Mind", stardate 2715.1




Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread Frans Pop
On Thursday 04 June 2009, Frans Pop wrote:
> On Thursday 04 June 2009, Bastian Blank wrote:
> > Working copies with references to these revisions get invalidated.
>
> OK. I'll test that and if it is a problem we'll have to warn about it.
> I don't think it's a huge problem if such users would have to do a new
> checkout.

I've tested that now and it is indeed an issue.

I installed the cleaned repo on a local server and then did a checkout of 
revision 56250 (in middle of cleanup) from the official repo. I then 
relocated the checkout to the local cleaned up repo and ran an svn up.

Result was that I got checksum mismatch errors for the affected da.po 
files.

But there's also a simple workaround. Just delete the parent directory of 
the "damaged" files, and svn will refetch that whole directory and 
continue happily.
So users only have to delete the affected packages/<package>/debian/po dirs
to repair the damage.
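
A minimal sketch of that workaround, assuming a working copy of the packages directory and using the package list from the original proposal. The function names and the `dry_run` switch are hypothetical; the sublevel 4 master file is not covered here, and a subsequent `svn up` is still needed to refetch the directories.

```python
import shutil
from pathlib import Path
from typing import List

# Package directories relative to the packages directory, taken from the
# "Affected files/packages" list in the original proposal.
AFFECTED_PACKAGES = [
    "cdebconf",
    "nobootloader",
    "flash-kernel",
    "partman/partman-prep",
    "partman/partman-newworld",
    "partman/partman-target",
    "partman/partman-palo",
    "partman/partman-ext2r0",
    "partman/partman-efi",
    "arch/sparc/silo-installer",
    "arch/powerpc/prep-installer",
    "arch/powerpc/quik-installer",
    "arch/powerpc/yaboot-installer",
    "arch/mips/sibyl-installer",
    "arch/mips/arcboot-installer",
    "arch/m68k/vmelilo-installer",
]


def po_dirs(packages_root: Path) -> List[Path]:
    """Return the debian/po directory of every affected package."""
    return [packages_root / pkg / "debian" / "po" for pkg in AFFECTED_PACKAGES]


def remove_po_dirs(packages_root: Path, dry_run: bool = True) -> List[Path]:
    """Delete the debian/po dirs that exist and return the ones found.

    With dry_run=True (the default) nothing is deleted; the list only shows
    what would be removed.
    """
    found = []
    for directory in po_dirs(packages_root):
        if directory.is_dir():
            if not dry_run:
                shutil.rmtree(directory)
            found.append(directory)
    return found
```

Run it first with the default `dry_run=True` to see which directories would go, then again with `dry_run=False`, followed by `svn up` in the working copy.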





Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread Frans Pop
On Thursday 04 June 2009, peter green wrote:
> > 3) The relevant versions are now no longer available anywhere [2]:
> > they are no longer in the archive and we don't have a snapshot.d.n
> > for that period.
>
> I don't think this statement is correct. snapshot.debian.net seems to
> have all dates up to and including 2009/03/28, that date is after the
> release of lenny rc1 afaict.

Last time I checked, and that was quite some time ago, I thought it had 
stopped updating completely. But it looks like you're correct and the 
affected versions are available.

After I sent the mail I decided that it would be a good idea to export the 
currently tagged versions and keep them separately on alioth somewhere, 
so that would cover that.

I still don't think it's a major issue, but thanks for the correction.

/me wonders when we'll be getting the long-promised snapshot.d.o...





Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread peter green



> 3) The relevant versions are now no longer available anywhere [2]: they
>    are no longer in the archive and we don't have a snapshot.d.n for that
>    period.
I don't think this statement is correct. snapshot.debian.net seems to 
have all dates up to and including 2009/03/28, that date is after the 
release of lenny rc1 afaict.


Note: the search function on snapshot.debian.net stopped updating long 
before the actual archiving stopped, and 2009 also doesn't appear in the 
index of archives (but the early 2009 material is accessible by typing 
the URLs manually).






Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread Frans Pop
Thanks a lot for the reply, Bastian.

On Thursday 04 June 2009, Bastian Blank wrote:
> On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> > The way my cleanup works is that I remove all changes to the affected
> > files made between revisions 55934 and 57133 (both inclusive).
> > As a result of the cleanup the 'svnadmin dump' file shrinks by more
> > than 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.
>
> Which sizes did you compare? The d-i repo still includes plenty of

Current database versus reloaded cleaned database.

> vdelta revisions from repository format <= 3. A dump/load cycle should
> reduce the size anyway.

Ah, that is possible. The other advantages remain though.

> Working copies with references to these revisions get invalidated.

Hmm, yes that could be. Did not consider that.
But what risk is there that there _are_ (m)any working copies that 
reference those revisions? The last commit I change was 08-01-2009, so 
most users should have 'svn updated' by now.

Hmm. I guess some translators who worked on their translations in that 
period and haven't been active since could have such a checkout. 

OK. I'll test that and if it is a problem we'll have to warn about it.
I don't think it's a huge problem if such users would have to do a new 
checkout.

> > Because of the way tagging in subversion works, it is not possible to
> > do the cleanup and still keep the tagged versions exactly as they
> > were uploaded (see below for affected package versions).
>
> Please explain. A tag is just a copy, which can also include
> modifications.

A tag is a copy, but the files are not actually copied. So if I change the 
file in trunk in a revision before the tag, the tagged version of the 
file will automatically change as well.

> > Essentially: not.
>
> This is incorrect. The effects are outlined in the Subversion FAQ and
> reference materials[1].

There does not seem to be anything there other than what we've already 
covered. We don't lose any revisions and all revisions + the state of HEAD 
remain completely identical to the current database.

> > If we are agreed, I will pick a day to do the actual cleanup. During
> > part of that day the repository will be blocked for commits.
>
> There is no need to block anything. You can only change intermediate
> revisions, so the top is not affected.

I don't see how I could manipulate intermediate revs without rebuilding 
the database from the bottom up. What exact procedure are you referring 
to?

Blocking the repo for a few hours shouldn't be a major inconvenience 
anyway. It's not like we have a high commit rate ATM.

> > BEGIN {
> > clean = 0
> > infile = 0
> > }
>
> [...]
>
> I think you want svndumpfilter.

I read about that, but I don't think it does what we need here: it only 
filters paths, not specific commits [1]. Anyway, my awk script is already 
there and I've tested that it does exactly what I want it to do.
My cleaned dump file loads without any problems and I've done fairly 
extensive checks with svnlook that the database is as it should be after 
the load.

Despite the warnings, the dumpfile format is relatively straightforward 
(and I did not use --incremental for my dump on purpose).

Thanks again,
FJP

[1] Hmm. Guess it could maybe be used, but I'd need to create a dumpfile 
for exactly the range to be cleaned and it would need to be run 
separately for each file to be excluded.
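
To make the difference concrete: the cleanup needs a filter on path and revision together, not on path alone. A rough sketch of that test in Python follows. The revision limits and the pattern come from this thread; the function name and the example paths are assumptions for illustration only.

```python
import re

CLEAN_FIRST = 55934  # first revision to be cleaned (inclusive)
CLEAN_LAST = 57133   # last revision to be cleaned (inclusive)

# Same pattern as the Node-path rule in the awk script.
AFFECTED = re.compile(
    r"^trunk.*/(po/sublevel4|cdebconf|nobootloader|flash-kernel"
    r"|partman-(prep|newworld|target|ext2r0|efi|palo)"
    r"|(silo|prep|quik|yaboot|sibyl|arcboot|vmelilo)-installer)/.*da\.po"
)


def should_drop(revision: int, node_path: str) -> bool:
    """True if this node change falls inside the cleanup range and hits a da.po file."""
    in_range = CLEAN_FIRST <= revision <= CLEAN_LAST
    return in_range and AFFECTED.search(node_path) is not None


# The paths below are assumed examples of the repository layout.
print(should_drop(56250, "trunk/packages/cdebconf/debian/po/da.po"))
print(should_drop(57134, "trunk/packages/cdebconf/debian/po/da.po"))
```

A path-only filter would also throw away the good commits to these files before and after the range, which is exactly what must not happen.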





Re: [RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-04 Thread Bastian Blank
On Wed, Jun 03, 2009 at 10:42:39PM +0200, Frans Pop wrote:
> The way my cleanup works is that I remove all changes to the affected 
> files made between revisions 55934 and 57133 (both inclusive).
> As a result of the cleanup the 'svnadmin dump' file shrinks by more than 
> 2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.

Which sizes did you compare? The d-i repo still includes plenty of
vdelta revisions from repository format <= 3. A dump/load cycle should
reduce the size anyway.

> As a result of the cleanup, some revisions (24 in total) become empty as 
> no other files were changed in that commit, but subversion handles this 
> without problems: a diff against the previous revision just shows empty. 
> I'll modify the revision comment to explain this. I'll also modify the 
> comments for revisions that caused the problem and the (now very small) 
> cleanup commits to explain the issue.

Working copies with references to these revisions get invalidated.

> Because of the way tagging in subversion works, it is not possible to do 
> the cleanup and still keep the tagged versions exactly as they were 
> uploaded (see below for affected package versions).

Please explain. A tag is just a copy, which can also include
modifications.

> Essentially: not.

This is incorrect. The effects are outlined in the Subversion FAQ and
reference materials[1].

> If we are agreed, I will pick a day to do the actual cleanup. During part 
> of that day the repository will be blocked for commits.

There is no need to block anything. You can only change intermediate
revisions, so the top is not affected.

> BEGIN {
>   clean = 0
>   infile = 0
> }
[...]

I think you want svndumpfilter.

Bastian

[1]: http://subversion.tigris.org/faq.html#removal
-- 
Is truth not truth for all?
-- Natira, "For the World is Hollow and I have Touched
   the Sky", stardate 5476.4.




[RFC] IMPORTANT: Cleaning l10n-sync damage from D-I SVN repository

2009-06-03 Thread Frans Pop
As you may remember we had a problem before the release of Lenny with the 
l10n-sync script running wild and creating an insanely large Danish PO 
file for sublevel 4.
This was eventually corrected, but the commits increasing the size of that 
da.po master file to eventually 250MB (and the same again spread out over 
the da.po files for several individual packages) are still there.

These commits waste space on alioth and will also continue to cause 
problems, for example when people create a git-svn checkout [1].

Today I've looked at options to clean up the worst of the mess and I think 
I've found something that will work, but has one important consequence 
that needs to be discussed.

At the bottom of the mail is a list of affected files and packages.

THE CLEANUP
===========
The way my cleanup works is that I remove all changes to the affected 
files made between revisions 55934 and 57133 (both inclusive).
As a result of the cleanup the 'svnadmin dump' file shrinks by more than 
2GB (!) and the repository database shrinks from 2.4GB to 1.7GB.

The cleanup starts _after_ the problems started, so the affected da.po 
files between the start of the problem (revision 55901) and the end of 
the cleanup are still not technically correct. However, they now remain 
only a little bit broken for the whole period instead of increasingly 
majorly broken.

As a result of the cleanup, some revisions (24 in total) become empty as 
no other files were changed in that commit, but subversion handles this 
without problems: a diff against the previous revision just shows empty. 
I'll modify the revision comment to explain this. I'll also modify the 
comments for revisions that caused the problem and the (now very small) 
cleanup commits to explain the issue.
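
Finding the revisions that end up empty can be done by scanning the cleaned dump file for revisions without any node records. A minimal sketch, assuming a simple line-based scan like the one the awk script uses (so file content that happens to look like a dump header would be misread); the function name is made up.

```python
from typing import Iterable, List, Optional


def empty_revisions(dump_lines: Iterable[str]) -> List[int]:
    """Return the revision numbers that contain no Node-path records."""
    empty: List[int] = []
    current: Optional[int] = None  # revision currently being scanned
    nodes = 0                      # Node-path lines seen in that revision
    for line in dump_lines:
        if line.startswith("Revision-number: "):
            if current is not None and nodes == 0:
                empty.append(current)
            current = int(line.split(": ", 1)[1])
            nodes = 0
        elif line.startswith("Node-path: "):
            nodes += 1
    # Do not forget the last revision in the file.
    if current is not None and nodes == 0:
        empty.append(current)
    return empty


sample = [
    "Revision-number: 1", "Node-path: trunk/a", "",
    "Revision-number: 2", "",
    "Revision-number: 3", "Node-path: trunk/b",
]
print(empty_revisions(sample))
```

Note that revision 0 of a real dump normally has no nodes either, so it would show up in the list as well.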

The cleanup procedure is described below.

THE PROBLEM
===========
The issue occurred right around the release of D-I Lenny RC1. The Lenny 
branch was created in the middle of the period and all the affected 
packages were uploaded: first because of changes or an l10n upload series 
and later after the errors in the Danish translation were corrected in 
the Lenny branch.

Because of the way tagging in subversion works, it is not possible to do 
the cleanup and still keep the tagged versions exactly as they were 
uploaded (see below for affected package versions).
However, IMO the "damage" is acceptable, for the following reasons:
1) My cleanup stops _before_ the correction of the Danish translations
   in the Lenny branch by Christian. This means that the tags for the
   versions uploaded as a result of that, and also all versions released
   with Lenny, are 100% identical to what was uploaded.
2) For affected releases before that, the only file that is "incorrect"
   is the da.po file; the tagged version is still 100% correct for all
   other files in the packages.
3) The relevant versions are now no longer available anywhere [2]: they
   are no longer in the archive and we don't have a snapshot.d.n for that
   period.

HOW DOES IT AFFECT USERS
========================
Essentially: not.

During the cleanup the repository will be locked for commits. Users would 
be advised not to try to do an svn up: it should do no harm except 
possibly for the short time I'll be moving the cleaned repo in place.

There is one minor effect for git-svn users who have the affected period 
in their history: their local git repository will no longer match the 
SVN repository. But in practice that can do absolutely no harm.

WHAT NOW?
=========
The main question is if people agree with me that this cleanup is a good 
thing and that the problem described is not serious enough to block it.
So: comments welcome!

If we are agreed, I will pick a day to do the actual cleanup. During part 
of that day the repository will be blocked for commits.

Cheers,
FJP

[1] Phil Hands' git-svn checkout got buggered as a result of this.
[2] Not completely true: D-I Lenny RC1 images are still on the mirrors,
but they will also disappear [3].
[3] BTW, looks like there are a number of old D-I releases in unstable
that could be cleaned up. FTP masters will appreciate it.


Affected files/packages
-----------------------
po/sublevel4/da.po

cdebconf/debian/po/da.po
nobootloader/debian/po/da.po
flash-kernel/debian/po/da.po

partman/partman-prep/debian/po/da.po
partman/partman-newworld/debian/po/da.po
partman/partman-target/debian/po/da.po
partman/partman-palo/debian/po/da.po
partman/partman-ext2r0/debian/po/da.po
partman/partman-efi/debian/po/da.po

arch/sparc/silo-installer/debian/po/da.po
arch/powerpc/prep-installer/debian/po/da.po
arch/powerpc/quik-installer/debian/po/da.po
arch/powerpc/yaboot-installer/debian/po/da.po
arch/mips/sibyl-installer/debian/po/da.po
arch/mips/arcboot-installer/debian/po/da.po
arch/m68k/vmelilo-installer/debian/po/da.po

Package versions that will have tags not 100% equal to upload
-------------------------------------------------------------
r55973 cdebconf 0.136
r55975 partm