On Friday 13 August 2010 08:54:11 Denys Vlasenko wrote:
> On Thu, Aug 12, 2010 at 8:26 AM, Rob Landley <[email protected]> wrote:
> > 3) you may want these patches:
> >
> >
> > http://landley.net/hg/toybox/rev/362
> > http://landley.net/hg/toybox/rev/375
> >
> > And optionally some debug crap:
> > http://landley.net/hg/toybox/rev/376
>
> I just took the latest source from hg and put it into patch_toybox.c,
> then in scratch build renamed it to patch.c.
Woot.
> I need a few changes/fixes there before I can put it into bbox:
>
> (1) It fails one test which current bbox's one does not fail.
> With this patch:
>
> --- input.doesnt_exist Jan 01 01:01:01 2000
> +++ input Jan 01 01:01:01 2000
> @@ -1,2 +1,3 @@
> qwe
> +asd
> zxc
>
> "patch -R" tries to patch file named "input.doesnt_exist"
> whereas it should patch file "input".
Actually there's no standard about which one it should patch when the two
filenames differ. (I researched this when implementing patch. This is why
everybody does -p1, so that the old and new paths under the directory you diff
can be identical and this issue gets glossed over.)
Of course gnu patch has heuristics to try to guess what you'd like to do. (It
doesn't change what you typed behind your back and pop up a talking paper
clip, but you can tell it _wants_ to...) And like fuzz factor, it can
occasionally do the wrong thing and bite you. But my research was long enough
ago that I remember the annoyance but not the details...
Speaking of research, digging into my blog, the "Right, screw it. Time to
implement patch in toybox." entry (when I broke gnu patch) was December 11,
2007. And then the rest of the month is full of comments on patch. I should
dig them up and post them here for posterity:
Implementing patch for toybox, I am learning _so_ much about the weird
corner cases of diff and patch. (This is a fairly common occurrence, I
learned tons about mount implementing it, learned sed by implementing it,
and so on. For one thing, unified diffs can't have hunks out of order. (If
they do, the gnu version of patch fails with a specific error message for
this. If you reorder and then renumber them, the first one applies with an
offset and the second fails.) So it's doing a simple one pass
substitution.
December 15, 2007:
Got the first half of patch checked into toybox. The patch output is
going to stdout, but it seems to be applying hunks properly now. With an
offset even, albeit no fuzz factor yet.
I need to think more about the offset logic. Right now it doesn't care
where the hunks say they are (in terms of starting line number), it just
tries to apply them all in order based on context lines. This is probably
good enough for most real world uses, although the downside is that a single
failed hunk means all the other hunks in the patch fail too. In order to
improve on it I'd probably want to read in all the hunks for a file and try
to apply them in parallel. Worry about it later...
December 17, 2007:
Banging on patch.c. Got it working... except for inserting new files,
which breaks a couple of rules. (Patching a file that doesn't exist,
zero context lines.) Deleting an entire file is another unimplemented corner
case, again zero context lines and deleting the resulting file. Also, right
now I'm using the +++ line as the filename, which works for everything
except comparing a file with /dev/null to delete it. Sigh.
December 19, 2007:
One more swing at patch (I keep getting distracted from it). Got add
and remove implemented, although not debugged yet. I need to make it
understand the git "move" syntax too, because renaming files is one of
those things people actually do, and "delete whole file, insert whole file"
is a _horrible_ way to represent that.
December 26, 2007:
Implemented the rest of patch add/remove in toybox. Comment from 60
seconds ago, "That may actually have worked". Firmware Linux seems to be
building with it. Fingers crossed... I still need to implement
"move" support, ala git. And I need to go re-read Brahm Cohen's blog
(the creator of bittorrent) to see what his new diff algorithm was all
about. But hey... Progress.
December 28, 2007:
And patch broke in the uClibc build, because if a hunk is right up against
the end of the file there aren't as many trailing context lines as starting
context lines. (There may not, in fact, be any.) So fewer ending than
starting context lines means match end of file. The start of file case is
sort of already handled by the current code; with zero context lines we
match immediatey (start of file unless it's not the first hunk, which is a
"don't go there" case). Trimming trailing context lines shouldn't hurt too
much because the start of the file is easy to find reliably.
December 29, 2007:
Yet another corner case in "patch". I'd assumed there would always be
at least one non-patch line between each patch, so what I was doing was
naievely reading each hunk until the first line that didn't start with a
plus minus or space char, and then trimming any extra context lines (more
than the starting context line).
This was _really_stupid_, and I honestly don't know what I was thinking.
It was small, simple, and wrong. Here's how that breaks when you cat
multiple patch files together:
--- filename
+++ filename
@@ -42,7 +42,7 @@
context
context
context
-line
+line
context
context
context
--- filename
+++ filename
...
The length of the patch is given in the @@ line at the top: the -xx,7 means
7 lines of old content, the +xx,7 means 7 lines of new content. Use that
and concatenating patches is no problem (and context line trimming isn't
necessary either, so it actually probably winds up smaller.)
My code's even already parsing @@ lines, although I just through it would
just be for error reporting purposes. The -42 and +42 are the starting line
at which to expect the start of the hunk (in the old and new file,
respectively), which is only really useful to tell you at what offset you did
find the hunk. I even thought about searching for multiple patch locations
and taking the closest one, but traditional patch just locks to the first
three matching lines of context and fails the hunk if the rest of the patch
doesn't match. (Unless of course you implement "fuzz factor" support, which
is often <a href=http://kernelslacker.livejournal.com/7336.html>a bad
idea</a>
There is a corner case where repeated context lines could make it miss a
patch location and fail to apply. If the first three context lines are "aab"
and the file has "aaab", then matching the first two and failing the third
would flush those three already-evaluated lines from the previous supposed
match to the output, and they wouldn't be re-scanned for another match later
down. This is an "I can fix that, but is it worth enlarging/complicating the
algorithm"?
I could make the thing _darn_ reliable (and fuzz factor implementation
trivial) by loading the whole file into memory and running each hunk against
the in-ram copy. It could apply overlapping or out-of-order hunks, use the
closest match in case of duplicate match sites... But it would also eat
much more memory at runtime, and I'm trying to do embedded software...
Maybe I could do something with mmap(MAP_PRIVATE)? Eh, worry about it
later, get this implementation working first. And _that_ means a rewrite of
the hunk reading code.
That's it through 2007, I should dig up the 2008, 2009, and 2010 entries and
post 'em too, but I'll worry about that later, this is already long enough...
> (2) printf("patching %s\n", name) should be changed into
> printf("patching file %s\n", name) to mimic standard patch.
If you'd like.
Most of my work on this predated SUSv4, which _finally_ mentions -u:
http://www.opengroup.org/onlinepubs/9699919799/utilities/patch.html
It might be nice to compare what the standard finally got around to saying with
what reality's been doing for the past 20 years, if only for the amusement
value. (for example, the new diff standard insists that the timestamp always
be output, when in reality it's optional.)
> (3) it does not support -N. bbox one does.
It's got a longish todo list of things that could be added.
> I fixed (1) and (2) and committed the result to git. So you can pull it
> and hack on it.
I'm not sure "fixed" is the right word for (1), I suspect you just grabbed one
of the many behaviors the gnu heuristic could squeeze out, which may or may
not have been an improvement.
If you want to follow the gnu-gnu-gnu-dammit project's heuristic, then try the
+++ name first and if it's not there try the --- name. Except for reversed
patches, where you try them in the opposite order. And of course when you're
creating files said file not existing is actually your _success_ condition.
(Deletion works like modification, you find the file, then you operate on it.
Creation you have to guess right first try.)
I believe the logic is that the "+++" name is the one you're trying to turn
the thing _into_, so it should be the first choice of name you have when you
finish. But when you reverse the patch, you go the other way. Except there's
a corner case where you're reversing a patch that deleted a file, and the goal
isn't to figure out what the patch _says_ to do but to reverse what the
creation operation would have done.
Most of the time both of the listed names aren't there, so you just pick
whichever one exists. But this opens the can of worms of getting consistent
behavior when they _are_ both there, which turns out to be nontrivial.
I just punted and went with something simple and consistent, I think it was
"I'm always using the +++ name for normal patching and --- for reverse, and if
that's not good enough use -p1 already".
Hey, it turns out that Posix 2008 has an opinion on this:
http://www.opengroup.org/onlinepubs/9699919799/utilities/patch.html#tag_20_90_13_02
And their filename determination section doesn't even mention the "reverse"
option. (Standards committees that don't quite understand the utility they're
documenting, where have I heard that before...)
I wouldn't feel bound by their heuristic (for one thing, we're not prompting
for filename, that never did anybody any good... and SCCS? In 2008? Really,
POSIX committee? You're _serious_?)
And of _course_ they don't document the git file delete/rename semantics. That
would be too obvious. It looks like this, by the way:
http://www.spinics.net/lists/linux-doc/msg00095.html
And is documented here:
http://www.kernel.org/pub/software/scm/git/docs/git-diff.html
See "Generating patches with -p".
> I didn't fully "busyboxify" it, need to restore -N first (and add a test
> for it, we don't have one).
As I said, there's a todo list. (In a comment, I think.)
It needs a config entry for the "extra" behavior, and a determiniation about
what the minimal behavior _is_. (Pretty much the stuff I implemented was what
I actually needed to patch packages in the wild. And that was with -p1, since
that's what people who know what they're doing post to lists, and what git and
hg produce...)
> > I'm also working on a patch to implement -l support (squash whitespace)
> > if you're interested. And at some point I should do fuzz factor support,
> > and better error reporting. And add a CONFIG symbol to chop out most of
> > the command line options...
>
> I will try to not get in the way (in patch.c) for now then.
Oh go ahead and have fun, I'll migrate my stuff when you get to a good stopping
point. :)
Thanks for doing this, by the way. I'm proud of the work I did in toybox, but
it proved what I set out to, and that effort would be better spent cleaning up
busybox...
Rob
--
GPLv3: as worthy a successor as The Phantom Menace, as timely as Duke Nukem
Forever, and as welcome as New Coke.
_______________________________________________
busybox mailing list
[email protected]
http://lists.busybox.net/mailman/listinfo/busybox