On Sat, 2010-10-09, Johan Corveleyn wrote: > On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad <julian.f...@wandisco.com> wrote: > > On Sat, 2010-10-09, Johan Corveleyn wrote: > >> Ok, third iteration of the patch in attachment. It passes make check. > >> > >> As discussed in [1], this version keeps 50 lines of the identical > >> suffix around, to give the algorithm a good chance to generate a diff > >> output of good quality (in all but the most extreme cases, this will > >> be the same as with the original svn_diff algorithm). > >> > >> That's about the only difference with the previous iteration. So for > >> now, I'm submitting this for review. Any feedback is very welcome :-). > > > > Hi Johan. > > Hi Julian, > > Thanks for taking a look. > > > I haven't reviewed it, but after seeing today's discussion I had just > > scrolled quickly through the previous version of this patch. I noticed > > that the two main functions - find_identical_suffix and > > find_identical_suffix - are both quite similar (but not quite similar > > enough to make them easily share implementation) and both quite long, > > and I noticed you wrote in an earlier email that you were finding it > > hard to make the code readable. I have a suggestion that may help. [...] > > So I wrote a patch - attached - that refactors this into an array of 4 > > sub-structures, and simplifies all the code that uses them. [...] > Yes, great idea! That would indeed vastly simplify a lot of the code. > So please go ahead and commit the refactoring.
OK, committed in r1021282. > Also, maybe the last_chunk number could be included in the file_info > struct? Now it's calculated in several places: "last_chunk0 = > offset_to_chunk(file_baton->size[idx0])", or I have to pass it on > every time as an extra argument. Seems like the sort of info that > could be part of file_info. It looks to me like you only need to calculate it in exactly one place: in increment_pointer_or_chunk(). I wouldn't worry about pre-calculating for efficiency, as it's a trivial calculation. > One more thing: you might have noticed that for find_identical_suffix > I use other buffers, chunks, curp's, endp's, ... than for the prefix. > For prefix scanning I can just use the stuff from the diff_baton, > because after prefix scanning has finished, everything is buffered and > pointing correctly for the normal algorithm to continue (i.e. > everything points at the first byte of the first non-identical line). > For suffix scanning I need to use other structures (newly alloc'd > buffer etc), so as to not mess with those pointers/buffers from the > diff_baton. > Maybe I can give it a go, > first suffix then prefix, and see if I can find real-life problems ... > So: I think I'll need the file_info struct to be available out of the > diff_baton_t struct as well, so I can use this in suffix scanning > also. Yes; I gave the struct type a name - "struct file_info" - so you can use it in other places. > (side-note: I considered first doing suffix scanning, then prefix > scanning, so I could reuse the buffers/pointers from diff_baton all > the time, and still have everything pointing correctly after > eliminating prefix/suffix. But that could give vastly different > results in some cases, for instance when original file is entirely > identical to both the prefix and the suffix of the modified file. So I > decided it's best to stick with first prefix, then suffix). And in a later email you wrote: > [...] Maybe I can give it a go, > first suffix then prefix, and see if I can find real-life problems ... There's no need to re-visit that. I think it's fine as is it, using a separate pair of buffers. > > Responding to some of the other points you mentioned in a much earlier > > mail: > > > >> 3) It's only implemented for 2 files. I'd like to generalize this for > >> an array of datasources, so it can also be used for diff3 and diff4. > >> > >> 4) As a small hack, I had to add a flag "datasource_opened" to > >> token.c#svn_diff__get_tokens, so I could have different behavior for > >> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is > >> no longer needed. > > > > Yes, I'd like to see 3), and so hack 4) will go away. > > I'm wondering though how I should represent the datasources to pass > into datasources_open. An array combined with a length parameter? > Something like: > > static svn_error_t * > datasources_open(void *baton, apr_off_t *prefix_lines, > svn_diff_datasource_e[] datasources, int datasources_len) > > ? And then use for loops everywhere I now do things twice for the two > datasources? I haven't looked at how datasources_open() API is used or what form of API would be best. For the internal functions, you can now pass around an array of file_info pointers rather than indexes. > >> 5) I've struggled with making the code readable/understandable. It's > >> likely that there is still a lot of room for improvement. I also > >> probably need to document it some more. > > > > You need to write a full doc string for datasources_open(), at least. > > It needs especially to say how it relates to datasource_open() - why > > should the caller call this function rather than that function, and are > > they both necessary - and how the caller is supposed to use the > > 'prefix_lines' parameter. And doc strings for > > increment/decrement_...(). > > > > But this makes me think, it looks to me like this whole > > prefix-suffix-skipping functionality would fit better inside the > > lower-level diff algorithm rather than inside the "datasources_open" > > function. Do you agree? > > > > It works as it is, but you were talking about wanting it to obey the > > standard token-parsing rules such as "ignore white space", and so on. > > It seems that things like this would be much better one level down. > > Yes, I've been struggling with this. But I can't easily see it fit in > the lower levels right now. Problem is that everything in those lower > levels always acts on 1 datasource at a time (getting all the tokens, > inserting them into the token tree, ... then the same for the next > datasource). The datasource_open seemed to me to be the easiest place > to combine datasources to do things for both of them "concurrently" > (with least impact on the rest of the code). I expect you're right. You've studied this; I haven't. > Maybe those lower-level functions could also be made > "multi-datasource", but I have to think a little bit more about that. My gut feeling is that's probably not the way to go. I'll respond separately to your follow-up email about this. > >> 6) I've not paid too much attention to low level optimizations, so > >> here also there's probably room for improvement, which may be > >> significant for the critical loops. > > > > Maybe. Not important at this stage. > > > >> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'", > >> in diff_file.c#find_identical_suffix. To completely eliminate this, I > >> should probably make all "chunks" of type apr_off_t instead of int > >> (but I have not done that yet, because the original implementation > >> also used int for the chunk in the svn_diff__file_baton_t struct). > >> Should I do this (also in the baton struct)? Or should I use an > >> explicit cast somewhere? > > > > I recommend using an explicit cast where needed, in this patch. > > Changing the 'chunk' data type everywhere could be done in a separate > > patch, either before or after this patch. > > Ok, will do. > > One more thing I was wondering about: maybe increment_... and > decrement_... should (eventually) be implemented as macro's? In fact, > they are mostly just replacements for curp++ and curp-- but taking > into account that we have chunks. Just a thought ... If profiling shows that the function-call overghead is too high, then I suggest that you make just the simple part of it a macro, like this: "if ((f)->curp < (f)->endp - 1) (f)->curp++; else function_call(...);". > Thanks again for your input. My pleasure. - Julian