On Thu, Oct 23, 2008 at 10:29:37PM +0100, Ganesh Sittampalam wrote: > Hi, > > I've been thinking in detail about how to optimise darcs annotate, which > I hope to do at the sprint this weekend. I'd appreciate comments on my > current plan, which is not yet fully fleshed out. > > - The basic idea is to keep track of what patches might touch what files, > as has been suggested in past discussions about this topic. > > - In each repo, maintain a local numbering of all the patches in the > repo. This will facilitate referring to the patches efficiently. If a > patch is deleted or commuted, there is no need to remove it from the > numbering, so this might end up as a superset of the patches currently > present.
Why not just use the hashes that we already store (i.e. in a hashed repository)? I don't see what an additional layer of indirection will gain us except a headache trying to keep everything synchronized. > - Maintain equivalence classes of (a) filenames and (b) directory > names. I still haven't quite worked out the details of this, but > the basic idea is that two names are related if there's ever been a > move operation from one to the other in the repo history (again, > supersets are allowed), and then we take the > reflexive-symmetric-transitive closure to make the equivalence > relation. We need to track directories and files separately because > a directory move can implicitly cause file moves, and commutation > means that we can't be sure what the complete set of affected files > is at the point of the move. A far simpler approach would be to only optimize for currently-existing files and directories, which is the common use case. Then we have a very simple structure that's easy to update and use, and we can fall back to the existing code for removed files or directories. Alternately, you could maintain a database using a repository-specific unique ID for each file, which would just be its original filename and the ID of the patch that created it, and then we'd just need a mapping from current filenames to file IDs and the patch database could be keyed on the unique ID, so that for any file you'd be able to look up precisely which patches affected that file. > - Maintain a mapping from the equivalence classes to sets of patches > (referred to by the numbering). A patch is in the relevant set if it > touches any member of the equivalence class. I'll repeat what I mentioned above: it's faster and better to refer to patches by hash than by number. It takes a bit more space, but that shouldn't be a significant downside, and the upside is that you have easy O(1) lookup of patch name and contents, and potentially a human-readable database (assuming the humans don't mind grubbing around in _darcs/patches/). > - Implement some kind of restricted repository/patch reader (perhaps by > adding parameters to the existing code) that is constrained to a set of > patches and a set of names. This should run much quicker than a full > repository read, but be equivalent from the point of view of the annotate > code itself if given an appropriate restriction based on the file the > user wanted to be annotated. The equivalence classes should ensure that > we don't care if the user specifies a patch as well as a file. Right. Okay, I think I'm beginning to understand now. These equivalence classes make your database more awkward to implement and slower to use, but allow us to reuse more of the crufty old annotate code... and to have a single code path, which is also nice. > - I'm planning on using Data.Binary for persisting the structures > described above. Probably with a version number at the beginning for any > backwards compatibility issues. I'd definitely prefer not to do this. It stinks of premature optimization. Why not just use the same sorts of data structures we're already using? All you really need to do to improve annotate is to change its scaling, and we've already got all the data structures in place that you need, so far as I can tell. I would far prefer a text format akin to the hashed repository format. It's easy to atomically update, and it's easy to work with. And it would enable features like running darcs annotate on a remote repository--which could be extremely stupid, but could also be convenient--or lazily grabbing this database when doing a darcs get. This latter possibility would be very nice, as it would mean you could run annotate on a huge repository without downloading all the patches. David _______________________________________________ darcs-users mailing list [email protected] http://lists.osuosl.org/mailman/listinfo/darcs-users
