Hi,

Here are my initial thoughts about the hashed storage work. I'm still reading through it and I anticipate having more to say as I spend more time on it. I'm looking at both the hashed-storage package and the changes to darcs to make use of it.

I'll start by summarising what it's about, both as an introduction for others and so any misconceptions or omissions can be corrected.

I haven't yet had time to read through all the past mailing list traffic on the subject of hashed-storage etc, so please point me to any relevant posts if it seems appropriate. I will try to read through them asap.

The basic point of this work is to optimise and clean up the way that darcs interacts with the filesystem. The core of this is the new 'Tree' type, which replaces the old 'Slurpy' type. It differs most importantly by using explicitly embedded IO actions for each file or subdirectory, as opposed to the lazy IO of Slurpy.

The work is split into a new 'hashed-storage' package and some changes to darcs to make use of this package. In general the changes to darcs are relatively small and are often simplifications.

As its name suggests, hashed-storage takes over the work of dealing with hash-structured storage as used by the darcs 1.5/2 formats. It also introduces the index, which is a cache of the hashes (and related info) of an entire tree, stored in a single file. This allows for much faster operations in many cases. hashed-storage also deals with "plain" trees such as the working copy and old-style pristine.

There is also a 'TreeIO' monad, intended to provide an abstraction for working with 'Tree's and also some optimisations such as buffering up writes until a size threshold is reached.

And now my initial comments:

Overall I think something like this is clearly the right way to go and generally I think it looks nice, with easy to read code. Although I haven't been through the changes to darcs in great detail yet they generally look like providing significant improvements in simplicity.

My most important question is about the dividing line between hashed-storage and darcs-hs. The biggest weakness of hashed-storage as a separate package is that it has functions that are specific to Darcs repositories. To my mind, it either needs to abandon this knowledge, perhaps by abstracting it over a typeclass that is meaningful in itself, or some or all of hashed-storage needs to move into the darcs tree proper. Otherwise, changes to some parts of darcs will continue to require lockstep changes to hashed-storage - which is ok when Petr is developing the two together intensively on his own, but not really when other people are involved or over a longer timeframe. One option would be to move it all into the darcs tree but distribute it as a separate cabal package with a more darcs-specific name.

Onto more specific comments, which are mainly in note form:

- The naming of Darcs.Gorsvet is obscure and a barrier to future understanding of the code

- In Darcs.Gorsvet, is there any reason not to use bracket in mInCurrentDirectory (as a comment suggests)?

- The floatPath function from Storage.Hashed is documented as being unsafe and then used all over Darcs.Gorsvet. Needs an explanation of what's unsafe and why it's ok to use it in Darcs.Gorsvet.

- In various places (e.g. deleting on windows, corrupt pending etc) a file gets renamed to a fixed different name, which could cause problems if it already exists. There should be some standard scheme to avoid collisions.

- We need to sort out haskell_policy, the ratification of readFile all over the place is just silly. Replacing it with hlint as has been suggested sounds like a very good idea.

- I think treeDiff is actually relatively unlikely to be wrong because it delegate s actual line by line diffing to Patch.Prim which hasn't changed. Storage.Hashed.Diff exists but seems unused, what's the status and plans there?

- unrecordedChanges - Why do IgnoreTimes and LookForAdds lead to ignoring the index?

- Presumably biggest risk is random bugs with invalid indexes. An explanation of when the index can become invalid would be helpful (there may be one in the code that I haven't spotted yet).

- Storage.Hashed.Index: code is full of magic numbers

- Petr and I have already discussed overloading Tree and TreeIO over the container (IO). This would make writing test rigs easier, and mean that code that processes Tree could be polymorphic where appropriate. Overloaded code would typically constrain the container to be in Monad or perhaps Applicative.

- Win32 removeFile workaround: not used consistently, should be hidden behind some standard abstraction?

- I don't really understand what virtualTreeIO does and doesn't do.

- I think TreeIO should be abstract, i.e. a newtype, unless there is some value in exposing the fact that it is a StateT.

- hashedTreeIO never deletes things, so what removes dead stuff?

- why don't the two cases of hashedTreeIO.updateFile use the same code path?

- I'm very dubious about semantics of finish - why doesn't expandPath call it? Presumably the documented usage semantics of readIndex have something to do with this, but it feels fragile, and I only sort of understand the whole dirs_changed/finish stuff in readIndex'

That's all for now...

Cheers,

Ganesh
_______________________________________________
darcs-users mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-users

Reply via email to