Hi all, I've had a bit of experience with Hackage 2 and acid-state now, and I'm not convinced that it's the best fit for us:
* It's slow. It takes about 5 minutes for me to stop and then start the server. It's actually surprising just how slow it is, so it might be possible/easy to get this down to seconds, but it still won't be instantaneous.

* Memory usage is high. It's currently in the 700M-1G range, and to get it that low I had to stop the parsed .cabal files from being held in memory (which presumably has an impact on performance, although I don't know how significant that is) and disable the reverse-dependencies feature. Memory usage will grow at least linearly with the number of package versions in Hackage.

* Only a single process can use the database at once. For example, if the admins want a tool that makes it easier for them to approve user requests, then that tool has to be integrated into the Hackage server (or talk to it over HTTP) rather than being standalone.

* The database is relatively opaque. While in principle tools could be written for browsing, modifying or querying it, currently none exist (as far as I know).

* The above two points mean that, for example, there was no easy way for me to find out how many packages use each top-level module hierarchy (Data, Control, etc.). This would have been a simple SQL query if the data had been in a traditional database, but as it was I had to write a Haskell program to process all the package .tar.gz's and parse the .cabal files by hand.

* acid-state forces us to use a server-process model, rather than having a process per request run by Apache. I don't know whether we would have made this choice anyway, so this may or may not be an issue. But the current model does mean that adding a feature or fixing a bug means restarting the process, rather than just installing the new program in place.

Someone pointed out that one disadvantage of traditional databases is that they discourage you from writing as if everything were Haskell data structures in memory.
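For what it's worth, the counting step in that module-hierarchy query is trivial once the module names have been pulled out of the .cabal files; a minimal sketch (the function names here are made up, and the .tar.gz/.cabal extraction is elided):

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

-- Take the top-level component of a dotted module name,
-- e.g. "Data.List" -> "Data".
topLevel :: String -> String
topLevel = takeWhile (/= '.')

-- Count how many modules fall under each top-level hierarchy.
countHierarchies :: [String] -> Map String Int
countHierarchies mods =
    Map.fromListWith (+) [ (topLevel m, 1) | m <- mods ]

main :: IO ()
main = print (countHierarchies
    ["Data.List", "Data.Map", "Control.Monad", "System.IO"])
-- prints: fromList [("Control",1),("Data",2),("System",1)]
```

The awkward part is not this fold, of course, but having to write the extraction pipeline at all instead of issuing a single query.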
For example, if you have things of type

    data Foo = Foo { str :: String, bool :: Bool, ints :: [Int] }

stored in a database, then you could write either

    foo <- getFoo 23
    print $ bool foo

or

    b <- getFooBool 23
    print b

The former is what you would more naturally write, but it would require constructing the whole Foo from the database (including reading an arbitrary number of Ints). The latter is thus more efficient with the database backend, but emphasises that you aren't working with regular Haskell datastructures. This is even more notable with the Cabal types (like PackageDescription), as the types and various utility functions already exist - although it's currently somewhat moot, as the current acid-state backend doesn't keep the Cabal datastructures in memory anyway.

The other issue raised is performance. I'd want to see (full-size) benchmarks before commenting on that.

Has anyone else got any thoughts?

On a related note, I think it would be a little nicer to store blobs as e.g.

    54/54fb24083b14b5916df11f1ffcd03b26/foo-1.0.tar.gz

rather than

    54/54fb24083b14b5916df11f1ffcd03b26

I don't think that this breaks anything, so it should be noncontentious.

Thanks
Ian
_______________________________________________
cabal-devel mailing list
cabal-devel@haskell.org
http://www.haskell.org/mailman/listinfo/cabal-devel