On Mon, Aug 31, 2009 at 2:59 PM, Max Battcher<[email protected]> wrote: > > It sounds a lot like Petr's thoughts on chunky hunks to me... The idea that > once a chunk gets above a certain threshold, the chunk's data can easily be > dumped into hashed-storage as a separate unit and the patch itself then > merely contains a hash reference to that chunk data. > > In this particular case it simply becomes a matter of chunking that data > into hashed-storage as soon as possible during record to avoid huge memory > usage (keeping only the hash references in memory). (The only real > complication just being that there are good garbage collection policies in > place should the record be canceled.)
Having thought more about the problem and the merits of some of the potential solutions I have an update. I reread the documentation for Iteratees. It seems that what they excel at precisely processing lots of data with one of several goals: a) extracting a few values (efficient random access is even supported), b) reducing the data bit by bit to some smaller value (say, a checksum), c) doing a 'map' style transformation on the data stream. Unfortunately for anyone wanting to apply Iteratees like a hammer to a nail, this isn't going to gain us much with record as is. The reason is because of what darcs wants to do with patches once it reads them. Iteratees could definitely be used to make efficient patch parsers that don't hold excessive amounts of the patch in memory as long as we're doing (a), (b), or (c). The problem is that darcs wants to hold the whole sequence in memory. I guess this is where Petr's chunky hunks[1] shine. The patch summary is typically small and we could hold many in memory, but the details can be rather large and holding them in memory is sometimes equivalent to holding the whole directory tree of the repository in memory. The chunky hunks are a win because they decouple the summary from the details. Of course, by "patch summary", I mean enough info about the patch to commute it, at least in most cases, with other patches. Naturally, I started to wonder about ways of keeping our existing patch format but internally separating the patch summary from the details. Continuing to use record for inspiration, can we come up with other ways of decoupling the two? For example, could we keep the existing patch format for on disk but in memory use the chunky hunk representation? For example, during record we read in the pending, say using Iteratees, and instead of holding the full patches in memory, we instead hold in memory just a chunky hunk representation of the patches. Actually, I think the current patch parsing may be just fine for this; I just need to check how much memory it would take to read all the patches and only hold on to the chunky hunk level of details. A change like this requires at least the following: 1) Rewiring commute to operate on chunky hunks 2) Making the chunky hunk representations 3) Rewiring the UI to select changes on chunky hunks and display patch contents given a chunky hunk Other things that may be required: 4) Changing the patch parsing do work in a bounded memory stream fashion 5) Storing the pending chunky hunks on disk while the user is interacting with darcs 6) When showing hunks to the user, only reading as many bytes at a time as we're able to dump to the output Am I missing anything big? #5 requires clean up, and it could be viewed as an alternative solution to the whole problem I'm approaching, I think. The biggest drawback that I can think of for just going with a pure #5 based approach is that as near as I can tell, commute would still take O(m + n) memory given patches of size m and n. Whereas, I think with this layered approach we can reduce the memory required to commute to the size of the chunky hunks. Replace patches being a different case, I think, but in that case I think we could stream the hunk data and then never hold in memory more than some reasonable amount. In other words, I could probably fix the performance of record to be O(m) where, m is the size of a patch, with just #5, but then as soon as I do that I would turn to improving things further and be right back here. Eventually, we'll probably even want to not store the whole chunky hunk sequence in memory. I suspect that change will be less intrusive and something which is reasonable to postpone as a 'phase 2'. How does this proposal sound to people? If I were starting to work on this tomorrow to address the record case, I wouldn't have to worry about #1 for a while. The focus would be on #2, #3, and checking if #4 and #6 are needed. I guess I would work in a private repository until I had something testable. Although, I would be happy to publish it for those who want to follow along. Also on my list of things I want to work on: * Finishing type witnesses in the commands * Implementing RIO witnesses * Creating a written specification of the file system operations darcs does during various commands I feel like for the first two on that list I'm waiting for Petr's GSoC work to be merged. I'm not eager to finish the witness stuff just to have to redo a large portion of it or conflict with Petr's already done work. Would this record work be orthogonal to Petr's GSoC work, or am I going to need to wait on that too? (At some point, I'll be waiting on too much stuff and I'll get annoyed by that and just start working on one or more of the items.) Feedback appreciated! Thanks, Jason [1] http://web.mornfall.net/blog/patch_formats.html _______________________________________________ darcs-users mailing list [email protected] http://lists.osuosl.org/mailman/listinfo/darcs-users
