On Sat, Feb 13, 2010 at 7:07 PM, xor <xor at gmx.li> wrote:
> On Sunday 14 February 2010 01:00:16 xor wrote:
>
>> I wonder why you do not want the interleaved scheme for all multi-segment
>> files? Why the arbitrary choice of 80 MiB files?
>>
>> It would suck if then people started to artificially bloat 50MiB files up
>> to 80MiB to improve their success rates...
>
> Oh I guess the answer was in your original message:
>> For files of 20 segments (80 MiB) or more, we move to the
>> double-layered interleaved scheme. I'm working on the interleaving
>> code still (it isn't optimal for all numbers of data blocks yet). The
>> simple segmenting scheme is better for smaller files, and the
>> interleaved scheme for large ones. At 18 segments, the segmentation
>> does better. By 20 segments, the interleaved code is slightly better.
>> By 25 segments, the difference is approaching a 1.5x reduction in
>> failure rates. (Details depend on block success rate. I'll post them
>> on the bug report shortly.)
Yeah, that's the answer. At 50M, simple segments do better than
interleaving. The 50M simple segment file is better than either one at
80M.

> ... Another question: Will you implement code to dynamically decide, based
> on file size, how much interleaving is needed? So that we do not have to
> modify anything even if people start inserting 1 TiB files?
>
> - It doesn't seem wise to have any assumptions about maximal file size, as
> it changes over the years.

There are a variety of options here. This scheme has several things to
recommend it. The decoder and encoder are very simple; the hard part is the
interleaver. Depending on what we decide for the metadata format, it's
entirely possible to structure it so we can change the interleaving but
still work with old decoders. That would require storing the segment layout
itself, rather than simply saying "compute the segment layout for n blocks
using scheme number x" and counting on the decoder having an implementation
of scheme x available.

That's not actually that big a penalty, though. Worst case, if we don't do
anything clever about packing it efficiently, it adds about 12 B of metadata
per data block; with careful compression I think it's ~4 B per data block
(we already spend 138 B per data block just storing the CHK URIs, though we
could reduce that to 64).

So my current recommendation is that I'll produce an interleaver scheme that
is better than simple segments for all files > ~80M and degrades slowly for
very large files (I'm not sure where that boundary is, but even as files get
very large it will outperform simple segments by a huge margin). Then we
store the full interleaving pattern as metadata. That makes the upgrade path
very smooth when we later decide that huge files (1 TiB? Bigger?) are an
issue.

Evan Daniel
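
To make the "store the full interleaving pattern as metadata" idea concrete,
here is a minimal sketch of what an explicit block-to-segment table could
look like. This is purely illustrative: the class and method names are
hypothetical, it is not Freenet's actual metadata format, and the naive
8 B/block encoding is just one point in the 12 B worst case / ~4 B compressed
range estimated above.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/**
 * Illustrative sketch only -- NOT Freenet's actual metadata format.
 * Instead of recording "segment layout = scheme number x for n blocks" and
 * hoping the decoder implements scheme x, the metadata records, for every
 * data block, which segment it belongs to and its position within that
 * segment. A decoder can then reassemble segments without knowing anything
 * about the interleaver that produced them.
 */
public final class ExplicitSegmentLayout {

    // blockToSegment[i] = segment index of data block i
    // blockToIndex[i]   = position of data block i within that segment
    private final int[] blockToSegment;
    private final int[] blockToIndex;

    public ExplicitSegmentLayout(int[] blockToSegment, int[] blockToIndex) {
        this.blockToSegment = blockToSegment;
        this.blockToIndex = blockToIndex;
    }

    /**
     * Naive serialization: two 4-byte ints per block, i.e. 8 B/block before
     * any clever packing. The message above estimates ~12 B/block worst case
     * and ~4 B/block with careful compression; the exact figure depends on
     * what else is stored alongside the layout.
     */
    public void writeTo(DataOutputStream out) throws IOException {
        out.writeInt(blockToSegment.length);
        for (int i = 0; i < blockToSegment.length; i++) {
            out.writeInt(blockToSegment[i]);
            out.writeInt(blockToIndex[i]);
        }
    }

    public static ExplicitSegmentLayout readFrom(DataInputStream in) throws IOException {
        int n = in.readInt();
        int[] seg = new int[n];
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            seg[i] = in.readInt();
            idx[i] = in.readInt();
        }
        return new ExplicitSegmentLayout(seg, idx);
    }

    /** The only question the decoder ever asks: where does block i go? */
    public int segmentOf(int block) { return blockToSegment[block]; }

    public int indexInSegment(int block) { return blockToIndex[block]; }
}

The point of the sketch is simply that the metadata carries the layout
itself, so the interleaver can change later without breaking decoders that
only know how to read the table.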
