On Saturday 13 February 2010 19:09:46 Evan Daniel wrote:
> This is somewhat long and mathy. If it's not explained well, please
> ask! Executive summary: by using better encoding schemes than our
> present splitfile segmentation, we can dramatically improve splitfile
> success rates in a variety of situations, while keeping the same basic
> FEC code and 2x redundancy rate.
>
>
> Introduction
>
> Our current splitfile scheme has a number of problems when it comes to
> data retention. On some files (mostly large ones) the success rates
> for the whole file will be low, despite a reasonably large fraction of
> the blocks being available. There are a variety of ways we can improve
> this (without increasing the number of check blocks); the most
> promising is a two-layered interleaved scheme similar to that used by
> CDs (CIRC). I've written a monte-carlo simulator to give expected
> success rates for a file based on file size, encoding scheme, and
> block-level success rate.
>
> This entire email is concerned with problems that happen at the bottom
> (data) layer of the splitfile only. Problems that stem from top blocks
> and upper layers and such are not addressed. (In other words:
> everything here assumes that all of the non-data layers have already
> been successfully downloaded.)
>
>
> Splitfiles as they are now
>
> Freenet splitfiles have 2x redundancy. A number of check blocks are
> created for each file, equal in number to the data blocks. If we had a
> theoretically perfect coding scheme, then to download a file with n
> data blocks, any n of its data and check blocks would suffice. For
> large files, that presents practical problems. The solution is
> segmented splitfiles. We group up to 128 data blocks together, and
> then use a Reed-Solomon code to generate an equal number of check
> blocks to make up a segment of up to 256 blocks total. (Since we use
> 32 KiB blocks, each segment is up to 4 MiB of data, or 8 MiB total.)
> The result is that each segment is independent: if one segment fails
> to download, there is no way to recover from that.
>
> This is the first cause of problems. A file with many segments that
> has had a moderate number of blocks drop off the network can have a
> low probability of success despite having plenty of blocks available.
> As the number of segments gets large, it becomes likely that somewhere
> the failed blocks will clump into one segment enough to make it fail.
> For example, if each individual block succeeds with probability
> p=0.58, then a single segment will succeed with p=0.99588, which is
> fairly good. But by four segments the file success rate has dropped to
> 0.984, and by 8 to 0.967. 3.3% failure isn't that high, but we've only
> reached a 32 MiB file. For a 700 MiB file, it's down to only a 48%
> chance of success. Even at a block success rate of 0.60 (segment
> success rate of 0.99951), the file success rate is only 92%. As the
> file size gets larger, we require a steadily higher block success rate
> to achieve high reliability.
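
To make the arithmetic above concrete, here is a minimal sketch (my own toy Python, not Freenet code) of the model behind those numbers: block fetches are treated as independent with probability p, and a segment decodes iff at least 128 of its 256 blocks are retrieved.

from math import comb

def segment_success(n, k, p):
    # P(at least k of n independently fetched blocks are retrieved)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def file_success(segments, n, k, p):
    # Segments are independent, so every single one of them must decode.
    return segment_success(n, k, p) ** segments

print(segment_success(256, 128, 0.58))    # ~0.996 for one full segment
print(file_success(8, 256, 128, 0.58))    # ~0.967 for 8 segments (32 MiB)
print(file_success(175, 256, 128, 0.58))  # ~0.48 for 175 segments (700 MiB)
print(file_success(175, 256, 128, 0.60))  # ~0.92 for the same file at p=0.60
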
> The second cause of problems is that we don't split the segments up
> evenly. In the worst case, a file with 129 data blocks becomes two
> segments, one of 128 data + 128 check blocks, and one with 1 data and
> 1 check block.
>
> I've made some diagrams of this. For clarity, the diagrams are drawn
> with a maximum of only 16 blocks per segment. Here's how a splitfile
> with 34 blocks would look with such segments:
> http://www.miranda.org/~evand/freenet/single_layer_1.svg
>
>
> Simple improvements
>
> The second problem is obviously much easier to solve than the first.
> We simply split the blocks evenly amongst the segments. Here's the
> analogous diagram:
> http://www.miranda.org/~evand/freenet/single_layer_2.svg
>
> This significantly improves retrievability for files that have only
> slightly more than a multiple of 128 data blocks, because larger
> segments are more reliable than smaller ones at the same block success
> rate. In fact, this effect is strong enough (when combined with
> segments being independent) that in some cases it is advantageous to
> use slightly lower redundancy in order to reduce the segment count.
> For example, at p=0.6, for a file with 134 data blocks, the
> two-segment splitfile (67 data + 67 check blocks in each segment) will
> fail 1.5% of the time. But if it is instead encoded as a single larger
> segment with reduced total redundancy (that is, 134 data and 122 check
> blocks), the failure rate falls to only 0.5% -- it's three times less
> likely to fail, despite using 5% less storage space.
>
> See https://bugs.freenetproject.org/view.php?id=2931 for some more
> discussion and lots of numbers.
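
That comparison drops straight out of the segment_success() sketch earlier in this mail, under the same independent-fetch assumption:

# Two segments of 67 data + 67 check blocks each, versus one segment of
# 134 data + 122 check blocks (any 134 of its 256 blocks suffice).
two_segments = segment_success(134, 67, 0.6) ** 2
one_segment = segment_success(256, 134, 0.6)
print(1 - two_segments, 1 - one_segment)   # roughly 0.015 vs. 0.005
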
> Multi-layer interleaving
>
> For large files, there are more sophisticated techniques than our
> simple segments that don't require larger encoding groups (and the
> associated increase in memory usage and disk seeks). Roughly speaking,
> the odds are very, very good that most segments can fetch at least one
> more block than is strictly required (i.e. 129 of 256 instead of only
> 128 of 256). At the same time, the odds are also high that *some*
> segment won't be able to fetch the required 128 blocks. So we do a
> two-layer scheme: outer segments that require 129 blocks to decode,
> and then those groups of 129 blocks are regrouped into a second layer
> of segments that have 128 data blocks and 1 check block. That means
> that if one outer segment gets very unlucky, but the others can
> download 129 blocks, then all the missing blocks from the one outer
> segment are dispersed, each showing up as the only missing block in
> some inner segment.
>
> Again, a diagram to illustrate. With 16-block segments, and 40 data
> blocks (5 full segments of 8+8), our current single-layer scheme looks
> like this:
> http://www.miranda.org/~evand/freenet/single_layer.svg
> The two-layer scheme keeps the same total number of check blocks, but
> instead computes some of them from the inner segments:
> http://www.miranda.org/~evand/freenet/double_layer.svg
> (In this diagram, outer segments require 9 of 16 blocks to decode, and
> inner segments require 8 of 9. Again, these diagrams are merely
> intended to illustrate. The file shown is actually small enough that
> the single-layer scheme performs better. It's only as the files get
> large that the double-layer scheme becomes important.)
>
> Returning to our hypothetical 700 MiB file (22400 data blocks in 175
> segments under the current scheme), with a block success rate of 0.6
> we have a file success rate of only 92%: it fails about 1 in 12 times.
> If instead we encode it as 175 outer segments (with 129 of 256 blocks
> required), and a second interleaved layer of 175 inner segments (with
> 128 of 129 blocks required), the success rate in my monte-carlo runs
> was 1000 out of 1000: a perfect record. Even at the lower p=0.58
> (where over half of simple segmented files fail), the double-layer
> success rate is still 1000/1000. At p=0.56, where the simple segmented
> file only succeeds 1.64% of the time, the double-layered structure
> manages a success rate of 978/1000.
>
> There is some more discussion of this at
> https://bugs.freenetproject.org/view.php?id=3370 . I'll be adding an
> update there with more numbers soon.
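
Those double-layer numbers are easy to sanity-check. Here is a toy monte-carlo sketch of the decode (my own illustration, not Evan's simulator, and the round-robin interleaving below is just one plausible arrangement -- the point is only that no outer segment carries two blocks of the same inner segment):

import random

def double_layer_trial(segments=175, inner_size=129, outer_checks=127, p=0.6):
    # recovered[i][j]: block j of inner segment i has been retrieved.
    recovered = [[random.random() < p for _ in range(inner_size)]
                 for _ in range(segments)]
    # Illustrative interleaving: block j of inner segment i is carried by
    # outer segment (i + j) % segments.
    for o in range(segments):
        members = [((o - j) % segments, j) for j in range(inner_size)]
        have = sum(recovered[i][j] for i, j in members)
        have += sum(random.random() < p for _ in range(outer_checks))
        if have >= inner_size:  # 129 of 256 blocks: outer decode succeeds
            for i, j in members:
                recovered[i][j] = True
    # Inner segments are 128 data + 1 check, so 128 of 129 is enough.
    return all(sum(row) >= inner_size - 1 for row in recovered)

def success_rate(trials=1000, **kwargs):
    return sum(double_layer_trial(**kwargs) for _ in range(trials)) / trials

At p=0.6 this comes out at or very close to 1.0, in line with the 1000/1000 runs quoted above (it's pure Python, so give it a minute).
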
> Proposal
>
> We should make several changes. The problems and solutions are
> different for very small, medium, and large files.
>
> First, for very tiny files, achieving reasonable success rates simply
> requires more check blocks (on single-segment files, our coding is
> already perfect). I somewhat arbitrarily propose that all splitfiles
> with less than 8 data blocks be given 1 additional check block. So a
> file with 3 data blocks is given 4 check blocks, for 7 total blocks.
> With 7 data blocks, it gets 8 check blocks. At 8 or more we use our
> usual 2x redundancy.
>
> For files of moderate size, we keep our current segmenting system, but
> with modifications. First, the segments should always be the same size
> (some having 1 more data and 1 more check block when the data blocks
> don't divide evenly). Additionally, splitfiles of up to 136 data
> blocks should use only one segment. Splitfiles of up to 266 data
> blocks should be two segments, up to 393 data blocks should be 3
> segments, and up to 520 data blocks should be 4 segments. Beyond that,
> the differences between different split points become rather small;
> simply splitting to limit the maximum segment size to 128 data blocks
> is fine.
>
> For files of 20 segments (80 MiB) or more, we move to the
> double-layered interleaved scheme. I'm still working on the
> interleaving code (it isn't optimal for all numbers of data blocks
> yet). The simple segmenting scheme is better for smaller files, and
> the interleaved scheme for large ones. At 18 segments, the simple
> segmentation still does better. By 20 segments, the interleaved code
> is slightly better. By 25 segments, the difference is approaching a
> 1.5x reduction in failure rates. (Details depend on block success
> rate. I'll post them on the bug report shortly.)

IMHO this is incredibly awesome and we should implement it as soon as possible. I will need evanbd to provide a robust algorithm for the creation of segments, but it should be feasible with relatively minor changes to splitfile encoding/decoding. I don't think we should put this off until after 0.8 because, first, it will only take a few days, and second, it will make a huge difference to larger files.
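
In the meantime, to pin down what I think the proposed sizing rules mean, here is a rough sketch that just transcribes the thresholds from the proposal above (reading the 266/393/520 limits as data-block counts -- this is not the robust algorithm evanbd will provide):

import math

def check_block_count(data_blocks):
    # Tiny files: one extra check block below 8 data blocks;
    # at 8 or more, the usual 2x redundancy.
    return data_blocks + 1 if data_blocks < 8 else data_blocks

def segment_count(data_blocks):
    # Simple (non-interleaved) scheme, for files below the ~20 segment mark.
    for limit, segments in ((136, 1), (266, 2), (393, 3), (520, 4)):
        if data_blocks <= limit:
            return segments
    # Beyond 520 data blocks, just cap segments at 128 data blocks each.
    return math.ceil(data_blocks / 128)

def segment_sizes(data_blocks):
    # Split the data blocks as evenly as possible; the first few segments
    # take one extra data (and check) block when it doesn't divide evenly.
    segments = segment_count(data_blocks)
    base, extra = divmod(data_blocks, segments)
    return [base + 1] * extra + [base] * (segments - extra)

# e.g. segment_sizes(300) -> [100, 100, 100]
#      segment_sizes(650) -> [109, 109, 108, 108, 108, 108]
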
