infinity0 has recently completed nextgens' work to make the native C FEC codec work on 64-bit, with some help from nextgens and sdiz. FEC has always been dramatically faster in native code than in Java (historically 50X!), so we need to deploy this soon. It will now work on 32-bit and 64-bit x86, which covers just about everyone; sdiz believes the code is endian-safe, in which case it should be possible to compile it even on ppc etc. In any case, sdiz has built the binaries for x86-64 on Windows; we also need binaries for OS X (32-bit, and 64-bit if it supports it).
On a Core 2 Duo, infinity0's benchmarks show around 700MB/sec encode or decode (single threaded) for 8-bit codecs (e.g. our current 128/128). 16-bit codecs *should* be around 4 times slower, but this has not yet been tested. Larger segments should increase reliability (vivee: how much?). Assuming that 16-bit codecs achieve around 175MB/sec, this is very tempting...

The parameters:

- Segment size. The number of data blocks in a segment, which is equal to the number of check blocks in a segment. Right now this is 128, the maximum achievable with an 8-bit code at 100% redundancy. Some other applications use higher redundancy while still having a 128 segment size; it is not immediately clear whether this uses an 8-bit code, and when I have tried such things I have had segfaults or other problems.

- Packet size (the "stripe size" below). The size of the slice from each block which we read when decoding or encoding a segment. Currently 4K, although with such small segments we should seriously consider reading whole blocks to speed things up - a full segment is only 8MB...

- Acceptable memory usage. I am not certain what the memory overhead is within the native codec, beyond the passed-in buffers. The passed-in buffers can be allocated as direct byte buffers, meaning there is no duplication between the JVM and the library, but the library itself may allocate additional structures - infinity0? nextgens? (See the direct-buffer sketch below.)

The problems/constraints:

- Memory usage during decode/encode will be segment size * 2 * stripe size, plus any structures allocated by the library. Right now this is tiny (128 * 2 * 4K = 1MB).

- During decode/encode, we will have to read all the stripes. Assuming memory is insufficient to cache them, this could be a very large number of seeks - around segment size * 2 * (32K / stripe size). This will likely dominate over the CPU usage involved! We can avoid seeks on decode by changing the on-disk format, but then we have more seeks not only when finding a block (which might be acceptable), but when decompressing/streaming to the client (which probably isn't).

- If we assume 20ms per seek, a decode takes segment size * 2 * (32K / stripe size) * 0.02 seconds... assuming there is no other disk I/O, which of course there will be; on the other hand, there is a good chance it won't be a full 20ms seek, since the blocks should be close together if we're lucky.

- In db4o, we select a sub-segment or a segment at a time for doing requests. Hence if segments are huge, memory requirements increase, and some database jobs take longer.

The decode rate is then:

  decode rate = (segment size in bytes) / (decode time in seconds)
              = (segment size * 32K) / (segment size * 2 * (32K / stripe size) * 0.02)
              = stripe size / 0.04
              = 25 * stripe size

So if we want a decode rate of at least 100KB/sec (disk-limited), the stripe size must be no less than 4KB; if we want 200KB/sec, it must be no less than 8KB, etc.

Assuming we go for a 4K stripe size, segment size is determined by acceptable memory usage: segment size = memory usage / (2 * stripe size). So 1024 blocks (32MB segments) for 8MB of memory, 2048 blocks (64MB segments) for 16MB, 4096 blocks (128MB segments) for 32MB. 1024 seems reasonable to me: 32MB segments instead of 4MB segments, any lossy-compressed audio track will be a single segment, an ISO will be 22 segments (instead of 350!), segment decodes will take around 5 minutes, and the effects on db4o are not too bad. (The calculator sketch below reproduces these numbers.)
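To make the arithmetic concrete, here is a minimal Java sketch using only the assumptions above (32K blocks, 100% redundancy, a flat 20ms per seek, one seek per stripe read - none of them measured):

    // Back-of-envelope calculator for the seek-dominated decode model above.
    public class FecDecodeEstimate {
        static final int BLOCK_SIZE = 32 * 1024;  // bytes per block
        static final double SEEK_SECONDS = 0.020; // assumed cost of one seek

        public static void main(String[] args) {
            int stripeSize = 4 * 1024;          // bytes read per block per pass
            int memoryBudget = 8 * 1024 * 1024; // acceptable buffer memory

            // segment size = memory usage / (2 * stripe size)
            int segmentBlocks = memoryBudget / (2 * stripeSize); // -> 1024

            // One seek per stripe, over data + check blocks.
            long seeks = (long) segmentBlocks * 2 * (BLOCK_SIZE / stripeSize);
            double decodeSeconds = seeks * SEEK_SECONDS;
            double rate = ((double) segmentBlocks * BLOCK_SIZE) / decodeSeconds;

            System.out.printf("blocks=%d seeks=%d time=%.0fs rate=%.0fB/s%n",
                    segmentBlocks, seeks, decodeSeconds, rate);
            // blocks=1024 seeks=16384 time=328s rate=102400B/s
        }
    }

328 seconds is the "around 5 minutes" figure, and the rate comes out to exactly 25 * stripe size as derived.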
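On the direct byte buffer point: a buffer from ByteBuffer.allocateDirect() lives outside the Java heap, so a JNI codec can read and write it in place with no JVM-side copy. Here is a sketch of the allocation side only - how the buffers actually get handed to the native codec depends on the binding infinity0 and nextgens wrote, so no native call is shown:

    import java.nio.ByteBuffer;

    public class DirectStripeBuffers {
        // One direct buffer per block, sized to one stripe. This accounts
        // only for the passed-in buffers; whatever the native library
        // allocates internally is on top of this.
        public static ByteBuffer[] allocateStripes(int blocks, int stripeSize) {
            ByteBuffer[] bufs = new ByteBuffer[blocks];
            for (int i = 0; i < blocks; i++)
                bufs[i] = ByteBuffer.allocateDirect(stripeSize);
            return bufs;
        }

        public static void main(String[] args) {
            // 1024 data + 1024 check blocks at a 4K stripe: 8MB of buffers.
            ByteBuffer[] data = allocateStripes(1024, 4 * 1024);
            ByteBuffer[] check = allocateStripes(1024, 4 * 1024);
            System.out.println("direct? " + data[0].isDirect() + ", total = "
                    + (data.length + check.length) * 4 + "KB");
        }
    }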
If there is plenty of memory, hopefully the operating system will have cached the blocks we have recently downloaded, and decodes will be much faster than the above. However, IMHO it is important that Freenet work adequately on low-end systems: old recommissioned systems now acting as dedicated servers (geeky, but we have lots of geeks), quasi-embedded stuff such as fanless bittorrent boxes, and so on. Alternatively, we *could* cache all blocks from the segment we are currently downloading, in full, in JVM-controlled memory. This would maximise decode rates, but would use a lot of memory, and limit the possible segment sizes: 1024 blocks is 32MB of data blocks plus 32MB of check blocks, so we would have to increase memory requirements by another 64MB... Thoughts?
