Scott Young wrote:
That's a common misconception. :)
The goal of compression is to conserve disk bandwidth rather than space.
By compressing, it is possible to transfer data (i.e., the uncompressed
data the user works with) at a rate higher than the raw device bandwidth.
I will be doing some research on an algorithm that speeds up data
transfers over a network by adaptively selecting a compression
algorithm. It can be applied to filesystem reads and writes too. When
the send queue is reasonably full on the server, it starts compressing
data at the tail of the queue while sending the data at the head of the
queue. If the output stream catches up to the segment currently being
compressed, then that segment is sent uncompressed. If the compressed
data is not significantly smaller, then the uncompressed data is sent
instead. For network applications that are not network interface bound
(like rsync over a 100mbit connection), the buffer will be empty most of
the time and therefore little compression would be needed or wanted as
it would only slow the application down. Compression is chosen from a
pool of algorithms and varied depending on the history of buffer
overflows and under-runs. Slower, better compression algorithms are
used when the buffer is mostly full and the compression is observably
effective. The idea here is to minimize the time between the client
requesting the data and having usable data available. This can be
seen as a time-versus-amount-of-usable-data-on-client
graph, and some applications prefer a low latency for the initial stream
of data (such as a web page) whereas others prefer to minimize the total time to retrieve a
very large piece of data (such as scp [EMAIL PROTECTED]/SomeBigDocument.sxw
/home/scott over a 56k modem).
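The queue logic described above can be sketched roughly as follows. This is my illustrative sketch, not Scott's implementation: all names are made up, the "significantly smaller" threshold is an assumption, and a real version would run the compressor in its own thread working backwards from the tail while the sender drains the head.

```python
import zlib

# Sketch of the tail-compression idea from the post (hypothetical names).
# Segments wait in a send queue; while the head is transmitted, segments
# behind it get compressed. A segment goes out compressed only if the
# sender has not yet caught up to it AND compression was worthwhile.

SIGNIFICANT = 0.9  # assumed: output must be < 90% of original size

def prepare_segment(raw: bytes) -> tuple[bytes, bool]:
    """Compress one queued segment; fall back to raw if the win is small."""
    packed = zlib.compress(raw, level=1)  # fast level while draining
    if len(packed) < SIGNIFICANT * len(raw):
        return packed, True
    return raw, False

def drain_queue(queue: list[bytes]) -> list[tuple[bytes, bool]]:
    """Model the decision each segment faces when its send turn arrives.
    The head segment is the one the output stream has caught up to, so
    it is sent uncompressed; later segments had time to be compressed."""
    out = []
    for i, seg in enumerate(queue):
        if i == 0:
            out.append((seg, False))  # sender is waiting: send raw
        else:
            out.append(prepare_segment(seg))
    return out

segments = [b"x" * 4096, b"y" * 4096, bytes(range(256)) * 16]
for payload, compressed in drain_queue(segments):
    print(len(payload), "compressed" if compressed else "raw")
```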
Adapting this to filesystem concepts, the server can be seen as the
write process and the client can be seen as the read process.
I don't understand. Why not view the client as the disk drive and the
bus as the network?
The idea
can be applied to Reiser4 by compressing the overwrite set while the
journal data is being written, and then compressing the tail of the
relocate set moving backwards until the write stream catches up to the
compression. It could also take into account the estimated
decompression time when reading the data back, and use that estimate when deciding
whether the compression ratio is good enough to write the compressed
data instead of the uncompressed data.
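That write-path decision could be modeled with a small cost comparison, something like the sketch below. The bandwidth figure and the measure-by-decompressing shortcut are my assumptions for illustration; a filesystem would presumably use a cheaper per-algorithm estimate rather than actually decompressing on the write path.

```python
import time
import zlib

# Hypothetical cost model: store the compressed block only when reading
# it back (smaller read plus a decompression pass) is expected to beat
# reading the raw block straight from disk.

DISK_BYTES_PER_SEC = 50e6  # assumed sustained read bandwidth

def should_store_compressed(raw: bytes) -> bool:
    packed = zlib.compress(raw, level=6)
    # Crude proxy for the estimate: time one real decompression.
    t0 = time.perf_counter()
    zlib.decompress(packed)
    decompress_time = time.perf_counter() - t0
    raw_read_time = len(raw) / DISK_BYTES_PER_SEC
    packed_read_time = len(packed) / DISK_BYTES_PER_SEC + decompress_time
    return len(packed) < len(raw) and packed_read_time < raw_read_time
```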
I didn't understand the above.
Another interesting twist would be to cache the compressed data if the
same data is going to be sent from the server several times. This
reduces CPU overhead on the server (and possibly its memory
requirements for caching the data), and reduces the amount of data
that needs to be read from the drive, but it is complicated in the context
of a network algorithm and is mostly application-dependent. This is
research for another day, maybe in the form of a derived-data plugin for
ReiserFS where an application tells the filesystem how to construct the
file, and the filesystem can store the original, the result, or both,
depending on space needs and performance analysis, with copy-on-write
metadata flags when appropriate.
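The compressed-cache part of this idea is simple enough to sketch, even if the derived-data plugin is research for another day. A minimal, hypothetical version keyed on a content hash, so the compression work (and the disk read behind it) is paid once per distinct payload:

```python
import hashlib
import zlib

# Illustrative only: cache the compressed form of data that is served
# repeatedly. Keying on a content hash means identical payloads share
# one compressed copy. Eviction and memory limits are ignored here.

class CompressedCache:
    def __init__(self) -> None:
        self._cache: dict[bytes, bytes] = {}

    def get_compressed(self, raw: bytes) -> bytes:
        key = hashlib.sha1(raw).digest()
        packed = self._cache.get(key)
        if packed is None:
            packed = zlib.compress(raw, level=9)  # pay once, reuse later
            self._cache[key] = packed
        return packed
```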
I didn't understand the above.
I haven't started coding the adaptive compression algorithm yet, but I
have a general idea about how I am going to implement it. For the
proof-of-concept, I want to write this using sockets and some basic
library compression algorithms (gzip, bzip2, and maybe a simple MTF +
Adaptive Huffman). Later variants may work with TCP or other protocols
around that layer. Any suggestions will be appreciated.
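One way the proof-of-concept's algorithm pool might look, using the stdlib zlib and bz2 modules (this is my sketch, not Scott's code; the fullness thresholds are invented): an empty buffer means the link is keeping up, so send raw; a fuller buffer buys time for slower, stronger compressors such as bzip2.

```python
import bz2
import zlib

def pick_compressor(buffer_fullness: float):
    """Choose a compressor by the fraction (0..1) of the send buffer in use.
    Thresholds are illustrative; the real algorithm would adapt them from
    the history of buffer overflows and under-runs."""
    if buffer_fullness < 0.25:
        return None                                 # link keeps up: raw
    if buffer_fullness < 0.75:
        return lambda d: zlib.compress(d, level=1)  # fast, gzip-style
    return lambda d: bz2.compress(d, 9)             # slow but strong

def send(data: bytes, buffer_fullness: float) -> bytes:
    comp = pick_compressor(buffer_fullness)
    if comp is None:
        return data
    packed = comp(data)
    # If compression did not shrink the data, send it uncompressed.
    return packed if len(packed) < len(data) else data
```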
I think we need to use adaptive compression in Reiser4, based on the
type of file being compressed, and anyone who finds it interesting to
develop heuristics for selecting compression strategies is welcome to
help and join the fun.
Scott Young
--
Hans