On Wed, Jan 22, 2014 at 12:55:39PM -0800, Roger Binns wrote:
> There was the theoretical side - ie coming up with a way of defining
> perfection which then allows measuring against it.  For example you are
> going up to a 128K block size, but without knowing the theoretical best we
> don't know if that is a stopgap or very good.

I got some inspiration from
http://web.archive.org/web/20060511214233/http://www.namesys.com/compression.txt
where a metric is proposed along with some analysis of cluster sizes.

It is referenced from
http://web.archive.org/web/20070403001821/http://www.namesys.com/cryptcompress_design.html
which has the related filesystem details.

'Theoretical best' seems too vaguely defined; with compression it's
always some trade-off and compromise (compression/decompression speed
vs. ratio), and once filesystem requirements like reasonable read and
write latency/throughput are counted in, it does not get any easier.

The main categories I'm targeting are 'real-time, good on average for
random read-write', and 'slow, high compression ratio, expected
streaming reads/writes'.

I'm trying to evaluate the effects of the changes with respect to the
filesystem, e.g. when a larger chunk improves the ratio from 80% to
70% for some type of data set, what the reduction in block count is
and whether it's worth it.
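
As a rough illustration of that math (a sketch only; the 128K chunk,
4K block size and the two ratios are just the example numbers from
above):

#include <stdio.h>

int main(void)
{
    const unsigned chunk = 128 * 1024;  /* compression chunk size */
    const unsigned block = 4096;        /* disk block size */
    const double ratios[] = { 0.80, 0.70 };

    for (int i = 0; i < 2; i++) {
        unsigned bytes = (unsigned)(chunk * ratios[i]);
        /* round up to whole disk blocks */
        unsigned blocks = (bytes + block - 1) / block;

        printf("ratio %.0f%%: %u bytes -> %u blocks per chunk\n",
               100 * ratios[i], bytes, blocks);
    }
    return 0;
}

With those example numbers it's 26 vs. 23 blocks per 128K chunk, i.e.
12K saved, which can then be weighed against the extra CPU cost.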

If the interface is flexible enough, it gives the user a chance to
experiment with the options and make the choice.

> That also feeds into things like if it would be a good idea to go back
> afterwards (perhaps as part of defrag) and spend more effort on
> (re)compression.

This is a bit different use case: defrag is triggered by the user at a
time when they know the resources are available, and usually on known
files whose potential compressibility they can judge. Moreover, the
userspace defrag process can do a preliminary analysis of the files,
e.g. guess the MIME type, or take random samples of the file data and
estimate compressibility.
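
A minimal sketch of such a userspace pre-check (illustration only; the
4K sample size, the sample count and the use of zlib are arbitrary
choices here, not what any existing defrag tool does):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/stat.h>
#include <zlib.h>

#define SAMPLE_SIZE  4096
#define SAMPLE_COUNT 8

int main(int argc, char **argv)
{
    unsigned char in[SAMPLE_SIZE], out[SAMPLE_SIZE * 2];
    unsigned long in_total = 0, out_total = 0;
    struct stat st;
    int fd, i;

    if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) < 0)
        return 1;

    srand(time(NULL));
    for (i = 0; i < SAMPLE_COUNT; i++) {
        /* pick a random 4K sample of the file */
        off_t off = st.st_size > SAMPLE_SIZE ?
                    (off_t)(rand() % (st.st_size - SAMPLE_SIZE)) : 0;
        ssize_t len = pread(fd, in, SAMPLE_SIZE, off);
        uLongf clen = sizeof(out);

        if (len <= 0)
            break;
        /* fast zlib level, we only want a rough estimate */
        if (compress2(out, &clen, in, len, Z_BEST_SPEED) != Z_OK)
            continue;
        in_total += len;
        out_total += clen;
    }
    close(fd);

    if (!in_total)
        return 1;
    printf("estimated ratio: %.0f%%\n", 100.0 * out_total / in_total);
    return 0;
}

A defrag tool could then skip recompression of files whose estimated
ratio stays above some threshold.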

> Another consideration is perhaps having the compression dictionary kept
> separate from the compressed blocks thereby allowing it to be used across
> blocks and potentially files.  Compressors like smaz (very good on short
> pieces of text) work by having a precomputed dictionary - perhaps those
> can be used too.

Keeping the dictionary separate implies more data to be read/written,
and with small chunks there's a low chance of actual dictionary reuse
by other files. Also, thinking about the implementation, it would
become too complex to do in the kernel for this particular use case.
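
For reference, this is roughly what a shared precomputed dictionary
buys you; a userspace-only sketch using zlib's preset dictionary
support (the dictionary and input strings are made up for the
example):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* compress 'in' with an optional preset dictionary, return compressed size */
static size_t deflate_len(const char *in, const char *dict)
{
    unsigned char out[512];
    z_stream s = {0};

    deflateInit(&s, Z_BEST_COMPRESSION);
    if (dict)
        deflateSetDictionary(&s, (const unsigned char *)dict, strlen(dict));

    s.next_in = (unsigned char *)in;
    s.avail_in = strlen(in);
    s.next_out = out;
    s.avail_out = sizeof(out);
    deflate(&s, Z_FINISH);
    deflateEnd(&s);

    return s.total_out;
}

int main(void)
{
    /* hypothetical dictionary of strings common to many small files */
    const char *dict = "Content-Type: text/plain; charset=utf-8\r\n";
    const char *msg  = "Content-Type: text/plain; charset=utf-8\r\nHello";

    printf("without dictionary: %zu bytes\n", deflate_len(msg, NULL));
    printf("with dictionary:    %zu bytes\n", deflate_len(msg, dict));
    return 0;
}

The decompression side would need the same dictionary again (via
inflateSetDictionary), which is exactly the extra state and on-disk
format change that makes it unattractive in the kernel.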

It's theoretically interesting; there's a paper about that:

"To Zip or Not to Zip: Effective Resource Usage for Real-Time
Compression"
https://www.usenix.org/conference/fast13/technical-sessions/presentation/harnik

but I haven't found it feasible to push the implementation into the
kernel, nor worth the required disk format changes.