There is a lot of overhead in the serialized data itself (just have a look at 
an sstable file).

It would be great to be able to compress at the byte array level rather than 
at the string level.

Regards,
Terje

On 1 Feb 2011, at 03:15, "David G. Boney" <dbon...@semanticartifacts.com> wrote:

> In Cassandra, strings are stored as UTF-8. In arithmetic coding compression, 
> the modeling is separate from the coding. A standard arrangement is to have 
> an order-0 model (frequencies of individual bytes), an order-1 model 
> (frequencies of two-byte sequences), and an order-2 model (frequencies of 
> three-byte sequences). For multibyte Unicode encodings, which almost all fall 
> in the two- or three-byte range, the higher-order models should capture the 
> statistics. Having said that, there is nothing to prevent someone from 
> developing a UTF-8-specific model that captures the statistics better.
> 
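> As a rough sketch of that arrangement (illustrative Java, not Cassandra 
> code; all names are made up), the order-0 and order-1 models are just 
> frequency tables keyed by zero or one bytes of context:
> 
>     // Order-0 and order-1 byte-frequency models as described above.
>     final class ByteContextModels {
>         private final int[] order0 = new int[256];        // P(byte)
>         private final int[][] order1 = new int[256][256]; // P(byte | previous byte)
> 
>         void update(byte[] data) {
>             int prev = -1;
>             for (byte b : data) {
>                 int cur = b & 0xFF;        // unsigned byte value, 0..255
>                 order0[cur]++;
>                 if (prev >= 0) {
>                     order1[prev][cur]++;   // count the (prev, cur) pair
>                 }
>                 prev = cur;
>             }
>         }
>     }
> 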
> An adaptive arithmetic coding model starts with all symbol probabilities 
> equal, which means every symbol initially costs 8 bits to encode. The model 
> adapts with each symbol it sees, but it takes some data before compression 
> kicks in. I do not know, and have not seen reports in the literature, of how 
> many bytes it takes on average before compression appears; all the papers I 
> have seen study large blocks of text. 
> 
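> To make the start-up cost concrete, here is a minimal sketch of an adaptive 
> order-0 model (again illustrative Java, not Cassandra code): every byte 
> starts with count 1, so the first symbols cost a full 8 bits, and the 
> estimated cost falls only as the counts adapt:
> 
>     // Adaptive order-0 model: uniform start, updated after each byte.
>     final class AdaptiveOrder0Model {
>         private final int[] counts = new int[256];
>         private int total = 256;
> 
>         AdaptiveOrder0Model() {
>             java.util.Arrays.fill(counts, 1); // p = 1/256 each: 8 bits/symbol
>         }
> 
>         // Estimated bits an arithmetic coder would spend on this byte;
>         // the model is then updated (the decoder performs the identical
>         // update, so the two sides stay in sync).
>         double costAndUpdate(byte b) {
>             int sym = b & 0xFF;
>             double bits = -(Math.log((double) counts[sym] / total) / Math.log(2));
>             counts[sym]++;
>             total++;
>             return bits;
>         }
>     }
> 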
> My assumption is that the majority of strings to compress in Cassandra will 
> be short strings, less than 32 bytes, to medium-length strings, less than 256 
> bytes. An example of short strings might be column names; an example of 
> medium-length strings would be URLs. My assumption is that most short strings 
> would not get much compression from adaptive arithmetic coding, but 
> medium-length to longer strings would. In an effort to get higher compression 
> on the short strings, static encoding methods could be used. A static 
> encoding method uses a hard-coded frequency table for the bytes. The start of 
> the compressed string could carry a 2-bit code:
> 
> 00 - no compression
> 01 - static, default probability table
> 10 - static, locale probability table
> 11 - adaptive coding
> 
> The default static coding would be based on English. For the static locale 
> probability table, the 2-bit code would need to be followed by a code 
> indicating which probability table to use for a particular language. The 
> locale probability tables would be stored in the compressor jar file as a 
> separate class for each locale supported (this needs more analysis; what I 
> don't think is effective is to load the probability tables from disk each 
> time a new compressor is created). During the encoding phase, both adaptive 
> and static coding of the string would occur. In general, compression with a 
> static coding table is very fast: static coding is basically a lookup into a 
> table with 256 entries. It is the adaptive coding that is more 
> computationally intensive. Furthermore, you could place a limit on the use of 
> static coding, say strings less than 256 bytes; this would need to be set 
> empirically. The shorter string from the two methods is used as the 
> compressed string. There is no additional burden on the decoding side, since 
> the header code tells you which compression encoding was used.
> 
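> A minimal sketch of that encode-side selection (illustrative Java; the 
> Codec interface is assumed, the whole-byte header is a simplification of 
> the 2-bit code above, and the locale case, which would also carry a table 
> id, is omitted):
> 
>     interface Codec { byte[] encode(byte[] raw); }
> 
>     final class CompressedStringEncoder {
>         // Try both coders, keep the shortest result, and tag it so the
>         // decoder knows which method was used.
>         static byte[] compress(byte[] raw, Codec staticDefault, Codec adaptive) {
>             byte[] s = staticDefault.encode(raw);
>             byte[] a = adaptive.encode(raw);
>             byte tag;
>             byte[] best;
>             if (raw.length <= s.length && raw.length <= a.length) {
>                 tag = 0; best = raw;  // 00 - compression did not help
>             } else if (s.length <= a.length) {
>                 tag = 1; best = s;    // 01 - static, default table
>             } else {
>                 tag = 3; best = a;    // 11 - adaptive coding
>             }
>             byte[] out = new byte[best.length + 1];
>             out[0] = tag;             // decoder reads the tag first
>             System.arraycopy(best, 0, out, 1, best.length);
>             return out;
>         }
>     }
> 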
> It might be the case that block compressors can achieve greater compression 
> ratios. From what I have read on the mailing list and JIRA, it appears that 
> there is a lot of pain involved in implementing block-based compressors. This 
> method, a compressed string data type, is presented as a method that is 
> minimally invasive to the Cassandra architecture. Since everything is already 
> stored as a byte array, nothing would need to be changed in the internal data 
> types. Furthermore, no surgery of the internal tables is needed. The main 
> addition is the development of comparators, for keys and column headings, 
> that know how to decompress the byte arrays before comparing them. There is 
> also the additional work of writing the codec, but there are a number of 
> examples in C and Java from which to draw. Moffat's web site would be a good 
> place to start.
> 
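> A minimal sketch of such a comparator (illustrative Java; decompress is 
> whatever decoder matches the codec, and none of this is existing Cassandra 
> API):
> 
>     import java.util.Comparator;
>     import java.util.function.UnaryOperator;
> 
>     final class CompressedStringComparator implements Comparator<byte[]> {
>         private final UnaryOperator<byte[]> decompress; // codec's decode step
> 
>         CompressedStringComparator(UnaryOperator<byte[]> decompress) {
>             this.decompress = decompress;
>         }
> 
>         @Override
>         public int compare(byte[] a, byte[] b) {
>             byte[] l = decompress.apply(a); // restore the original UTF-8 bytes
>             byte[] r = decompress.apply(b);
>             // Unsigned lexicographic comparison of UTF-8 bytes matches
>             // code point order, so the string ordering is preserved.
>             int n = Math.min(l.length, r.length);
>             for (int i = 0; i < n; i++) {
>                 int c = (l[i] & 0xFF) - (r[i] & 0xFF);
>                 if (c != 0) return c;
>             }
>             return Integer.compare(l.length, r.length);
>         }
>     }
> 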
> -------------
> Sincerely,
> David G. Boney
> dbon...@semanticartifacts.com
> http://www.semanticartifacts.com
> 
> 
> 
> 
> On Jan 31, 2011, at 2:19 AM, Stephen Connolly wrote:
> 
>> On 31 January 2011 04:41, David G. Boney <dbon...@semanticartifacts.com> 
>> wrote:
>>> I propose a simple idea for compression using a compressed string datatype.
>>> 
>>> The compressed string datatype could be implemented for column family keys 
>>> by creating a compressed string ordered partitioner. The compressed string 
>>> ordered partitioner works by decompressing the string and then applying an 
>>> ordered partitioner for strings to the decompressed string. The hash-based 
>>> partitioner would be used with the compressed string without any 
>>> modification.
>>> 
>>> The compressed string datatype could be implemented for column keys by 
>>> creating a compressed string comparator. A compressed string comparator 
>>> would work by decompressing the string and then applying a string 
>>> comparator.
>>> 
>>> The compressed string datatype could also be implemented for column values. 
>>> The compressed string datatype would be an internal datatype for Cassandra; 
>>> the compressed string would be converted to a string before returning the 
>>> value to a client. I suppose you could also have an option of passing the 
>>> compressed form back to the client if the client wanted it that way.
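>>> 
>>> A minimal sketch of that partitioner wrapping (illustrative Java; the 
>>> token type and the delegate stand in for whatever the real ordered 
>>> partitioner uses, not the actual Cassandra partitioner API):
>>> 
>>>     import java.util.function.Function;
>>> 
>>>     final class CompressedKeyOrderedPartitioner<T> {
>>>         private final Function<byte[], byte[]> decompress;
>>>         private final Function<byte[], T> delegate; // plain key -> token
>>> 
>>>         CompressedKeyOrderedPartitioner(Function<byte[], byte[]> decompress,
>>>                                         Function<byte[], T> delegate) {
>>>             this.decompress = decompress;
>>>             this.delegate = delegate;
>>>         }
>>> 
>>>         T token(byte[] compressedKey) {
>>>             // Decompress first so tokens sort by the plain string;
>>>             // compressed-byte order is not the same as string order.
>>>             return delegate.apply(decompress.apply(compressedKey));
>>>         }
>>>     }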
>>> 
>>> I propose using an adaptive arithmetic coding compressor. This type of 
>>> compression can be done a byte at a time. It will build a compression model 
>>> only on the string that is presented, a byte at a time. See the below 
>>> papers.
>>> 
>>> Moffat, Alistair, Radford M. Neal, & Ian H. Witten, (1998), "Arithmetic 
>>> Coding Revisited", ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294.
>>> 
>>> Witten, Ian H., Radford M. Neal, & John G. Cleary, (1987), "Arithmetic 
>>> Coding for Data Compression", Communications of the ACM, Vol. 30, No. 6, 
>>> pp. 520-540.
>>> 
>>> It has been reported that arithmetic coding based compression applied to 
>>> text can get compression ratios of up to 2.2 bits per character. Assuming 
>>> you only get 4 bits per character because of short strings, that would 
>>> still be a 50% compression of text data, including keys and column names. 
>>> Many applications would benefit. It should speed up the overall operation 
>>> of Cassandra because you would be moving significantly less data through 
>>> the system.
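>>> 
>>> (Checking that arithmetic: ASCII text stored as UTF-8 is 8 bits per 
>>> character uncompressed, so 4 bits per character is 4/8 = 50% of the 
>>> original size; the reported 2.2 bits per character would be 2.2/8, 
>>> roughly 28%.)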
>> 
>> I have not read those papers, but how do they and their reported 
>> compression figures apply to the Unicode characters that Java strings are 
>> stored as?
>> 
>> -Stephen
>> 
>>> 
>>> This would provide a compression option that could be implemented without 
>>> any redesign to the internal structure of Cassandra except for a new 
>>> partitioner class, a new comparator class, a new datatype class, and the 
>>> compression class.
>>> -------------
>>> Sincerely,
>>> David G. Boney
>>> dbon...@semanticartifacts.com
>>> http://www.semanticartifacts.com
>>> 
>>> 
>>> 
>>> 
>>> 
> 
