> on a node with 300M rows (a small node), there will be 585937 index sample
> entries with 512 sampling. Let's say 100 bytes per entry; that will be 585
> MB, and bloom filters are 884 MB. With the default sampling of 128, sampled
> entries will use the majority of node memory. Index sampling should be
> reworked like bloom filters to avoid allocating one large array per
> sstable. Hadoop MapFile uses a sampling of 128 by default too, and it reads
> the entire MapFile index into memory.
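
For reference, a minimal sketch of that back-of-the-envelope arithmetic (the
300M rows, the sampling intervals and the 100 bytes/entry figure are just the
assumptions from the quote; the real per-entry footprint depends heavily on
key size and Java object overhead, so treat the totals as illustrative only):

    // Rough estimate of index sample memory; all figures are the quoted
    // assumptions, not measured values.
    public class IndexSampleEstimate {
        public static void main(String[] args) {
            long rows = 300000000L;      // rows on the node (quoted assumption)
            long bytesPerEntry = 100;    // assumed per-entry footprint

            for (int interval : new int[] {512, 128}) {
                long entries = rows / interval;
                long megabytes = entries * bytesPerEntry / (1024 * 1024);
                System.out.println("interval=" + interval + " -> "
                        + entries + " entries, ~" + megabytes + " MB");
            }
        }
    }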

The index summary does have an ArrayList which is backed by an array
that could become large; however, larger than that array (which costs
one object reference per sample, or one to two taking the internal
growth of the array list into account) will be the overhead of the
objects the array points to, which are regular Java objects. This is
also why it is non-trivial to report on the data size.
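
To illustrate, here is a simplified sketch (not the actual IndexSummary
code) of where the memory goes: the ArrayList's backing array only holds
one reference per sample, while each sample is a full Java object with its
own header, key bytes and position, and that per-object overhead is what
dominates:

    import java.util.ArrayList;
    import java.util.List;

    // Simplified, hypothetical index-summary-like structure, only meant
    // to show how per-sample object overhead compares to the backing array.
    public class SummarySketch {
        // Each sample is a regular Java object: object header, a reference
        // to the key array, the key bytes (with their own array header),
        // and the 8-byte position.
        static final class Sample {
            final byte[] key;
            final long position;
            Sample(byte[] key, long position) {
                this.key = key;
                this.position = position;
            }
        }

        // The ArrayList's backing array costs one reference per sample
        // (times 1-2 for growth slack); the Sample objects it points to
        // are where most of the memory actually goes.
        final List<Sample> samples = new ArrayList<Sample>();

        // Very rough per-entry estimate on a 64-bit JVM without compressed
        // oops; the exact numbers depend on the JVM and the key size, which
        // is part of why reporting an accurate data size is non-trivial.
        static long roughBytesPerSample(int keyLength) {
            long ref = 8;                    // slot in the backing array
            long sampleObj = 16 + 8 + 8;     // header + key ref + position
            long keyArray = 16 + keyLength;  // array header + key bytes
            return ref + sampleObj + keyArray;
        }

        public static void main(String[] args) {
            System.out.println("~" + roughBytesPerSample(32)
                    + " bytes per sample for a 32-byte key");
        }
    }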

> it should be clearly documented in
> http://wiki.apache.org/cassandra/LargeDataSetConsiderations that bloom
> filters + index sampling will be responsible for most of the memory used
> by a node. Caching itself has minimal use on a large data set used for OLAP.

I added some information at the end.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
