Bloom filters are used to avoid disk seeks when accessing sstables. Since we
don't know exactly which sstable the partition resides in, we have to narrow
the search down to the particular sstables where the data most probably is.
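
As a minimal sketch of the idea (names are hypothetical, not Cassandra's
actual internals):

    def sstables_to_read(partition_key, sstables):
        # A bloom filter answers "definitely absent" or "possibly
        # present", so a disk seek is only paid for sstables in the
        # second group.
        return [s for s in sstables
                if s.bloom_filter.might_contain(partition_key)]

With bloom_filter_fp_chance = 0.01, roughly 1 in 100 negative lookups will
still touch an sstable that doesn't hold the partition.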

Given that you most likely won't store 50 billion rows on a single node,
you will presumably be running a larger cluster.

I would start by writing data (possibly test payloads) rather than tweaking
params. More often than not, such optimizations can have the reverse effect.

Better data modeling is extremely important, though: picking the right
partition key and laying out clustering keys well will help your queries run
in the most performant way.
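
As a hedged illustration, loosely based on the dedup use case quoted below
(the table and names are made up; the calls are from the DataStax Python
driver):

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_ks')

    # Partition on the value being checked so that every duplicate
    # lookup is a cheap single-partition read; cluster by source_id
    # to keep rows within a partition ordered.
    session.execute("""
        CREATE TABLE IF NOT EXISTS dedup_check (
            check_value text,
            source_id   int,
            created_at  timestamp,
            PRIMARY KEY ((check_value), source_id)
        )
    """)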

With regard to compaction strategy, you may want to read something like
https://www.instaclustr.com/blog/2016/01/27/apache-cassandra-compaction/
where several compaction strategies are compared and explained in detail.
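
For instance, switching a read-heavy table to leveled compaction is a single
schema change (same hypothetical table and driver setup as above):

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_ks')
    session.execute("""
        ALTER TABLE dedup_check
        WITH compaction = {'class': 'LeveledCompactionStrategy',
                           'sstable_size_in_mb': '160'}
    """)

LCS trades more compaction I/O on the write path for fewer sstables consulted
per read, which is why it tends to suit read-heavy tables.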

Start with the defaults, tweak the params that make sense for your read/write
workloads under realistic stress-test payloads, measure the results, and make
decisions from there. If there were a particular setting that suddenly made
Cassandra much better and faster, it would already be on by default.
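
cassandra-stress is the usual tool for this, but even a crude single-threaded
sketch with the Python driver (placeholder names and payloads) can give a
first impression:

    import time
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_ks')
    lookup = session.prepare(
        "SELECT * FROM dedup_check WHERE check_value = ?")

    test_values = ['v1', 'v2', 'v3']   # substitute realistic payloads
    start = time.time()
    for v in test_values:
        session.execute(lookup, (v,))
    print('avg read: %.4f s' % ((time.time() - start) / len(test_values)))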
On Thu, 19 May 2016 at 16:55, Kai Wang <dep...@gmail.com> wrote:

> With 50 bln rows and bloom_filter_fp_chance = 0.01, the bloom filters will
> consume a lot of off-heap memory. You may want to take that into
> consideration too.
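>
> A back-of-the-envelope sketch (assuming one partition per row, which is
> the worst case, and the standard optimal bloom filter sizing formula):
>
>     import math
>
>     rows = 50e9
>     fp_chance = 0.01
>
>     # optimal sizing: m/n = -ln(p) / (ln 2)^2 bits per key
>     bits_per_key = -math.log(fp_chance) / math.log(2) ** 2   # ~9.6
>     print(rows * bits_per_key / 8 / 1e9)                     # ~60 GB
>
> That is cluster-wide and before replication, but it gives a feel for the
> scale involved.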
>
> On Wed, May 18, 2016 at 11:53 PM, Adarsh Kumar <adarsh0...@gmail.com>
> wrote:
>
>> Hi Sai,
>>
>> We have a use case where we are designing a table that is going to have
>> around 50 billion rows, and we require very fast reads. Partitions are not
>> that complex/big; each holds some validation data for duplicate checks
>> (consisting of 4-5 int and varchar columns). So we have been trying various
>> options to optimize read performance. Apart from tuning the bloom filter,
>> we are trying the following things:
>>
>> 1). Better data modelling (making appropriate partition and clustering
>> keys)
>> 2). Trying Leveled compaction (changing data model for this one)
>>
>> Jonathan,
>>
>> I understand that tuning bloom_filter_fp_chance will not yield a drastic
>> performance gain, but it is one of the many things we are trying.
>> Please let me know if you have any other suggestions to improve read
>> performance for this volume of data.
>>
>> Also, please let me know about any performance benchmarking techniques
>> (currently we are planning to trigger massive reads from Spark and check
>> cfstats).
>>
>> NOTE: we will be deploying DSE on EC2, so please let me know if you have
>> any suggestions specific to DSE and EC2.
>>
>> Adarsh
>>
>> On Wed, May 18, 2016 at 9:45 PM, Jonathan Haddad <j...@jonhaddad.com>
>> wrote:
>>
>>> The impact is that the bloom filter will get massively bigger, with very
>>> little performance benefit, if any.
>>>
>>> You can't get 0 because it's a probabilistic data structure.  It tells
>>> you either:
>>>
>>> your data is definitely not here
>>> your data has a pretty decent chance of being here
>>>
>>> but never "it's here for sure"
>>>
>>> https://en.wikipedia.org/wiki/Bloom_filter
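>>>
>>> A toy sketch of those guarantees (hypothetical, pure Python, two hash
>>> positions over a small bit array):
>>>
>>>     import hashlib
>>>
>>>     SIZE = 1024
>>>     bits = [False] * SIZE
>>>
>>>     def positions(key):
>>>         h = hashlib.md5(key.encode()).digest()
>>>         return [int.from_bytes(h[i:i + 4], 'big') % SIZE
>>>                 for i in (0, 4)]
>>>
>>>     def add(key):
>>>         for p in positions(key):
>>>             bits[p] = True
>>>
>>>     def might_contain(key):
>>>         return all(bits[p] for p in positions(key))
>>>
>>> might_contain() may return True for a key that was never added (a false
>>> positive), but never False for one that was added: no false negatives.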
>>>
>>> On Wed, May 18, 2016 at 11:04 AM sai krishnam raju potturi <
>>> pskraj...@gmail.com> wrote:
>>>
>>>> hi Adarsh;
>>>>     were there any drawbacks to setting bloom_filter_fp_chance to
>>>> the default value?
>>>>
>>>> thanks
>>>> Sai
>>>>
>>>> On Wed, May 18, 2016 at 2:21 AM, Adarsh Kumar <adarsh0...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> What is the impact of setting bloom_filter_fp_chance < 0.01?
>>>>>
>>>>> During performance tuning I was trying to tune bloom_filter_fp_chance
>>>>> and have the following questions:
>>>>>
>>>>> 1). Why is bloom_filter_fp_chance = 0 not allowed? (
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-5013)
>>>>> 2). What is the maximum/recommended value of bloom_filter_fp_chance
>>>>> (if we do not have any limitation on bloom filter size)?
>>>>>
>>>>> NOTE: We are using the default SizeTieredCompactionStrategy on
>>>>> Cassandra 2.1.8.621.
>>>>>
>>>>> Thanks in advance..:)
>>>>>
>>>>> Adarsh Kumar
>>>>>
>>>>
>>>>
>>
--
Alex Petrov
