Re: High Bloom Filter FP Ratio

2014-12-19 Thread Chris Hart
Hi Tyler,

I tried what you said and false positives look much more reasonable there.  
Thanks for looking into this.

-Chris


Re: High Bloom Filter FP Ratio

2014-12-19 Thread Tyler Hobbs
I took a look at the code where the bloom filter true/false positive
counters are updated and noticed that the true-positive count isn't being
updated on key cache hits:
https://issues.apache.org/jira/browse/CASSANDRA-8525.  That may explain
your ratios.
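To see why a stalled true-positive counter matters, assume the reported ratio is falsePositives / (falsePositives + truePositives). The sketch below is back-of-the-envelope arithmetic using the counts Mark reported (11,096 false positives against ~2.45M local reads); the formula and the undercounted true-positive value are assumptions for illustration, not the actual Cassandra code:

```python
# Sketch: how a true-positive counter that skips key cache hits inflates
# the reported ratio.  Assumes ratio = fp / (fp + tp); see CASSANDRA-8525.

def false_ratio(false_positives: int, true_positives: int) -> float:
    total = false_positives + true_positives
    return false_positives / total if total else 0.0

FP = 11096  # bloom filter false positives from Mark's cfstats

# If key cache hits never increment the true-positive counter, only the
# rare cache-miss reads land in the denominator and the ratio saturates:
buggy = false_ratio(FP, 90)          # hypothetical undercounted TPs
# Counting every successful read (~2.45M local reads) instead:
fixed = false_ratio(FP, 2_458_290)

print(f"{buggy:.5f} vs {fixed:.5f}")
```

With only ~90 counted true positives the ratio comes out near Mark's observed 0.99, while counting all reads puts it well under the configured 0.001-per-check target's same order of magnitude.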

Can you try querying for a few non-existent partition keys in cqlsh with
tracing enabled (just run "TRACING ON") and see if you really do get that
high of a false-positive ratio?
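Tracing queries for nonexistent keys is effectively an empirical measurement of the real false-positive rate: probe keys that were never inserted and count how often the filter says "maybe present." The toy filter below illustrates that measurement technique; the bit count, hash scheme, and key names are illustrative choices, not Cassandra's implementation:

```python
import hashlib

# Toy bloom filter: insert known keys, then probe keys that were never
# inserted and count filter hits -- the empirical false-positive rate.
M = 1 << 16   # filter size in bits (illustrative)
K = 7         # hash functions (illustrative)

def positions(key: str):
    # Derive K bit positions per key from salted SHA-256 digests.
    for i in range(K):
        h = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M

bits = bytearray(M)
for n in range(5000):                       # insert 5000 "existing" keys
    for pos in positions(f"present-{n}"):
        bits[pos] = 1

# Probe 20000 keys that were never inserted; any hit is a false positive.
hits = sum(
    all(bits[pos] for pos in positions(f"absent-{n}"))
    for n in range(20000)
)
fp_rate = hits / 20000
```

A healthy filter's measured rate should land near its configured chance; a measured rate orders of magnitude below a reported cfstats ratio points at the counters, not the filter.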


Re: High Bloom Filter FP Ratio

2014-12-19 Thread Mark Greene
We're seeing similar behavior except our FP ratio is closer to 1.0 (100%).

We're using Cassandra 2.1.2.


Schema
---
CREATE TABLE contacts.contact (
id bigint,
property_id int,
created_at bigint,
updated_at bigint,
value blob,
PRIMARY KEY (id, property_id)
) WITH CLUSTERING ORDER BY (property_id ASC)
*AND bloom_filter_fp_chance = 0.001*
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy',
'max_threshold': '32'}
AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';

CF Stats Output:
-
Keyspace: contacts
Read Count: 2458375
Read Latency: 0.852844076675 ms.
Write Count: 10357
Write Latency: 0.1816912233272183 ms.
Pending Flushes: 0
Table: contact
SSTable count: 61
SSTables in each level: [1, 10, 50, 0, 0, 0, 0, 0, 0]
Space used (live): 9047112471
Space used (total): 9047112471
Space used by snapshots (total): 0
SSTable Compression Ratio: 0.34119240020241487
Memtable cell count: 24570
Memtable data size: 1299614
Memtable switch count: 2
Local read count: 2458290
Local read latency: 0.853 ms
Local write count: 10044
Local write latency: 0.186 ms
Pending flushes: 0
Bloom filter false positives: 11096
*Bloom filter false ratio: 0.99197*
Bloom filter space used: 3923784
Compacted partition minimum bytes: 373
Compacted partition maximum bytes: 152321
Compacted partition mean bytes: 9938
Average live cells per slice (last five minutes): 37.57851240677983
Maximum live cells per slice (last five minutes): 63.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
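For context on what the schema's fp_chance = 0.001 costs: under the standard bloom filter sizing formula, bits per key is -ln(p) / (ln 2)^2, so tightening from 0.01 to 0.001 raises the footprint from roughly 9.6 to 14.4 bits per partition key. This is textbook math only; Cassandra's filter builder may round or bucket sizes differently:

```python
import math

# Standard bloom filter sizing: m/n = -ln(p) / (ln 2)^2 bits per key.
def bits_per_key(fp_chance: float) -> float:
    return -math.log(fp_chance) / (math.log(2) ** 2)

for p in (0.01, 0.001):
    print(f"fp_chance={p}: {bits_per_key(p):.1f} bits/key")
```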

--
about.me 

On Wed, Dec 17, 2014 at 1:32 PM, Chris Hart  wrote:
>
> Hi,
>
> I have created the following table with bloom_filter_fp_chance=0.01:
>
> CREATE TABLE logged_event (
>   time_key bigint,
>   partition_key_randomizer int,
>   resource_uuid timeuuid,
>   event_json text,
>   event_type text,
>   field_error_list map,
>   javascript_timestamp timestamp,
>   javascript_uuid uuid,
>   page_impression_guid uuid,
>   page_request_guid uuid,
>   server_received_timestamp timestamp,
>   session_id bigint,
>   PRIMARY KEY ((time_key, partition_key_randomizer), resource_uuid)
> ) WITH
>   bloom_filter_fp_chance=0.01 AND
>   caching='KEYS_ONLY' AND
>   comment='' AND
>   dclocal_read_repair_chance=0.00 AND
>   gc_grace_seconds=864000 AND
>   index_interval=128 AND
>   read_repair_chance=0.00 AND
>   replicate_on_write='true' AND
>   populate_io_cache_on_flush='false' AND
>   default_time_to_live=0 AND
>   speculative_retry='99.0PERCENTILE' AND
>   memtable_flush_period_in_ms=0 AND
>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>   compression={'sstable_compression': 'LZ4Compressor'};
>
>
> When I run cfstats, I see a much higher false positive ratio:
>
> Table: logged_event
> SSTable count: 15
> Space used (live), bytes: 104128214227
> Space used (total), bytes: 104129482871
> SSTable Compression Ratio: 0.3295840184239226
> Number of keys (estimate): 199293952
> Memtable cell count: 56364
> Memtable data size, bytes: 20903960
> Memtable switch count: 148
> Local read count: 1396402
> Local read latency: 0.362 ms
> Local write count: 2345306
> Local write latency: 0.062 ms
> Pending tasks: 0
> Bloom filter false positives: 147705
> Bloom filter false ratio: 0.49020
> Bloom filter space used, bytes: 249129040
> Compacted partition minimum bytes: 447
> Compacted partition maximum bytes: 315852
> Compacted partition mean bytes: 1636
> Average live cells per slice (last five minutes): 0.0
> Average tombstones per slice (last five minutes): 0.0
>
> Any idea what could be causing this?  This is time-series data.  Every time
> we read from this table, we read a single row key with 1000
> partition_key_randomizer values.  I'm running Cassandra 2.0.11.
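Even setting aside the counting bug, a 0.49 ratio is far above what fp_chance = 0.01 should yield across Chris's 15 SSTables: if each SSTable's filter independently passes a nonexistent key with probability 0.01, the chance that at least one of the 15 does is only about 14%. This is plain probability math as a sanity bound, not a claim about how Cassandra aggregates the per-filter counters:

```python
# Sanity bound: with p = bloom_filter_fp_chance per SSTable and n
# independent SSTable filters consulted per read, the chance that at
# least one filter false-positives on a nonexistent partition is
# 1 - (1 - p)^n.  Assumes independence and that all n are consulted.
p = 0.01   # configured fp_chance
n = 15     # SSTable count from Chris's cfstats
worst_case = 1 - (1 - p) ** n
print(f"{worst_case:.3f}")   # well below the observed 0.49
```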