Brent Haines created CASSANDRA-10084:
----------------------------------------

             Summary: Very slow performance streaming a large query from a single CF
                 Key: CASSANDRA-10084
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10084
             Project: Cassandra
          Issue Type: Bug
         Environment: Cassandra 2.1.8
12GB EC2 instance
12 node cluster
32 concurrent reads
32 concurrent writes
6GB heap space
            Reporter: Brent Haines
         Attachments: cassandra.yaml

We have a relatively simple column family that we use to track event data from different providers, and we have been using it for some time. Here is what it looks like:

{code}
CREATE TABLE data.stories_by_text (
    ref_id timeuuid,
    second_type text,
    second_value text,
    object_type text,
    field_name text,
    value text,
    story_id timeuuid,
    data map<text, text>,
    PRIMARY KEY ((ref_id, second_type, second_value, object_type, field_name), value, story_id)
) WITH CLUSTERING ORDER BY (value ASC, story_id ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = 'Searchable fields and actions in a story are indexed by ref id which corresponds to a brand, app, app instance, or user.'
    AND compaction = {'min_threshold': '4', 'cold_reads_to_omit': '0.0', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
{code}

Once a day, we pull the complete data for a given index with a query that looks like this:

{code}
select * from stories_by_text
where ref_id = f0124740-2f5a-11e5-a113-03cdf3f3c6dc
  and second_type = 'Day'
  and second_value = '20150812'
  and object_type = 'booshaka:user'
  and field_name = 'hashedEmail';
{code}

In the past, we have been able to pull millions of records out of this CF in a few seconds. We recently added the data column so that we could filter on event data and provide more detailed analysis of activity for our reports. The data map, declared as 'data map<text, text>', is very small: only 2 or 3 name/value pairs per row.
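
For reference, the column was added with an ordinary schema change, something like:

{code}
ALTER TABLE data.stories_by_text ADD data map<text, text>;
{code}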

Since we added this column, our streaming query performance has gone straight to hell. I just ran the above query; it took 46 minutes to read 86K rows and then it timed out.

I am uncertain what other data you need to see in order to diagnose this. We are using STCS and are considering a change to Leveled Compaction. The table is repaired nightly, and the updates, which come at a very fast clip, only touch today's partition key, while the queries are for previous days only.
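
If we do switch, it would presumably be a one-line change like this (a sketch only; we haven't committed to it):

{code}
ALTER TABLE data.stories_by_text
    WITH compaction = {'class': 'LeveledCompactionStrategy'};
{code}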

To my knowledge, these queries never finish anymore. They time out even though I put a 60-second read timeout on the cluster, and I can watch the stream pause for 30 to 50 seconds many times over.

Again, this only started happening when we added the data column.

Please let me know what else you need for this. It is having a very big impact 
on our system.
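
If a trace would help, I can re-run a bounded version of the query with tracing enabled from cqlsh, something like the following (the LIMIT is just to keep the trace manageable):

{code}
TRACING ON;
select * from stories_by_text
where ref_id = f0124740-2f5a-11e5-a113-03cdf3f3c6dc
  and second_type = 'Day'
  and second_value = '20150812'
  and object_type = 'booshaka:user'
  and field_name = 'hashedEmail'
limit 1000;
{code}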


