[
https://issues.apache.org/jira/browse/CASSANDRA-10084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705971#comment-14705971
]
Brent Haines edited comment on CASSANDRA-10084 at 8/21/15 12:42 AM:
--------------------------------------------------------------------
I did a lot of tuning with prefetching, threads per client, and added
multithreading to our query collator. Performance has improved a lot, but it
doesn't come close to what we had before we added the collection to the table.
Right now, I have discovered a query for a specific index value that is
particularly slow: 3 minutes for 10,000 records. At first it stopped after
returning only about 1% of the data, but did not produce any kind of error or
exception. I did a repair on one of the nodes for that partition key and it
seems to be working now, but it is very slow. I have attached stack dumps for
every node involved in the query, though I am not certain which one is doing
the work at any given time.
Stupid question - is there a quick way to see which nodes own the key for a
specific query? I turned trace on and ran the query a bunch of times to get
all three.
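For reference, here is roughly how I check that from cqlsh (just a sketch; the
key values are the ones from the daily query quoted below, and nodetool
getendpoints might be a quicker alternative, though I have not checked how it
handles a composite partition key):
{code}
-- Turn tracing on, run the query, and the trace output lists the coordinator
-- plus every replica contacted for that partition.
TRACING ON;

select * from data.stories_by_text
  where ref_id = f0124740-2f5a-11e5-a113-03cdf3f3c6dc
  and second_type = 'Day'
  and second_value = '20150812'
  and object_type = 'booshaka:user'
  and field_name = 'hashedEmail'
  limit 10;

TRACING OFF;
{code}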
Please see the attached profiles for the 3 nodes.
Also FYI - we run incremental repairs nightly. They usually finish, but
sometimes, in the morning, nodes report *much* more storage than they actually
own. They all own about 60 to 90 GB, but after repair some nodes will say they
own 2+ TB! Restarting reveals that they are way behind on compaction, and it
takes about 2 hours to clear that up. If I run nodetool compactionstats before
restarting, it hangs until it times out.
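For context, the nightly repair and the morning checks are roughly the
following (a sketch; the exact flags in our cron script may differ slightly,
and 'data' is the keyspace from the schema below):
{code}
# Nightly incremental repair (Cassandra 2.1)
nodetool repair -par -inc data

# Next morning: the Load column should show ~60-90 GB per node, but after
# some repairs a few nodes report 2+ TB instead.
nodetool status data

# Pending compactions; run before a restart, this hangs until it times out.
nodetool compactionstats
{code}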
Final question: is upgrading to 2.2 a safe bet for fixing some of these
issues, specifically the halting of compaction during repair?
> Very slow performance streaming a large query from a single CF
> --------------------------------------------------------------
>
> Key: CASSANDRA-10084
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10084
> Project: Cassandra
> Issue Type: Bug
> Environment: Cassandra 2.1.8
> 12GB EC2 instance
> 12 node cluster
> 32 concurrent reads
> 32 concurrent writes
> 6GB heap space
> Reporter: Brent Haines
> Attachments: cassandra.yaml, node1.txt, node2.txt, node3.txt
>
>
> We have a relatively simple column family that we use to track event data
> from different providers. We have been utilizing it for some time. Here is
> what it looks like:
> {code}
> CREATE TABLE data.stories_by_text (
>     ref_id timeuuid,
>     second_type text,
>     second_value text,
>     object_type text,
>     field_name text,
>     value text,
>     story_id timeuuid,
>     data map<text, text>,
>     PRIMARY KEY ((ref_id, second_type, second_value, object_type, field_name), value, story_id)
> ) WITH CLUSTERING ORDER BY (value ASC, story_id ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = 'Searchable fields and actions in a story are indexed by ref id which corresponds to a brand, app, app instance, or user.'
>     AND compaction = {'min_threshold': '4', 'cold_reads_to_omit': '0.0', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99.0PERCENTILE';
> {code}
> We will, on a daily basis, pull the complete data for a given index with a
> query that looks like this:
> {code}
> select * from stories_by_text
>   where ref_id = f0124740-2f5a-11e5-a113-03cdf3f3c6dc
>   and second_type = 'Day'
>   and second_value = '20150812'
>   and object_type = 'booshaka:user'
>   and field_name = 'hashedEmail';
> {code}
> In the past, we have been able to pull millions of records out of the CF in a
> few seconds. We recently added the data column so that we could filter on
> event data and provide more detailed analysis of activity for our reports.
> The data map, declared as 'data map<text, text>', is very small; it holds
> only 2 or 3 name/value pairs.
> Since we added this column, our streaming query performance has gone
> straight to hell. I just ran the above query; it took 46 minutes to read 86K
> rows and then it timed out.
> I am uncertain what other data you need to see in order to diagnose this. We
> are using STCS and are considering a change to Leveled Compaction. The table
> is repaired nightly, and the updates, which come at a very fast clip, only
> touch today's partition key, while the queries are for previous days only.
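> If we do make that switch, I assume it would just be an ALTER along these
> lines (a sketch only; the 160 MB sstable size is a guess on my part, not
> something we have tested):
> {code}
> ALTER TABLE data.stories_by_text
>   WITH compaction = {'class': 'LeveledCompactionStrategy',
>                      'sstable_size_in_mb': '160'};
> {code}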
> To my knowledge, these queries never finish anymore. They time out, even
> though I put a 60 second read timeout on the cluster. I can watch the stream
> pause for 30 to 50 seconds many times during the query.
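> By the 60 second timeout I mean the server-side read timeouts in the
> attached cassandra.yaml; I assume these are the relevant settings (a sketch,
> not a copy of the file):
> {code}
> # cassandra.yaml - server-side read timeouts raised to 60 seconds
> read_request_timeout_in_ms: 60000
> range_request_timeout_in_ms: 60000
> {code}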
> Again, this only started happening when we added the data column.
> Please let me know what else you need for this. It is having a very big
> impact on our system.