[
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392398#comment-14392398
]
Sergey Maznichenko commented on CASSANDRA-9092:
-----------------------------------------------
The Java heap is sized automatically in cassandra-env.sh. I tried to set
MAX_HEAP_SIZE="8G" and HEAP_NEWSIZE="800M", but it didn't help.
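For reference, the override goes near the top of conf/cassandra-env.sh; as far
as I can tell, the script requires both variables to be set as a pair:

# conf/cassandra-env.sh -- override the automatic heap sizing
# (MAX_HEAP_SIZE and HEAP_NEWSIZE have to be set together,
#  otherwise the script complains or falls back to auto-sizing)
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"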
nodetool disableautocompaction didn't help; compactions continue after
restarting the node.
nodetool truncatehints didn't help; it showed a message like 'cannot stop
running hint compaction'.
One of the nodes had ~24000 files in system\hints-...; I stopped that node and
deleted them, which helped, and the node has been running for about 10 hours.
Another node has 18154 files in system\hints-... (~1.1TB) and has the same
problem; I am leaving it for experiments. The cleanup steps are sketched below.
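For anyone hitting the same problem, a rough sketch of the cleanup I did,
assuming the default data directory /var/lib/cassandra/data (adjust the path
to your data_file_directories):

nodetool drain                                   # flush memtables and stop accepting writes
service cassandra stop                           # stop the node
rm -f /var/lib/cassandra/data/system/hints-*/*   # delete the accumulated hint SSTables
service cassandra start                          # bring the node back up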
Workload: 20-40 processes on application servers, each one loading files into
blobs (one big table); each file is about 3.5MB, and the key is a UUID.
CREATE KEYSPACE filespace WITH replication = {'class':
'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '1'} AND durable_writes = true;
CREATE TABLE filespace.filestorage (
    key text,
    filename text,
    chunk int,
    value blob,
    PRIMARY KEY (key, chunk)
) WITH COMPACT STORAGE
    AND CLUSTERING ORDER BY (chunk ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4',
        'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
        'max_threshold': '32'}
    AND compression = {'sstable_compression':
        'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
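For context, each loader process writes a file into one partition as a
sequence of chunk rows, roughly like this (the key and payloads here are made
up):

-- all chunks of one file share the partition key (the file's UUID)
INSERT INTO filespace.filestorage (key, filename, chunk, value)
VALUES ('6f1b0c4e-0000-4000-8000-000000000001', 'file0001.bin', 0, 0xcafebabe);
INSERT INTO filespace.filestorage (key, filename, chunk, value)
VALUES ('6f1b0c4e-0000-4000-8000-000000000001', 'file0001.bin', 1, 0xdeadbeef);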
nodetool status filespace
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.12    4.82 TB  256     28.0%             25cefe6a-a9b1-4b30-839d-46ed5f4736cc  RAC1
UN  10.X.X.13    3.98 TB  256     22.9%             ef439686-1e8f-4b31-9c42-f49ff7a8b537  RAC1
UN  10.X.X.10    4.52 TB  256     26.1%             a11f52a6-1bff-4b47-bfa9-628a55a058dc  RAC1
UN  10.X.X.11    4.01 TB  256     23.1%             0f454fa7-5cdf-45b3-bf2d-729ab7bd9e52  RAC1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load     Tokens  Owns (effective)  Host ID                               Rack
UN  10.X.X.137   4.64 TB  256     22.6%             e184cc42-7cd9-4e2e-bd0d-55a6a62f69dd  RAC1
UN  10.X.X.136   1.25 TB  256     27.2%             c8360341-83e0-4778-b2d4-3966f083151b  RAC1
DN  10.X.X.139   4.81 TB  256     25.8%             1f434cfe-6952-4d41-8fc5-780a18e64963  RAC1
UN  10.X.X.138   3.69 TB  256     24.4%             b7467041-05d9-409f-a59a-438d0a29f6a7  RAC1
I need some workaround to prevent this situation with hints.
Right now we use the default values:
hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 2
max_hint_window_in_ms: 10800000
hinted_handoff_throttle_in_kb: 1024
Should I disable hints, or increase the number of delivery threads and the
throttle?
For example:
hinted_handoff_enabled: 'true'
max_hints_delivery_threads: 20
max_hint_window_in_ms: 108000000
hinted_handoff_throttle_in_kb: 10240
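Two of these knobs can also be flipped at runtime with nodetool, without a
rolling restart (sethintedhandoffthrottlekb should exist from 2.1.1, if I am
not mistaken):

nodetool disablehandoff                      # stop storing new hints on this node
nodetool sethintedhandoffthrottlekb 10240    # raise the delivery throttle (KB/s)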
> Nodes in DC2 die during and after huge write workload
> -----------------------------------------------------
>
> Key: CASSANDRA-9092
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
> Project: Cassandra
> Issue Type: Bug
> Environment: CentOS 6.2 64-bit, Cassandra 2.1.2,
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
> Reporter: Sergey Maznichenko
> Fix For: 2.1.5
>
> Attachments: cassandra_crash1.txt
>
>
> Hello,
> We have Cassandra 2.1.2 with 8 nodes: 4 in DC1 and 4 in DC2.
> Each node is a VM with 8 CPUs and 32GB RAM.
> During a heavy write workload (loading several million blobs, ~3.5MB each),
> one node in DC2 stops, and after some time the next 2 nodes in DC2 also stop.
> Now, 2 of the nodes in DC2 do not work; they stop 5-10 minutes after starting.
> I see many files in the system.hints table, and the error appears 2-3 minutes
> after system.hints auto compaction starts.
> The problem exists only in DC2. We have 1GbE between DC1 and DC2.