[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh McKenzie updated CASSANDRA-13938: -- Bug Category: Parent values: Availability(12983)Level 1 values: Process Crash(12992) > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Nate McCall >Assignee: Aleksey Yeschenko >Priority: Urgent > Fix For: 4.0-alpha > > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message:
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Yeschenko updated CASSANDRA-13938: -- Since Version: 4.0-alpha Source Control Link: [602a5eef177ac65020470cb0fcf8d88d820ab888|https://github.com/apache/cassandra/commit/602a5eef177ac65020470cb0fcf8d88d820ab888] Resolution: Fixed Status: Resolved (was: Ready to Commit) > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Nate McCall >Assignee: Aleksey Yeschenko >Priority: Urgent > Fix For: 4.0-alpha > > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dinesh Joshi updated CASSANDRA-13938: - Status: Ready to Commit (was: Review In Progress) > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Nate McCall >Assignee: Aleksey Yeschenko >Priority: Urgent > Fix For: 4.0-alpha > > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message: > [2017-10-05 14:32:07,048] null >
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dinesh Joshi updated CASSANDRA-13938: - Reviewers: Dinesh Joshi, Dinesh Joshi (was: Dinesh Joshi) Dinesh Joshi, Dinesh Joshi (was: Dinesh Joshi) Status: Review In Progress (was: Patch Available) > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Nate McCall >Assignee: Aleksey Yeschenko >Priority: Urgent > Fix For: 4.0-alpha > > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null >
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-13938: Authors: (was: Alex Petrov) > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Nate McCall >Assignee: Alex Petrov >Priority: Urgent > Fix For: 4.0-alpha > > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message: > [2017-10-05 14:32:07,048] null > at
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-13938: - Fix Version/s: (was: 4.0) 4.0-alpha > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Nate McCall >Assignee: Jason Brown >Priority: Urgent > Fix For: 4.0-alpha > > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message: > [2017-10-05 14:32:07,048] null > at
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Lynch updated CASSANDRA-13938: - Fix Version/s: (was: 4.0) > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Nate McCall >Assignee: Jason Brown >Priority: Urgent > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message: > [2017-10-05 14:32:07,048] null > at
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Lynch updated CASSANDRA-13938: - Fix Version/s: 4.0 > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Nate McCall >Assignee: Jason Brown >Priority: Urgent > Fix For: 4.0 > > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message: > [2017-10-05 14:32:07,048] null > at
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Lynch updated CASSANDRA-13938: - Fix Version/s: (was: 4.x) 4.0 > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Nate McCall >Assignee: Jason Brown >Priority: Urgent > Fix For: 4.0 > > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message: > [2017-10-05 14:32:07,048] null > at
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Shuler updated CASSANDRA-13938: --- Fix Version/s: (was: 4.0.x) 4.x > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Nate McCall >Assignee: Jason Brown >Priority: Critical > Fix For: 4.x > > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message: > [2017-10-05 14:32:07,048] null > at
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dinesh Joshi updated CASSANDRA-13938: - Reviewer: Dinesh Joshi > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Nate McCall >Assignee: Jason Brown >Priority: Critical > Fix For: 4.0.x > > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message: > [2017-10-05 14:32:07,048] null > at
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown updated CASSANDRA-13938: Fix Version/s: 4.0.x > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Nate McCall >Assignee: Jason Brown >Priority: Critical > Fix For: 4.0.x > > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message: > [2017-10-05 14:32:07,048] null > at
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown updated CASSANDRA-13938: Status: Patch Available (was: Open) > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Nate McCall >Assignee: Jason Brown >Priority: Critical > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message: > [2017-10-05 14:32:07,048] null > at
[jira] [Updated] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown updated CASSANDRA-13938: Attachment: test.sh 13938.yaml > Default repair is broken, crashes other nodes participating in repair (in > trunk) > > > Key: CASSANDRA-13938 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13938 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Nate McCall >Assignee: Jason Brown >Priority: Critical > Attachments: 13938.yaml, test.sh > > > Running through a simple scenario to test some of the new repair features, I > was not able to make a repair command work. Further, the exception seemed to > trigger a nasty failure state that basically shuts down the netty connections > for messaging *and* CQL on the nodes transferring back data to the node being > repaired. The following steps reproduce this issue consistently. > Cassandra stress profile (probably not necessary, but this one provides a > really simple schema and consistent data shape): > {noformat} > keyspace: standard_long > keyspace_definition: | > CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > table: test_data > table_definition: | > CREATE TABLE test_data ( > key text, > ts bigint, > val text, > PRIMARY KEY (key, ts) > ) WITH COMPACT STORAGE AND > CLUSTERING ORDER BY (ts DESC) AND > bloom_filter_fp_chance=0.01 AND > caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND > comment='' AND > dclocal_read_repair_chance=0.00 AND > gc_grace_seconds=864000 AND > read_repair_chance=0.00 AND > compaction={'class': 'SizeTieredCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > columnspec: > - name: key > population: uniform(1..5000) # 50 million records available > - name: ts > cluster: gaussian(1..50) # Up to 50 inserts per record > - name: val > population: gaussian(128..1024) # varrying size of value data > insert: > partitions: fixed(1) # only one insert per batch for individual partitions > select: fixed(1)/1 # each insert comes in one at a time > batchtype: UNLOGGED > queries: > single: > cql: select * from test_data where key = ? and ts = ? limit 1; > series: > cql: select key,ts,val from test_data where key = ? limit 10; > {noformat} > The commands to build and run: > {noformat} > ccm create 4_0_test -v git:trunk -n 3 -s > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4 > # flush the memtable just to get everything on disk > ccm node1 nodetool flush > ccm node2 nodetool flush > ccm node3 nodetool flush > # disable hints for nodes 2 and 3 > ccm node2 nodetool disablehandoff > ccm node3 nodetool disablehandoff > # stop node1 > ccm node1 stop > ccm stress user profile=./histo-test-schema.yml > ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4 > # wait 10 seconds > ccm node1 start > # Note that we are local to ccm's nodetool install 'cause repair preview is > not reported yet > node1/bin/nodetool repair --preview > node1/bin/nodetool repair standard_long test_data > {noformat} > The error outputs from the last repair command follow. First, this is stdout > from node1: > {noformat} > $ node1/bin/nodetool repair standard_long test_data > objc[47876]: Class JavaLaunchHelper is implemented in both > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java > (0x10274d4c0) and > /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib > (0x1047b64e0). One of the two will be used. Which one is undefined. > [2017-10-05 14:31:52,425] Starting repair command #4 > (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairing keyspace standard_long with > repair options (parallelism: parallel, primary range: false, incremental: > true, job threads: 1, ColumnFamilies: [test_data], dataCenters: [], hosts: > [], previewKind: NONE, # of ranges: 3, pull repair: false, force repair: > false) > [2017-10-05 14:32:07,045] Repair session 7e2e8e80-a98e-11e7-ad86-cbd2801b8de2 > for range [(3074457345618258602,-9223372036854775808], > (-9223372036854775808,-3074457345618258603], > (-3074457345618258603,3074457345618258602]] failed with error Stream failed > [2017-10-05 14:32:07,048] null > [2017-10-05 14:32:07,050] Repair command #4 finished in 14 seconds > error: Repair job has failed with the error message: [2017-10-05 > 14:32:07,048] null > -- StackTrace -- > java.lang.RuntimeException: Repair job has failed with the error message: > [2017-10-05 14:32:07,048] null > at