[jira] [Comment Edited] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment

2019-08-16 Thread Vladimir Vavro (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908894#comment-16908894
 ] 

Vladimir Vavro edited comment on CASSANDRA-15274 at 8/16/19 9:30 AM:
-

Since affected version is 2.2.x there is no sstabledump available, but there is 
sstable2json. We tried to export one file and the attempt failed - but it looks 
like it again failed during the crc check based on this part of error message:


 Caused by: org.apache.cassandra.io.compress.CorruptBlockException: 
(/data/ssd2/data/KeyspaceMetadata/CF_ConversationIndex1-1e77be609c7911e8ac12255de1fb512a/lb-10664-big-Data.db):
 corruption detected, chunk at 7392105638 of length 35173.
        at 
org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:185)


 Is it possible that sstable2json is using the same code to handle the data as 
Cassandra normally does? If it true, is it different for newer utilities 
sstableexport/sstabledump ?


was (Author: vvavro):
Since affected version is 2.2.x there is no sstabledump available, but there is 
sstable2json. We tried to export one file and the attempt failed - but it looks 
like it again failed during the crc check based on this part of error message:
Caused by: org.apache.cassandra.io.compress.CorruptBlockException: 
(/data/ssd2/data/KeyspaceMetadata/CF_ConversationIndex1-1e77be609c7911e8ac12255de1fb512a/lb-10664-big-Data.db):
 corruption detected, chunk at 7392105638 of length 35173.
        at 
org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:185)
Is it possible that sstable2json is using the same code to handle the data as 
Cassandra normally does? If it true, is it different for newer utilities 
sstableexport/sstabledump ?

> Multiple Corrupt datafiles across entire environment 
> -
>
> Key: CASSANDRA-15274
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15274
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Compaction
>Reporter: Phil O Conduin
>Priority: Normal
>
> Cassandra Version: 2.2.13
> PRE-PROD environment.
>  * 2 datacenters.
>  * 9 physical servers in each datacenter - (_Cisco UCS C220 M4 SFF_)
>  * 4 Cassandra instances on each server (cass_a, cass_b, cass_c, cass_d)
>  * 72 Cassandra instances across the 2 data centres, 36 in site A, 36 in site 
> B.
> We also have 2 Reaper Nodes we use for repair.  One reaper node in each 
> datacenter each running with its own Cassandra back end in a cluster together.
> OS Details [Red Hat Linux]
> cass_a@x 0 10:53:01 ~ $ uname -a
> Linux x 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 
> x86_64 x86_64 GNU/Linux
> cass_a@x 0 10:57:31 ~ $ cat /etc/*release
> NAME="Red Hat Enterprise Linux Server"
> VERSION="7.6 (Maipo)"
> ID="rhel"
> Storage Layout 
> cass_a@xx 0 10:46:28 ~ $ df -h
> Filesystem                         Size  Used Avail Use% Mounted on
> /dev/mapper/vg01-lv_root            20G  2.2G   18G  11% /
> devtmpfs                            63G     0   63G   0% /dev
> tmpfs                               63G     0   63G   0% /dev/shm
> tmpfs                               63G  4.1G   59G   7% /run
> tmpfs                               63G     0   63G   0% /sys/fs/cgroup
> >> 4 cassandra instances
> /dev/sdd                           1.5T  802G  688G  54% /data/ssd4
> /dev/sda                           1.5T  798G  692G  54% /data/ssd1
> /dev/sdb                           1.5T  681G  810G  46% /data/ssd2
> /dev/sdc                           1.5T  558G  932G  38% /data/ssd3
> Cassandra load is about 200GB and the rest of the space is snapshots
> CPU
> cass_a@x 127 10:58:47 ~ $ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
> CPU(s):                64
> Thread(s) per core:    2
> Core(s) per socket:    16
> Socket(s):             2
> *Description of problem:*
> During repair of the cluster, we are seeing multiple corruptions in the log 
> files on a lot of instances.  There seems to be no pattern to the corruption. 
>  It seems that the repair job is finding all the corrupted files for us.  The 
> repair will hang on the node where the corrupted file is found.  To fix this 
> we remove/rename the datafile and bounce the Cassandra instance.  Our 
> hardware/OS team have stated there is no problem on their side.  I do not 
> believe it the repair causing the corruption. 
>  
> So let me give you an example of a corrupted file and maybe someone might be 
> able to work through it with me?
> When this corrupted file was reported in the log it looks like it was the 
> repair that found it.
> $ journalctl -u cassmeta-cass_b.service --since "2019-08-07 22:25:00" --until 
> "2019-08-07 22:45:00"
> Aug 07 22:30:33 cassandra[34611]: INFO  

[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment

2019-08-16 Thread Vladimir Vavro (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908894#comment-16908894
 ] 

Vladimir Vavro commented on CASSANDRA-15274:


Since affected version is 2.2.x there is no sstabledump available, but there is 
sstable2json. We tried to export one file and the attempt failed - but it looks 
like it again failed during the crc check based on this part of error message:
Caused by: org.apache.cassandra.io.compress.CorruptBlockException: 
(/data/ssd2/data/KeyspaceMetadata/CF_ConversationIndex1-1e77be609c7911e8ac12255de1fb512a/lb-10664-big-Data.db):
 corruption detected, chunk at 7392105638 of length 35173.
        at 
org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:185)
Is it possible that sstable2json is using the same code to handle the data as 
Cassandra normally does? If it true, is it different for newer utilities 
sstableexport/sstabledump ?

> Multiple Corrupt datafiles across entire environment 
> -
>
> Key: CASSANDRA-15274
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15274
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Compaction
>Reporter: Phil O Conduin
>Priority: Normal
>
> Cassandra Version: 2.2.13
> PRE-PROD environment.
>  * 2 datacenters.
>  * 9 physical servers in each datacenter - (_Cisco UCS C220 M4 SFF_)
>  * 4 Cassandra instances on each server (cass_a, cass_b, cass_c, cass_d)
>  * 72 Cassandra instances across the 2 data centres, 36 in site A, 36 in site 
> B.
> We also have 2 Reaper Nodes we use for repair.  One reaper node in each 
> datacenter each running with its own Cassandra back end in a cluster together.
> OS Details [Red Hat Linux]
> cass_a@x 0 10:53:01 ~ $ uname -a
> Linux x 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 
> x86_64 x86_64 GNU/Linux
> cass_a@x 0 10:57:31 ~ $ cat /etc/*release
> NAME="Red Hat Enterprise Linux Server"
> VERSION="7.6 (Maipo)"
> ID="rhel"
> Storage Layout 
> cass_a@xx 0 10:46:28 ~ $ df -h
> Filesystem                         Size  Used Avail Use% Mounted on
> /dev/mapper/vg01-lv_root            20G  2.2G   18G  11% /
> devtmpfs                            63G     0   63G   0% /dev
> tmpfs                               63G     0   63G   0% /dev/shm
> tmpfs                               63G  4.1G   59G   7% /run
> tmpfs                               63G     0   63G   0% /sys/fs/cgroup
> >> 4 cassandra instances
> /dev/sdd                           1.5T  802G  688G  54% /data/ssd4
> /dev/sda                           1.5T  798G  692G  54% /data/ssd1
> /dev/sdb                           1.5T  681G  810G  46% /data/ssd2
> /dev/sdc                           1.5T  558G  932G  38% /data/ssd3
> Cassandra load is about 200GB and the rest of the space is snapshots
> CPU
> cass_a@x 127 10:58:47 ~ $ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
> CPU(s):                64
> Thread(s) per core:    2
> Core(s) per socket:    16
> Socket(s):             2
> *Description of problem:*
> During repair of the cluster, we are seeing multiple corruptions in the log 
> files on a lot of instances.  There seems to be no pattern to the corruption. 
>  It seems that the repair job is finding all the corrupted files for us.  The 
> repair will hang on the node where the corrupted file is found.  To fix this 
> we remove/rename the datafile and bounce the Cassandra instance.  Our 
> hardware/OS team have stated there is no problem on their side.  I do not 
> believe it the repair causing the corruption. 
>  
> So let me give you an example of a corrupted file and maybe someone might be 
> able to work through it with me?
> When this corrupted file was reported in the log it looks like it was the 
> repair that found it.
> $ journalctl -u cassmeta-cass_b.service --since "2019-08-07 22:25:00" --until 
> "2019-08-07 22:45:00"
> Aug 07 22:30:33 cassandra[34611]: INFO  21:30:33 Writing 
> Memtable-compactions_in_progress@830377457(0.008KiB serialized bytes, 1 ops, 
> 0%/0% of on/off-heap limit)
> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Failed creating a merkle 
> tree for [repair #9587a200-b95a-11e9-8920-9f72868b8375 on KeyspaceMetadata/x, 
> (-1476350953672479093,-1474461
> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Exception in thread 
> Thread[ValidationExecutor:825,1,main]
> Aug 07 22:30:33 cassandra[34611]: org.apache.cassandra.io.FSReadError: 
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /x/ssd2/data/KeyspaceMetadata/x-1e453cb0
> Aug 07 22:30:33 cassandra[34611]: at 
> org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:365)
>  ~[apache-cassandra-2.2.13.jar:2.2.13]
> Aug 07 22:30:33 cassandra[34611]: at 
> 

[jira] [Commented] (CASSANDRA-15274) Multiple Corrupt datafiles across entire environment

2019-08-14 Thread Vladimir Vavro (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907212#comment-16907212
 ] 

Vladimir Vavro commented on CASSANDRA-15274:


If it is possible to export suspicious sstable into json format, the challenge 
might be to verify, if the exported data are valid or corrupted. However my 
understanding is that crc check is optional but decompression is obviously not 
for the export tool. If the binary data before decompression are corrupted, 
should not we see binary garbage in the json output ?

> Multiple Corrupt datafiles across entire environment 
> -
>
> Key: CASSANDRA-15274
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15274
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Compaction
>Reporter: Phil O Conduin
>Priority: Normal
>
> Cassandra Version: 2.2.13
> PRE-PROD environment.
>  * 2 datacenters.
>  * 9 physical servers in each datacenter - (_Cisco UCS C220 M4 SFF_)
>  * 4 Cassandra instances on each server (cass_a, cass_b, cass_c, cass_d)
>  * 72 Cassandra instances across the 2 data centres, 36 in site A, 36 in site 
> B.
> We also have 2 Reaper Nodes we use for repair.  One reaper node in each 
> datacenter each running with its own Cassandra back end in a cluster together.
> OS Details [Red Hat Linux]
> cass_a@x 0 10:53:01 ~ $ uname -a
> Linux x 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018 x86_64 
> x86_64 x86_64 GNU/Linux
> cass_a@x 0 10:57:31 ~ $ cat /etc/*release
> NAME="Red Hat Enterprise Linux Server"
> VERSION="7.6 (Maipo)"
> ID="rhel"
> Storage Layout 
> cass_a@xx 0 10:46:28 ~ $ df -h
> Filesystem                         Size  Used Avail Use% Mounted on
> /dev/mapper/vg01-lv_root            20G  2.2G   18G  11% /
> devtmpfs                            63G     0   63G   0% /dev
> tmpfs                               63G     0   63G   0% /dev/shm
> tmpfs                               63G  4.1G   59G   7% /run
> tmpfs                               63G     0   63G   0% /sys/fs/cgroup
> >> 4 cassandra instances
> /dev/sdd                           1.5T  802G  688G  54% /data/ssd4
> /dev/sda                           1.5T  798G  692G  54% /data/ssd1
> /dev/sdb                           1.5T  681G  810G  46% /data/ssd2
> /dev/sdc                           1.5T  558G  932G  38% /data/ssd3
> Cassandra load is about 200GB and the rest of the space is snapshots
> CPU
> cass_a@x 127 10:58:47 ~ $ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
> CPU(s):                64
> Thread(s) per core:    2
> Core(s) per socket:    16
> Socket(s):             2
> *Description of problem:*
> During repair of the cluster, we are seeing multiple corruptions in the log 
> files on a lot of instances.  There seems to be no pattern to the corruption. 
>  It seems that the repair job is finding all the corrupted files for us.  The 
> repair will hang on the node where the corrupted file is found.  To fix this 
> we remove/rename the datafile and bounce the Cassandra instance.  Our 
> hardware/OS team have stated there is no problem on their side.  I do not 
> believe it the repair causing the corruption. 
>  
> So let me give you an example of a corrupted file and maybe someone might be 
> able to work through it with me?
> When this corrupted file was reported in the log it looks like it was the 
> repair that found it.
> $ journalctl -u cassmeta-cass_b.service --since "2019-08-07 22:25:00" --until 
> "2019-08-07 22:45:00"
> Aug 07 22:30:33 cassandra[34611]: INFO  21:30:33 Writing 
> Memtable-compactions_in_progress@830377457(0.008KiB serialized bytes, 1 ops, 
> 0%/0% of on/off-heap limit)
> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Failed creating a merkle 
> tree for [repair #9587a200-b95a-11e9-8920-9f72868b8375 on KeyspaceMetadata/x, 
> (-1476350953672479093,-1474461
> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Exception in thread 
> Thread[ValidationExecutor:825,1,main]
> Aug 07 22:30:33 cassandra[34611]: org.apache.cassandra.io.FSReadError: 
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /x/ssd2/data/KeyspaceMetadata/x-1e453cb0
> Aug 07 22:30:33 cassandra[34611]: at 
> org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:365)
>  ~[apache-cassandra-2.2.13.jar:2.2.13]
> Aug 07 22:30:33 cassandra[34611]: at 
> org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:361) 
> ~[apache-cassandra-2.2.13.jar:2.2.13]
> Aug 07 22:30:33 cassandra[34611]: at 
> org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:340)
>  ~[apache-cassandra-2.2.13.jar:2.2.13]
> Aug 07 22:30:33 cassandra[34611]: at 
> org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:382)
>