Hi all - I replaced a node in a 14-node cluster, and it rebuilt OK. I then started to see a lot of timeout errors, and discovered one of the nodes had this message constantly repeated: "waiting to acquire a permit to begin streaming" - so perhaps I hit this bug:
https://www.mail-archive.com/commits@cassandra.apache.org/msg284709.html

I then restarted that node, but it gave a bunch of errors about "unexpected disk state: failed to read transaction log". I deleted the corresponding files and got that node to come up, but now when I restart any of the other nodes in the cluster, they too fail to start back up:

Example:

INFO  [main] 2023-08-30 09:50:46,130 LogTransaction.java:544 - Verifying logfile transaction [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3, /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
ERROR [main] 2023-08-30 09:50:46,154 LogReplicaSet.java:145 - Mismatched line in file nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log: got 'ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]' expected 'ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]', giving up
ERROR [main] 2023-08-30 09:50:46,155 LogFile.java:164 - Failed to read records for transaction log [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3, /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
ERROR [main] 2023-08-30 09:50:46,156 LogTransaction.java:559 - Unexpected disk state: failed to read transaction log [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3, /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
Files and contents follow:
/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]
        ABORT:[,0,0][737437348]
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37644-big-,0,8][3122518803]
ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37643-big-,0,8][2875951075]
ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37642-big-,0,8][884016253]
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37641-big-,0,8][926833718]
/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]
                ***Does not match <ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]> in first replica file
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37644-big-,0,8][3122518803]
ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37643-big-,0,8][2875951075]
ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37642-big-,0,8][884016253]
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37641-big-,0,8][926833718]

ERROR [main] 2023-08-30 09:50:46,156 CassandraDaemon.java:897 - Cannot remove temporary or obsoleted files for doc.extractedmetadata due to a problem with transaction log files. Please check records with problems in the log messages above and fix them. Refer to the 3.0 upgrading instructions in NEWS.txt for a description of transaction log files.

I then deleted the files, and after many iterations the node eventually came back up. The table 'extractedmetadata' has 29 billion records - just a data point here. I think the 'right' thing to do is to go to each node in turn, stop it, clean up the stale transaction log files, and bring it back up?
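For anyone following along, here is a minimal sketch of the per-node cleanup I'm describing, assuming the only problem is leftover stream transaction logs (the `nb_txn_stream_*.log` files) from the interrupted streaming session. The function name and the dry-run/list behavior are my own; run it only on a node that is already stopped, and check the listed files before actually removing anything:

```shell
#!/bin/sh
# clean_stream_txn_logs DIR [apply]
#
# Lists leftover stream transaction logs (nb_txn_stream_*.log) under
# the given Cassandra data directory. With a second argument of
# "apply" it removes them instead of just listing them. Only these
# txn log files are touched; sstable data files are left alone.
clean_stream_txn_logs() {
    dir="$1"
    mode="${2:-list}"
    find "$dir" -type f -name 'nb_txn_stream_*.log' | while read -r f; do
        if [ "$mode" = "apply" ]; then
            rm -v "$f"
        else
            echo "would remove: $f"
        fi
    done
}
```

Usage would be something like `clean_stream_txn_logs /data/1/cassandra/data` on each data directory after a `nodetool drain` and a clean stop of the Cassandra process, then a review of the output before rerunning with `apply` and restarting the node.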

-Joe


