There are at least two bugs in the compaction lifecycle transaction log:
one that can write an ABORT and an ADD record in the wrong order (which
prevents startup), and one that allows invalid timestamps in the log file
(which also prevents startup).

I believe it's safe to work around the former by removing the .log file,
and you can work around the latter by using `touch` to update the
timestamp of the mismatched data file, but I can't find the relevant
JIRAs to be 100% sure.
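
Before removing anything, it may help to see which replica copies of a
transaction log actually disagree. Below is a minimal read-only sketch in
Python. It is not Cassandra's own verification logic. It assumes that
replica copies share a file name across data directories and match the
`*_txn_*.log` pattern seen in the log output quoted below. The function
names are hypothetical.

```python
import sys
from collections import defaultdict
from pathlib import Path


def read_records(path):
    """Return the non-empty, stripped lines of one transaction log file."""
    text = Path(path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


def first_mismatch(a, b):
    """Return (index, a_line, b_line) for the first differing record, or None."""
    for i in range(max(len(a), len(b))):
        left = a[i] if i < len(a) else None
        right = b[i] if i < len(b) else None
        if left != right:
            return i, left, right
    return None


def find_mismatched_logs(table_dirs):
    """Group *_txn_*.log files by name across directories and compare replicas.

    Assumption: every directory in table_dirs is the same table's directory
    on a different data disk, so same-named logs are replicas of each other.
    """
    by_name = defaultdict(list)
    for d in table_dirs:
        for p in sorted(Path(d).glob("*_txn_*.log")):
            by_name[p.name].append(p)

    problems = []
    for name, paths in sorted(by_name.items()):
        reference = read_records(paths[0])
        for other in paths[1:]:
            diff = first_mismatch(reference, read_records(other))
            if diff is not None:
                problems.append((paths[0], other, diff))
    return problems


if __name__ == "__main__":
    # Pass the table directories from each data disk as arguments.
    for ref, other, (i, left, right) in find_mismatched_logs(sys.argv[1:]):
        print(f"{other} differs from {ref} at record {i}:")
        print(f"  first replica: {left!r}")
        print(f"  this replica:  {right!r}")
```

The script only reads files, so it can be run against a stopped node to
list the affected logs before deciding which ones to remove.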

(Also, this may be a good trigger to cut a new release, because bugs that
block startup are obviously quite serious.)




On Wed, Aug 30, 2023 at 6:59 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi all - I replaced a node in a 14 node cluster, and it rebuilt OK.  I
> started to see a lot of timeout errors, and discovered one of the nodes
> had this message constantly repeated:
> "waiting to acquire a permit to begin streaming" - so perhaps I hit this
> bug:
> https://www.mail-archive.com/commits@cassandra.apache.org/msg284709.html
>
> I then restarted that node, but it gave a bunch of errors about
> "unexpected disk state: failed to read transaction log"
> I deleted the corresponding files and got that node to come up, but now
> when I restart any of the other nodes in the cluster, they too do not
> start back up:
>
> Example:
>
> INFO  [main] 2023-08-30 09:50:46,130 LogTransaction.java:544 - Verifying
> logfile transaction
> [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in
> /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3,
>
>
> /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
> ERROR [main] 2023-08-30 09:50:46,154 LogReplicaSet.java:145 - Mismatched
> line in file nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log: got
> 'ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]'
>
> expected
> 'ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]',
>
> giving up
> ERROR [main] 2023-08-30 09:50:46,155 LogFile.java:164 - Failed to read
> records for transaction log
> [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in
> /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3,
>
>
> /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
> ERROR [main] 2023-08-30 09:50:46,156 LogTransaction.java:559 -
> Unexpected disk state: failed to read transaction log
> [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in
> /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3,
>
>
> /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
> Files and contents follow:
>
> /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]
>          ABORT:[,0,0][737437348]
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37644-big-,0,8][3122518803]
>
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37643-big-,0,8][2875951075]
>
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37642-big-,0,8][884016253]
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37641-big-,0,8][926833718]
>
> /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]
>                  ***Does not match
> <ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]>
>
> in first replica file
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37644-big-,0,8][3122518803]
>
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37643-big-,0,8][2875951075]
>
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37642-big-,0,8][884016253]
>
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37641-big-,0,8][926833718]
>
> ERROR [main] 2023-08-30 09:50:46,156 CassandraDaemon.java:897 - Cannot
> remove temporary or obsoleted files for doc.extractedmetadata due to a
> problem with transaction log files. Please check records with problems
> in the log messages above and fix them. Refer to the 3.0 upgrading
> instructions in NEWS.txt for a description of transaction log files.
>
> I then delete the files and eventually after many iterations, the node
> comes back up.
> The table 'extractedmetadata' has 29 billion records.  Just a data point
> here - I think the 'right' thing to do is just to go to each node and
> stop it, clean up the files, and finally get each one back up?
>
> -Joe
>
>
