Hi all,
Yesterday our cluster experienced a sudden power loss. When we restarted the broker after power was restored, an exception occurred showing that the userRecordType it loaded was illegal. Our operations team deleted the data journals, after which the broker started successfully. Unfortunately, we didn't back up the problematic journal files.

We checked the dmesg output and found no disk errors, and SMART tests also showed the disk is not broken. We then dug into the code (JournalImpl::readJournalFile) looking for a cause, and we have two doubts about it.

First doubt: the comment says "I - We scan for any valid record on the file. If a hole happened on the middle of the file we keep looking until all the possibilities are gone". Since we only ever append to a journal file and fileId is strictly increasing, it seems we could skip the rest of the file as soon as we read a record whose fileId is not equal to the file's id. IMO the remaining records in the file would have mismatched fileIds too, so there is no need to read them. If we instead keep looking through all the possibilities, is there a possibility (a very low one) that we assemble a record whose fileId, recordType and checkSize all qualify but which was never actually written?

Second doubt: in a power outage where only part of a record reaches the disk, e.g. the recordType and fileId are written successfully but the rest is not, could we end up reading the old record's data even though the fileId is the latest one?

Can anyone shed some light on this, please? Thanks.
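To make the two doubts concrete, here is a toy Java sketch of what we mean. Everything in it is hypothetical: the record layout [recordType:1][fileId:4][bodyLen:4][body][checkSize:4], the valid type range 1..3, and the names JournalScanSketch, scanAll, scanStopOnStaleId and tornFile. It is not JournalImpl's real format or logic. scanAll approximates the "keep looking" approach by advancing one byte whenever no matching record starts at a position; scanStopOnStaleId approximates our proposal of stopping at the first well-formed record that carries a different fileId.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model only. Hypothetical layout, NOT the real journal format:
//   [recordType:1][fileId:4][bodyLen:4][body:bodyLen][checkSize:4]
// where checkSize must equal the total record length.
public class JournalScanSketch {
    static final int HEADER = 1 + 4 + 4;
    static final int TRAILER = 4;

    // Length of a well-formed record starting at pos, or -1. The fileId is not checked here.
    static int wellFormedSize(ByteBuffer buf, int pos) {
        if (pos + HEADER + TRAILER > buf.limit()) return -1;
        byte type = buf.get(pos);
        if (type < 1 || type > 3) return -1; // hypothetical valid record types 1..3
        int bodyLen = buf.getInt(pos + 5);
        if (bodyLen < 0 || bodyLen > buf.limit() - pos - HEADER - TRAILER) return -1;
        int size = HEADER + bodyLen + TRAILER;
        return buf.getInt(pos + HEADER + bodyLen) == size ? size : -1;
    }

    // "Keep looking": if no matching record starts here, advance one byte and retry.
    static List<Integer> scanAll(ByteBuffer buf, int fileId) {
        List<Integer> found = new ArrayList<>();
        int pos = 0;
        while (pos < buf.limit()) {
            int size = wellFormedSize(buf, pos);
            if (size > 0 && buf.getInt(pos + 1) == fileId) {
                found.add(pos);
                pos += size;
            } else {
                pos++;
            }
        }
        return found;
    }

    // Proposed alternative: step over holes, but stop at the first well-formed
    // record whose fileId differs from the file's id (treating the rest as stale).
    static List<Integer> scanStopOnStaleId(ByteBuffer buf, int fileId) {
        List<Integer> found = new ArrayList<>();
        int pos = 0;
        while (pos < buf.limit()) {
            int size = wellFormedSize(buf, pos);
            if (size < 0) {
                pos++;
                continue;
            }
            if (buf.getInt(pos + 1) != fileId) break;
            found.add(pos);
            pos += size;
        }
        return found;
    }

    // Appends a record at the buffer's current position.
    static void put(ByteBuffer buf, int type, int fileId, byte[] body) {
        buf.put((byte) type).putInt(fileId).putInt(body.length).put(body)
           .putInt(HEADER + body.length + TRAILER);
    }

    // Encodes a single record into its own byte array.
    static byte[] encode(int type, int fileId, String body) {
        byte[] b = body.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(HEADER + b.length + TRAILER);
        put(buf, type, fileId, b);
        return buf.array();
    }

    // File with id 7: record at 0, a 6-byte hole, record at 24, then a stale record
    // (fileId 6) whose body happens to contain a well-formed fileId-7 record.
    static ByteBuffer sampleFile() {
        ByteBuffer buf = ByteBuffer.allocate(80);  // zero-filled
        buf.put(encode(1, 7, "hello"));            // offset 0
        buf.position(buf.position() + 6);          // hole
        buf.put(encode(2, 7, "world"));            // offset 24
        put(buf, 1, 6, encode(1, 7, "ghost"));     // offset 42, embedded copy at 51
        return buf;
    }

    // Torn write: the file is reused as id 8 and only the 9-byte header of a new
    // record reached disk over the record at 24; its body and checkSize are old.
    static ByteBuffer tornFile() {
        ByteBuffer buf = sampleFile();
        buf.put(24, (byte) 3).putInt(25, 8).putInt(29, 5);
        return buf;
    }

    public static void main(String[] args) {
        System.out.println(scanAll(sampleFile(), 7));           // [0, 24, 51]
        System.out.println(scanStopOnStaleId(sampleFile(), 7)); // [0, 24]
        System.out.println(scanAll(tornFile(), 8));             // [24]
    }
}
```

With this toy layout, keep-looking recovers the record after the hole but also accepts a phantom record at offset 51 that only exists inside the stale record's body, while stopping on a stale fileId avoids it. The torn-write case shows a header-plus-checkSize check accepting a new header sitting on an older same-length record's body. Whether the real journal has extra checks that rule these cases out is exactly what we'd like to know.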
