A little correction: that description was for a simple AddData.
Look at the extensions of JournalInternalRecord for a better description of the data format.

On Tue, Jul 16, 2019 at 10:43 AM Clebert Suconic <[email protected]> wrote:
>
> It seems you are using 2.4.0. It does not seem related, but it would be
> important to have this fix on your system:
>
> commit 6b1abd1aadc2d097e3baefeb312c8e68092876ba
> Author: Clebert Suconic <[email protected]>
> Date:   Sun Aug 26 15:55:56 2018 -0400
>
>     ARTEMIS-2053 avoiding data loss after compacting
>
> However, let me explain how the record scanning works.
>
> The format of the data is, roughly:
>
> JOURNAL-RECORD-TYPE (byte)
> FILE-ID (int)
> compact-count (byte)
> recordID (long)
> recordSize, from persisters (int)
> userRecordType
> total-record-size
>
> When we recycle a file, we simply change the fileID on the header, and
> when we load the file, the scan is done by matching the record-type,
> and the total-record-size at the end of the record has to match the
> record size.
>
> I did this to avoid filling up the file with zeros, which was a costly
> operation at the time (I wrote this when disks were still mechanical),
> and it is still a costly operation today.
>
> So, to wrongly trick the scan you would need a matching record-type
> byte, a matching fileID at the next int, and a recordSize and
> total-record-size that match each other.
>
> Perhaps the loading is skipping the verification on total-record-size,
> and that let an invalid record sneak in?
>
> Or perhaps the commit you are missing, which I mentioned above, caused
> the issue?
>
> On Mon, Jul 1, 2019 at 5:57 AM yw yw <[email protected]> wrote:
> >
> > Hi all,
> >
> > Yesterday our cluster experienced a sudden loss of power. When we
> > started the broker after power was restored, an exception occurred.
> >
> > The exception showed that the userRecordType loaded was illegal. The
> > operations team deleted the data journals and the broker started
> > successfully.
> >
> > It is a pity we didn't back up the problematic journal files. We
> > checked the dmesg output: no disk errors. SMART tests also showed the
> > disk was not broken. Then we dug into the code
> > (JournalImpl::readJournalFile) and tried to find something. We have
> > two doubts about the code.
> >
> > First doubt:
> >
> > The comment says "I - We scan for any valid record on the file. If a
> > hole happened on the middle of the file we keep looking until all the
> > possibilities are gone".
> >
> > Since we append to the journal file and the fileId is strictly
> > increasing, we could skip the whole file as soon as a record's fileId
> > is not equal to the file's id. IMO the rest of the records in the file
> > would be the same, so there is no need to read them. If we keep
> > looking at all the possibilities, is there a (very low) possibility
> > that we assemble a record whose fileId, recordType, and checkSize all
> > qualify but which does not actually exist?
> >
> > Our second doubt:
> >
> > In the case of a power outage where only part of a record is written
> > to disk, e.g. the recordType and fileId are successfully written,
> > could we read an old record even though the fileId is the latest?
> >
> > Can anyone shed some light on this, please? Thanks.
>
> --
> Clebert Suconic

--
Clebert Suconic
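
For readers following the thread, here is a minimal Java sketch of the
scan validation Clebert describes: match the record-type byte, match the
fileID, then check that the trailing total-record-size agrees with the
bytes actually read. The field widths follow the layout in the email; the
class and method names and the exact trailing-size rule are assumptions
for illustration, not the actual Artemis code (see JournalImpl and the
JournalInternalRecord subclasses for the real implementation).

import java.nio.ByteBuffer;

// Minimal sketch (not the actual Artemis code) of the record scan
// validation described in the thread above.
public class JournalScanSketch {

    // byte + int + byte + long + int + byte = 19 bytes of header,
    // plus a trailing int for total-record-size.
    private static final int HEADER_BYTES = 19;
    private static final int TRAILER_BYTES = 4;

    // Returns true if the bytes at the buffer's position look like a
    // valid record of the expected type belonging to the expected file.
    // On failure the buffer position is restored so the scan can advance
    // byte by byte and keep looking, as the journal loader does.
    static boolean looksLikeValidRecord(ByteBuffer buf,
                                        byte expectedType,
                                        int expectedFileId) {
        if (buf.remaining() < HEADER_BYTES + TRAILER_BYTES) {
            return false;
        }
        int start = buf.position();

        byte recordType = buf.get();      // JOURNAL-RECORD-TYPE (byte)
        int fileId = buf.getInt();        // FILE-ID (int)
        byte compactCount = buf.get();    // compact-count (byte)
        long recordId = buf.getLong();    // recordID (long)
        int recordSize = buf.getInt();    // recordSize, from persisters (int)
        byte userRecordType = buf.get();  // userRecordType (byte)

        if (recordType != expectedType || fileId != expectedFileId
                || recordSize < 0
                || buf.remaining() < recordSize + TRAILER_BYTES) {
            buf.position(start);
            return false;
        }

        buf.position(buf.position() + recordSize);  // skip the record body
        int totalRecordSize = buf.getInt();         // trailing total-record-size

        // Assumed rule: the trailing size must agree with the bytes
        // consumed. Stale records in a recycled file fail the fileId
        // check; torn (partially written) records fail this size check.
        boolean valid = totalRecordSize == buf.position() - start;
        if (!valid) {
            buf.position(start);
        }
        return valid;
    }
}

This is also where the trade-off Clebert mentions shows up: because
recycled files are not zeroed, the fileID match plus the trailing-size
check are what distinguish live records from leftover bytes of the file's
previous life, at the (very low) risk of a false positive when all three
happen to line up by accident.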
