[
https://issues.apache.org/jira/browse/KUDU-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon resolved KUDU-1670.
-------------------------------
Resolution: Duplicate
Fix Version/s: n/a
> Crash when bootstrapping recovery wal
> -------------------------------------
>
> Key: KUDU-1670
> URL: https://issues.apache.org/jira/browse/KUDU-1670
> Project: Kudu
> Issue Type: Bug
> Components: tablet
> Affects Versions: 1.0.0
> Reporter: Dan Burkert
> Assignee: Todd Lipcon
> Priority: Critical
> Fix For: n/a
>
> Attachments: alter_table-randomized-test.txt.gz
>
>
> {{alter_table-randomized-itest}} turned up an interesting crash in the tablet
> bootstrap / wal recovery codepath. Relevant log lines (full log attached):
> {code}
> I0930 09:20:45.003783 17173 ts_tablet_manager.cc:632] T
> fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e:
> Bootstrapping tablet
> I0930 09:20:45.008419 17173 tablet_bootstrap.cc:380] T
> fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e:
> Bootstrap starting.
> I0930 09:20:45.025218 17173 tablet_bootstrap.cc:542] T
> fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Time
> spent opening tablet: real 0.000s user 0.001s sys 0.000s
> I0930 09:20:45.025753 17173 tablet_bootstrap.cc:598] T
> fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Will
> attempt to recover log segment
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246/wal-000000001
> to
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
> I0930 09:20:45.026008 17173 tablet_bootstrap.cc:606] T
> fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Moving
> log directory
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246
> to recovery directory
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery
> in preparation for log replay
> W0930 09:20:45.026758 17173 log_util.cc:475] Log segment file
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
> has 12 initial NULL bytes instead of magic and header length:
> \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 and will be treated as a
> blank segment.
> I0930 09:20:45.027081 17173 log_util.cc:312] Log segment
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
> has no footer. This segment was likely being written when the server
> previously shut down.
> I0930 09:20:45.027308 17173 log_reader.cc:151] Log segment
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
> was likely left in-progress after a previous crash. Will try to rebuild
> footer by scanning data.
> I0930 09:20:45.027604 17173 log_util.cc:575] Scanning
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
> for valid entry headers following offset 12...
> I0930 09:20:45.842016 17173 log_util.cc:612] Found no log entry headers
> I0930 09:20:45.873320 17173 log_util.cc:215] Ignoring log segment corruption
> in
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
> because there are no log entries following the corrupted one. The server
> probably crashed in the middle of writing an entry to the write-ahead log or
> downloaded an active log via tablet copy. Error detail: Corruption: CRC
> mismatch in log entry header: Log file corruption detected. Failed trying to
> read batch #0 at offset 0 for log segment
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001:
> Prior entries:
> I0930 09:20:45.873550 17173 log_util.cc:365] Successfully rebuilt footer for
> segment:
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
> (valid entries through byte offset 0)
> F0930 09:20:45.873761 17173 log_util.h:208] Check failed:
> header_.IsInitialized()
> *** Check failure stack trace: ***
> @ 0x7fa678822e0d google::LogMessage::SendToLog() at ??:0
> @ 0x7fa6788237e9 google::LogMessage::Flush() at ??:0
> @ 0x7fa67882722b google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x7fa67f074331 kudu::log::ReadableLogSegment::header() at ??:0
> @ 0x7fa67d623e37 kudu::log::LogReader::Init() at ??:0
> @ 0x7fa67d624488 kudu::log::LogReader::OpenFromRecoveryDir() at ??:0
> @ 0x7fa67dc38dcc
> kudu::tablet::TabletBootstrap::OpenLogReaderInRecoveryDir() at ??:0
> @ 0x7fa67dc371fa kudu::tablet::TabletBootstrap::Bootstrap() at ??:0
> @ 0x7fa67dc36c6d kudu::tablet::BootstrapTablet() at ??:0
> @ 0x7fa67f0a1d98 kudu::tserver::TSTabletManager::OpenTablet() at ??:0
> @ 0x7fa67f0abeb8 boost::_mfi::mf2<>::operator()() at ??:0
> @ 0x7fa67f0abe04 boost::_bi::list3<>::operator()<>() at ??:0
> @ 0x7fa67f0abd74 boost::_bi::bind_t<>::operator()() at ??:0
> @ 0x7fa67f0abb12
> boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
> @ 0x7fa67ca1f432 boost::function0<>::operator()() at ??:0
> @ 0x7fa6790744fe kudu::FunctionRunnable::Run() at ??:0
> @ 0x7fa6790732de kudu::ThreadPool::DispatchThread() at ??:0
> @ 0x7fa679075b9f boost::_mfi::mf1<>::operator()() at ??:0
> @ 0x7fa679075afc boost::_bi::list2<>::operator()<>() at ??:0
> @ 0x7fa679075a74 boost::_bi::bind_t<>::operator()() at ??:0
> @ 0x7fa679075872
> boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
> @ 0x7fa67ca1f432 boost::function0<>::operator()() at ??:0
> @ 0x7fa67906cc55 kudu::Thread::SuperviseThread() at ??:0
> @ 0x42506c __tsan_thread_start_func at ??:0
> @ 0x34186079d1 (unknown) at ??:0
> @ 0x34182e88fd (unknown) at ??:0
> @ (nil) (unknown)
> {code}
> This is my first time spelunking through this code, but it seems we have some
> broken invariants in this case. We have a {{ReadableLogSegment}} with an
> uninitialized header, but with its {{is_initialized_}} flag set to true. The
> following comment from log_util.cc:388 leads me to believe this is not
> expected:
> {code}
> // If a log file has been pre-allocated but not initialized, then
> // 'header_size' will be 0 even the file size is > 0; in this
> // case, 'is_initialized_' remains set to false and return
> // Status::OK() early. LogReader ignores segments where
> // IsInitialized() returns false.
> {code}
> You can see from the warning log lines above that we are indeed in the case
> where the magic bytes are all 0. By reading through the code it's clear to
> me _how_ this happened ({{is_initalized_}} gets set at log_util.cc:321), but
> I'm not quite familiar with the code enough to figure out how to fix it, or
> what the relevant tests should be. The stack at the point where this
> invariant is broken is (growing down):
> {code}
> ReadableLogSegment::Init() // No arg version, eventually sets
> is_initialized_=true since none of the initialization steps fail
> ReadableLogSegment::ReadHeader // Returns Status::OK back to Init, because
> header_size==0
> ReadableLogSegment::ReadHeaderMagicAndHeaderLength // Returns Status::OK
> ReadableLogSegment::ParseHeaderMagicAndHeaderLength // Returns Status::OK,
> because magic bytes are all 0, sets parsed_len=0 (parsed_len is the same as
> header_size in the ReadHeader frame)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)