[ 
https://issues.apache.org/jira/browse/KUDU-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-1670.
-------------------------------
       Resolution: Duplicate
    Fix Version/s: n/a

> Crash when bootstrapping recovery wal
> -------------------------------------
>
>                 Key: KUDU-1670
>                 URL: https://issues.apache.org/jira/browse/KUDU-1670
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.0.0
>            Reporter: Dan Burkert
>            Assignee: Todd Lipcon
>            Priority: Critical
>             Fix For: n/a
>
>         Attachments: alter_table-randomized-test.txt.gz
>
>
> {{alter_table-randomized-itest}} turned up an interesting crash in the tablet 
> bootstrap / wal recovery codepath.  Relevant log lines (full log attached):
> {code}
> I0930 09:20:45.003783 17173 ts_tablet_manager.cc:632] T 
> fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: 
> Bootstrapping tablet
> I0930 09:20:45.008419 17173 tablet_bootstrap.cc:380] T 
> fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: 
> Bootstrap starting.
> I0930 09:20:45.025218 17173 tablet_bootstrap.cc:542] T 
> fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Time 
> spent opening tablet: real 0.000s    user 0.001s     sys 0.000s
> I0930 09:20:45.025753 17173 tablet_bootstrap.cc:598] T 
> fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Will 
> attempt to recover log segment 
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246/wal-000000001
>  to 
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
> I0930 09:20:45.026008 17173 tablet_bootstrap.cc:606] T 
> fd6c30c11e8f488aa7eac04a76b5a246 P cfa92dcbfd8f41798ec6d6fce44e6c2e: Moving 
> log directory 
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246
>  to recovery directory 
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery
>  in preparation for log replay
> W0930 09:20:45.026758 17173 log_util.cc:475] Log segment file 
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
>  has 12 initial NULL bytes instead of magic and header length: 
> \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 and will be treated as a 
> blank segment.
> I0930 09:20:45.027081 17173 log_util.cc:312] Log segment 
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
>  has no footer. This segment was likely being written when the server 
> previously shut down.
> I0930 09:20:45.027308 17173 log_reader.cc:151] Log segment 
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
>  was likely left in-progress after a previous crash. Will try to rebuild 
> footer by scanning data.
> I0930 09:20:45.027604 17173 log_util.cc:575] Scanning 
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
>  for valid entry headers following offset 12...
> I0930 09:20:45.842016 17173 log_util.cc:612] Found no log entry headers
> I0930 09:20:45.873320 17173 log_util.cc:215] Ignoring log segment corruption 
> in 
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
>  because there are no log entries following the corrupted one. The server 
> probably crashed in the middle of writing an entry to the write-ahead log or 
> downloaded an active log via tablet copy. Error detail: Corruption: CRC 
> mismatch in log entry header: Log file corruption detected. Failed trying to 
> read batch #0 at offset 0 for log segment 
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001:
>  Prior entries:
> I0930 09:20:45.873550 17173 log_util.cc:365] Successfully rebuilt footer for 
> segment: 
> /data1/test-tmp/alter_table-randomized-test.AlterTableRandomized.TestRandomSequence.1475252278893135-22926/minicluster-data/ts-0/wals/fd6c30c11e8f488aa7eac04a76b5a246.recovery/wal-000000001
>  (valid entries through byte offset 0)
> F0930 09:20:45.873761 17173 log_util.h:208] Check failed: 
> header_.IsInitialized() 
> *** Check failure stack trace: ***
>     @     0x7fa678822e0d  google::LogMessage::SendToLog() at ??:0
>     @     0x7fa6788237e9  google::LogMessage::Flush() at ??:0
>     @     0x7fa67882722b  google::LogMessageFatal::~LogMessageFatal() at ??:0
>     @     0x7fa67f074331  kudu::log::ReadableLogSegment::header() at ??:0
>     @     0x7fa67d623e37  kudu::log::LogReader::Init() at ??:0
>     @     0x7fa67d624488  kudu::log::LogReader::OpenFromRecoveryDir() at ??:0
>     @     0x7fa67dc38dcc  
> kudu::tablet::TabletBootstrap::OpenLogReaderInRecoveryDir() at ??:0
>     @     0x7fa67dc371fa  kudu::tablet::TabletBootstrap::Bootstrap() at ??:0
>     @     0x7fa67dc36c6d  kudu::tablet::BootstrapTablet() at ??:0
>     @     0x7fa67f0a1d98  kudu::tserver::TSTabletManager::OpenTablet() at ??:0
>     @     0x7fa67f0abeb8  boost::_mfi::mf2<>::operator()() at ??:0
>     @     0x7fa67f0abe04  boost::_bi::list3<>::operator()<>() at ??:0
>     @     0x7fa67f0abd74  boost::_bi::bind_t<>::operator()() at ??:0
>     @     0x7fa67f0abb12  
> boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
>     @     0x7fa67ca1f432  boost::function0<>::operator()() at ??:0
>     @     0x7fa6790744fe  kudu::FunctionRunnable::Run() at ??:0
>     @     0x7fa6790732de  kudu::ThreadPool::DispatchThread() at ??:0
>     @     0x7fa679075b9f  boost::_mfi::mf1<>::operator()() at ??:0
>     @     0x7fa679075afc  boost::_bi::list2<>::operator()<>() at ??:0
>     @     0x7fa679075a74  boost::_bi::bind_t<>::operator()() at ??:0
>     @     0x7fa679075872  
> boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
>     @     0x7fa67ca1f432  boost::function0<>::operator()() at ??:0
>     @     0x7fa67906cc55  kudu::Thread::SuperviseThread() at ??:0
>     @           0x42506c  __tsan_thread_start_func at ??:0
>     @       0x34186079d1  (unknown) at ??:0
>     @       0x34182e88fd  (unknown) at ??:0
>     @              (nil)  (unknown)
> {code}
> This is my first time spelunking through this code, but it seems we have some 
> broken invariants in this case.  We have a {{ReadableLogSegment}} with an 
> uninitialized header, but with its {{is_initialized_}} flag set to true.  The 
> following comment from log_util.cc:388 leads me to believe this is not 
> expected:
> {code}
>     // If a log file has been pre-allocated but not initialized, then
>     // 'header_size' will be 0 even the file size is > 0; in this
>     // case, 'is_initialized_' remains set to false and return
>     // Status::OK() early. LogReader ignores segments where
>     // IsInitialized() returns false.
> {code}
> You can see from the warning log lines above that we are indeed in the case 
> where the magic bytes are all 0.  By reading through the code it's clear to 
> me _how_ this happened ({{is_initalized_}} gets set at log_util.cc:321), but 
> I'm not quite familiar with the code enough to figure out how to fix it, or 
> what the relevant tests should be.  The stack at the point where this 
> invariant is broken is (growing down):
> {code}
> ReadableLogSegment::Init() // No arg version, eventually sets 
> is_initialized_=true since none of the initialization steps fail
> ReadableLogSegment::ReadHeader // Returns Status::OK back to Init, because 
> header_size==0
> ReadableLogSegment::ReadHeaderMagicAndHeaderLength // Returns Status::OK
> ReadableLogSegment::ParseHeaderMagicAndHeaderLength // Returns Status::OK, 
> because magic bytes are all 0, sets parsed_len=0 (parsed_len is the same as 
> header_size in the ReadHeader frame)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to