[
https://issues.apache.org/jira/browse/KUDU-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Wong updated KUDU-3151:
------------------------------
Summary: segfault when repairing log block container with a missing LBM
container data file (was: segfault when starting repairing log block container
with a missing LBM container data file)
> segfault when repairing log block container with a missing LBM container data
> file
> ----------------------------------------------------------------------------------
>
> Key: KUDU-3151
> URL: https://issues.apache.org/jira/browse/KUDU-3151
> Project: Kudu
> Issue Type: Bug
> Components: fs
> Reporter: Andrew Wong
> Priority: Critical
> Attachments: metadump.txt
>
>
> We upgraded a cluster from 1.7 to 1.12 and saw the following segfault on one
> node:
> {code:java}
> *** SIGSEGV (@0x20f2008) received by PID 35899 (TID 0x7ff7e40cc700) from PID
> 34545672; stack trace: ***
> @ 0x7ff7f2a395d0 (unknown)
> @ 0x9fe02e std::_Sp_counted_base<>::_M_release()
> @ 0x2049f77 kudu::fs::LogBlockManager::Repair()
> @ 0x204ae45 kudu::fs::LogBlockManager::RepairTask()
> @ 0x228e67e kudu::ThreadPool::DispatchThread()
> @ 0x228778f kudu::Thread::SuperviseThread()
> @ 0x7ff7f2a31dd5 start_thread
> @ 0x7ff7f0d0902d __clone
> {code}
> When running {{kudu fs check}} we saw the following logs:
> {code:java}
> I0617 09:17:37.681373 147811 fs_manager.cc:433] Time spent opening block
> manager: real 10.871s user 0.215s sys 0.162s
> Not found: Could not open container 74e7b95f8ccb4c7b98e52dc48049e967:
> /data/5/kudu/tablet/data/data/74e7b95f8ccb4c7b98e52dc48049e967.data: No such
> file or directory (error 2)
> {code}
> and upon inspecting the files, we found 74e7b95f8ccb4c7b98e52dc48049e967.data
> was indeed missing, while the metadata file
> 74e7b95f8ccb4c7b98e52dc48049e967.metadata was present and non-empty (it
> recorded more creates than deletes; see the attached metadump.txt).
> We were able to delete the metadata file, and I don't think we saw any failed
> tablets upon doing so (failures could surface if a tablet were unable to find
> some necessary blocks at startup, e.g. PK blocks when reading min/max keys).
> It's possible the metadata was left over from an LBM compaction, but the
> exact cause isn't clear so far. It's also unclear whether the data file went
> missing before or after the upgrade, as we didn't run a {{kudu fs check}}
> before upgrading.
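The condition described above (a container's .metadata file present while its matching .data file is missing) can be spotted ahead of an upgrade with a small read-only scan. The following is only a minimal sketch, not part of Kudu: the helper name find_orphaned_metadata is hypothetical, and it assumes that each container in a data directory is stored as a <container id>.data and <container id>.metadata pair, as the file names in this report suggest. It does not replace {{kudu fs check}} and it never deletes anything.

```python
import os
import tempfile


def find_orphaned_metadata(data_dir):
    """Return sorted container IDs that have a .metadata file but no .data file.

    Hypothetical helper: assumes containers are stored as
    <id>.data / <id>.metadata pairs directly inside data_dir.
    """
    names = set(os.listdir(data_dir))
    orphans = []
    for name in names:
        base, ext = os.path.splitext(name)
        # A metadata file is "orphaned" when its sibling data file is absent.
        if ext == ".metadata" and base + ".data" not in names:
            orphans.append(base)
    return sorted(orphans)


# Demo on a throwaway directory; the container IDs below are made up.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("aaaa.data", "aaaa.metadata", "bbbb.metadata"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_orphans = find_orphaned_metadata(demo_dir)
    print(demo_orphans)  # prints ['bbbb']
```

Pointing the helper at a directory such as the one in the log above would list the container IDs worth inspecting by hand. Whether removing an orphaned metadata file is safe remains a separate decision, as discussed in the report.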