[
https://issues.apache.org/jira/browse/KUDU-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923180#comment-15923180
]
Mike Percy commented on KUDU-1933:
----------------------------------
The relevant stack trace looks like this:
{noformat}
#6 0x000000000087153f in google::LogMessageFatal::~LogMessageFatal() ()
#7 0x0000000000ae1018 in kudu::log::LogIndex::GetChunkForIndex(long, bool, scoped_refptr<kudu::log::LogIndex::IndexChunk>*) ()
#8 0x0000000000ae118f in kudu::log::LogIndex::AddEntry(kudu::log::LogIndexEntry const&) ()
#9 0x0000000000ad5f47 in kudu::log::Log::UpdateIndexForBatch(kudu::log::LogEntryBatch const&, long) ()
#10 0x0000000000adbdc9 in kudu::log::Log::DoAppend(kudu::log::LogEntryBatch*) ()
#11 0x0000000000addba2 in kudu::log::Log::Append(kudu::log::LogEntryPB*) ()
#12 0x00000000009918b0 in kudu::tablet::TabletBootstrap::HandleReplicateMessage(kudu::tablet::ReplayState*, kudu::log::LogEntryPB*) ()
#13 0x0000000000995f09 in kudu::tablet::TabletBootstrap::HandleEntry(kudu::tablet::ReplayState*, kudu::log::LogEntryPB*) ()
#14 0x00000000009967f7 in kudu::tablet::TabletBootstrap::PlaySegments(kudu::consensus::ConsensusBootstrapInfo*) ()
#15 0x0000000000998a3b in kudu::tablet::TabletBootstrap::Bootstrap(std::shared_ptr<kudu::tablet::Tablet>*, scoped_refptr<kudu::log::Log>*, kudu::consensus::ConsensusBootstrapInfo*) ()
#16 0x0000000000999217 in kudu::tablet::BootstrapTablet(scoped_refptr<kudu::tablet::TabletMetadata> const&, scoped_refptr<kudu::server::Clock> const&, std::shared_ptr<kudu::MemTracker> const&, scoped_refptr<kudu::rpc::ResultTracker> const&, kudu::MetricRegistry*, kudu::tablet::TabletStatusListener*, std::shared_ptr<kudu::tablet::Tablet>*, scoped_refptr<kudu::log::Log>*, scoped_refptr<kudu::log::LogAnchorRegistry> const&, kudu::consensus::ConsensusBootstrapInfo*) ()
#17 0x0000000000827b26 in kudu::master::SysCatalogTable::SetupTablet(scoped_refptr<kudu::tablet::TabletMetadata> const&) ()
#18 0x000000000082a5da in kudu::master::SysCatalogTable::Load(kudu::FsManager*) ()
#19 0x000000000083df99 in kudu::master::CatalogManager::InitSysCatalogAsync(bool) ()
#20 0x000000000083f9e1 in kudu::master::CatalogManager::Init(bool) ()
#21 0x00000000008117e5 in kudu::master::Master::InitCatalogManager() ()
#22 0x00000000008118ba in kudu::master::Master::InitCatalogManagerTask() ()
{noformat}
> Master crashes after too many TS re-registrations
> -------------------------------------------------
>
> Key: KUDU-1933
> URL: https://issues.apache.org/jira/browse/KUDU-1933
> Project: Kudu
> Issue Type: Bug
> Components: master
> Affects Versions: 1.3.0
> Reporter: Jean-Daniel Cryans
>
> I had a cluster with mismatched versions within the 1.3 release (something
> no one would see using released versions) and ended up with the tablet
> servers constantly retrying registration with the master. After a few days of
> this, the master died this way:
> {noformat}
> I0308 00:25:47.038650 7619 ts_descriptor.cc:125] Processing retry of TS registration from permanent_uuid: "d8009e07d82b4e66a7ab50f85e60bc30" instance_seqno: 1487888450146835
> I0308 00:25:47.038702 7619 ts_manager.cc:84] Re-registered known tserver with Master: d8009e07d82b4e66a7ab50f85e60bc30 (ve0136.halxg.cloudera.com:7050)
> I0308 00:25:47.043874 7616 ts_descriptor.cc:125] Processing retry of TS registration from permanent_uuid: "335d132897de4bdb9b87443f2c487a42" instance_seqno: 1487888474889244
> I0308 00:25:47.043912 7616 ts_manager.cc:84] Re-registered known tserver with Master: 335d132897de4bdb9b87443f2c487a42 (ve0126.halxg.cloudera.com:7050)
> I0308 00:25:47.108677 7617 ts_descriptor.cc:125] Processing retry of TS registration from permanent_uuid: "7425c65d80f54f2da0a85494a5eb3e68" instance_seqno: 1487888491433564
> I0308 00:25:47.108719 7617 ts_manager.cc:84] Re-registered known tserver with Master: 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050)
> I0308 00:25:47.111563 7611 ts_descriptor.cc:125] Processing retry of TS registration from permanent_uuid: "c108a85a68504c2bb9f49e4ee683d981" instance_seqno: 1487888392795318
> I0308 00:25:47.111604 7611 ts_manager.cc:84] Re-registered known tserver with Master: c108a85a68504c2bb9f49e4ee683d981 (ve0128.halxg.cloudera.com:7050)
> F0308 00:25:53.568773 7655 log_index.cc:171] Check failed: log_index > 0 (-2147483648 vs. 0)
> {noformat}
> Ideally the master shouldn't crash, but it also sounds like we're not
> handling log_index overflows.
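
The failed check reports -2147483648, which is the minimum value of a signed 32-bit integer. As an illustration only, and not a confirmed diagnosis of this crash, the sketch below shows one way such a value can arise: a 64-bit index that has just passed 2147483647 is stored in a 32-bit variable. NarrowIndex and FitsInInt32 are hypothetical helpers written for this example; they are not Kudu functions.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper, not Kudu code: stores a 64-bit index in 32 bits.
// For out-of-range inputs the result is implementation-defined before
// C++20 and wraps modulo 2^32 from C++20 on; mainstream compilers wrap
// in both cases.
inline int32_t NarrowIndex(int64_t index) {
  return static_cast<int32_t>(index);
}

// Hypothetical guard: true if the index can be stored in 32 bits
// without changing its value.
inline bool FitsInInt32(int64_t index) {
  return index >= std::numeric_limits<int32_t>::min() &&
         index <= std::numeric_limits<int32_t>::max();
}
```

With these definitions, NarrowIndex(2147483648) wraps to -2147483648 on such compilers, the same value shown in the "Check failed" line, while FitsInInt32(2147483648) returns false and could be used to reject the value before it is narrowed.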