[ https://issues.apache.org/jira/browse/KUDU-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700841#comment-16700841 ]
Will Berkeley commented on KUDU-2626:
-------------------------------------

[~zzjj] Thanks for the report. The error comes from the function below:

{code}
  // Compute the lower and upper bounds to the 0-1 knapsack problem with the elements
  // added so far.
  std::pair<double, double> ComputeLowerAndUpperBound() const {
    int excess_weight = total_weight_ - max_weight_;
    if (excess_weight <= 0) {
      // If we've added less than the budget, our "bounds" are just including
      // all of the items.
      return { total_value_, total_value_ };
    }

    const RowSetInfo& top = *fractional_solution_.front();

    // The lower bound is either of:
    // - the highest density N items such that they all fit, OR
    // - the N+1th item, if it fits
    // This is a 2-approximation (i.e. no worse than 1/2 of the best solution).
    // See https://courses.engr.illinois.edu/cs598csc/sp2009/lectures/lecture_4.pdf
    double lower_bound = std::max(total_value_ - top.width(), top.width());

    // An upper bound for the integer problem is the solution to the fractional problem:
    // in the fractional problem we can add just a portion of the top element. The
    // portion to remove is determined by the amount of excess weight:
    //
    //   fraction_to_remove = excess_weight / top.size_mb();
    //   portion_to_remove = fraction_to_remove * top.width()
    //
    // To avoid the division, we can just use the fact that density = width/size:
    double portion_of_top_to_remove = static_cast<double>(excess_weight) * top.density();
    DCHECK_GT(portion_of_top_to_remove, 0);
    double upper_bound = total_value_ - portion_of_top_to_remove;

    return {lower_bound, upper_bound};
  }
{code}

{{portion_of_top_to_remove}} is 0 if and only if {{top.density()}} is zero, because of the conditional on {{excess_weight}} at the top of the function (L191). The density is zero if the width of the rowset is zero, which I think should happen if and only if the rowset consists of a single key (so its min and max key are the same).
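To make the arithmetic concrete, here is a minimal standalone sketch of the same computation. It is not the Kudu code: {{MockRowSet}}, {{PortionOfTopToRemove}} and {{ComputeBounds}} are hypothetical stand-ins that keep only the fields the bound computation reads, and the sketch simply omits the {{DCHECK_GT}} so the zero value can be observed. That omission is an assumption made for illustration, not necessarily how the issue should be fixed.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical stand-in for RowSetInfo; only what the bound computation needs.
struct MockRowSet {
  double width;  // 0 when min key == max key (single-key rowset)
  int size_mb;   // on-disk size in MB, assumed positive
  double density() const { return width / size_mb; }
};

// Mirrors the "portion_of_top_to_remove" expression from the function above.
double PortionOfTopToRemove(int excess_weight, const MockRowSet& top) {
  assert(top.size_mb > 0);
  return static_cast<double>(excess_weight) * top.density();
}

// Simplified model of ComputeLowerAndUpperBound() taking its state as
// arguments. Returns {lower_bound, upper_bound}.
std::pair<double, double> ComputeBounds(double total_value, int total_weight,
                                        int max_weight, const MockRowSet& top) {
  int excess_weight = total_weight - max_weight;
  if (excess_weight <= 0) {
    // Under budget: both bounds are the value of everything added so far.
    return {total_value, total_value};
  }
  double lower_bound = std::max(total_value - top.width, top.width);
  // Assumption for illustration: no check that the portion is strictly
  // positive, so a zero-width top element yields upper == total_value.
  double upper_bound = total_value - PortionOfTopToRemove(excess_weight, top);
  return {lower_bound, upper_bound};
}
```

With a zero-width top element and positive excess weight, the portion to remove is exactly 0, which is the value the {{DCHECK_GT}} rejects. Any positive width gives a positive portion.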
The following test case in compaction_policy-test.cc reproduces the crash and the comments explain a little more (the diff applies to the master branch):

{noformat}
diff --git a/src/kudu/tablet/compaction_policy-test.cc b/src/kudu/tablet/compaction_policy-test.cc
index ce99b5b86..cb11bcd13 100644
--- a/src/kudu/tablet/compaction_policy-test.cc
+++ b/src/kudu/tablet/compaction_policy-test.cc
@@ -434,6 +434,31 @@ TEST_F(TestCompactionPolicy, TestHeightBasedDominatesSizeBased) {
   }
 }

+// KUDU-2626: Test for a case where a rowset has zero width (like if it contains
+// a single key) and it is part of a proposed compaction selection that is over
+// budget. Since it's the least dense rowset in the selection, it will be
+// picked to be removed from the fractional knapsack solution, and this will
+// trigger a CHECK failure.
+TEST_F(TestCompactionPolicy, KUDU2626) {
+  // KUDU-1400's adjustments mean that a rowset should have a positive density
+  // unless it is exactly the target size. Since this is very unlikely to happen
+  // for a rowset consisting of a single row, KUDU-1400 basically
+  // unintentionally fixes this issue.
+  FLAGS_compaction_small_rowset_tradeoff = 0.0;
+
+  const RowSetVector rowsets = {
+    std::make_shared<MockDiskRowSet>("C", "c"),
+    std::make_shared<MockDiskRowSet>("B", "a"),
+    std::make_shared<MockDiskRowSet>("c", "c"),
+    std::make_shared<MockDiskRowSet>("C", "c"),
+  };
+
+  constexpr auto kBudgetMb = 3; // Enough to pick only 3 of 4.
+  CompactionSelection picked;
+  double quality = 0.0;
+  // Crashes with: "Check failed: portion_of_top_to_remove > 0 (0 vs. 0)"
+  NO_FATALS(RunTestCase(rowsets, kBudgetMb, &picked, &quality));
+}
+
{noformat}

If this is only affecting one tablet server, probably the easiest thing to do is to wipe out that tablet server completely, start a fresh one on the same host, and use the rebalancer to rebalance the cluster after it re-replicates.
It is possible to surgically remove the width-zero, single-key rowset that's causing this, but it requires messing with Kudu's files on disk directly. Since the tablet server is already down, its replicas should have been re-replicated already anyway, so bringing the server back up will just cause it to delete all its data.

You can try to confirm the existence of the width-zero rowset (rowsets?) using the following command:

{code:bash}
sudo -u kudu kudu fs list -fs_wal_dir=<the wal dir> -fs_data_dirs=<the data dirs> -columns=tablet_id,column,cfile_min_key,cfile_max_key,rowset_id -format=csv 2> /dev/null | awk -F, '{ if ($3 != "" && $4 != "" && $3 == $4) print $0 }'
{code}

On a test cluster where I forced Kudu to flush a single 1-row rowset with single key {{0}} the output looks like

{{28918a9441e541dd87e79de963dca46e,key,(0),(0),0}}

YMMV since I only tested with a simple 1-column primary key.

> kudu crash when using impala insert
> -------------------------------------
>
> Key: KUDU-2626
> URL: https://issues.apache.org/jira/browse/KUDU-2626
> Project: Kudu
> Issue Type: Bug
> Affects Versions: 1.8.0
> Environment: centos 6.4
> kudu-1.8.0 release
> Reporter: yyzzjj
> Assignee: Will Berkeley
> Priority: Critical
> Attachments: image-2018-11-21-11-08-33-019.png, image-2018-11-21-11-09-59-088.png
>
>
> error msg like : !image-2018-11-21-11-09-59-088.png!
> coredump stack : !image-2018-11-21-11-08-33-019.png!
> can't restart kudu-tserver with the same msg F1121 10:48:33.988695 25505 compaction_policy.cc:215] Check failed: portion_of_top_to_remove > 0 (0 vs. 0)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)