[ 
https://issues.apache.org/jira/browse/KUDU-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700841#comment-16700841
 ] 

Will Berkeley commented on KUDU-2626:
-------------------------------------

[~zzjj] Thanks for the report. The error comes from the function below:
{code}
  // Compute the lower and upper bounds to the 0-1 knapsack problem with the elements
  // added so far.
  std::pair<double, double> ComputeLowerAndUpperBound() const {
    int excess_weight = total_weight_ - max_weight_;
    if (excess_weight <= 0) {
      // If we've added less than the budget, our "bounds" are just including
      // all of the items.
      return { total_value_, total_value_ };
    }

    const RowSetInfo& top = *fractional_solution_.front();

    // The lower bound is either of:
    // - the highest density N items such that they all fit, OR
    // - the N+1th item, if it fits
    // This is a 2-approximation (i.e. no worse than 1/2 of the best solution).
    // See https://courses.engr.illinois.edu/cs598csc/sp2009/lectures/lecture_4.pdf
    double lower_bound = std::max(total_value_ - top.width(), top.width());

    // An upper bound for the integer problem is the solution to the fractional problem:
    // in the fractional problem we can add just a portion of the top element. The
    // portion to remove is determined by the amount of excess weight:
    //
    //   fraction_to_remove = excess_weight / top.size_mb();
    //   portion_to_remove = fraction_to_remove * top.width()
    //
    // To avoid the division, we can just use the fact that density = width/size:
    double portion_of_top_to_remove = static_cast<double>(excess_weight) * top.density();
    DCHECK_GT(portion_of_top_to_remove, 0);
    double upper_bound = total_value_ - portion_of_top_to_remove;

    return {lower_bound, upper_bound};
  }
{code}
{{portion_of_top_to_remove}} is 0 if and only if {{top.density()}} is zero,
because the conditional on {{excess_weight}} at the top of the function (L191)
guarantees that {{excess_weight}} is positive by this point. The density is
zero if the width of the rowset is zero, which I think happens if and only if
the rowset consists of a single key (so its min and max keys are the same). The
following test case in compaction_policy-test.cc reproduces the crash, and its
comments explain a little more (the diff applies to the master branch):
{noformat}
diff --git a/src/kudu/tablet/compaction_policy-test.cc b/src/kudu/tablet/compaction_policy-test.cc
index ce99b5b86..cb11bcd13 100644
--- a/src/kudu/tablet/compaction_policy-test.cc
+++ b/src/kudu/tablet/compaction_policy-test.cc
@@ -434,6 +434,31 @@ TEST_F(TestCompactionPolicy, TestHeightBasedDominatesSizeBased) {
   }
 }

+// KUDU-2626: Test for a case where a rowset has zero width (like if it contains
+// a single key) and it is part of a proposed compaction selection that is over
+// budget. Since it's the least dense rowset in the selection, it will be
+// picked to be removed from the fractional knapsack solution, and this will
+// trigger a CHECK failure.
+TEST_F(TestCompactionPolicy, KUDU2626) {
+  // KUDU-1400's adjustments mean that a rowset should have a positive density
+  // unless it is exactly the target size. Since this is very unlikely to happen
+  // for a rowset consisting of a single row, KUDU-1400 basically
+  // unintentionally fixes this issue.
+  FLAGS_compaction_small_rowset_tradeoff = 0.0;
+
+  const RowSetVector rowsets = {
+    std::make_shared<MockDiskRowSet>("C", "c"),
+    std::make_shared<MockDiskRowSet>("B", "a"),
+    std::make_shared<MockDiskRowSet>("c", "c"),
+    std::make_shared<MockDiskRowSet>("C", "c"),
+  };
+
+  constexpr auto kBudgetMb = 3; // Enough to pick only 3 of 4.
+  CompactionSelection picked;
+  double quality = 0.0;
+  // Crashes with: "Check failed: portion_of_top_to_remove > 0 (0 vs. 0)"
+  NO_FATALS(RunTestCase(rowsets, kBudgetMb, &picked, &quality));
+}
+
{noformat}
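To make the arithmetic concrete, here is a rough standalone sketch of the over-budget branch. It is not Kudu code: {{FakeRowSet}} and {{PortionOfTopToRemove}} are made-up names, and the density is assumed to be plain width/size, as in the comment in the function above (i.e. without KUDU-1400's adjustment).

```cpp
#include <cassert>

// Hypothetical stand-in for RowSetInfo, holding only what the bound
// computation reads. This is not Kudu's real class.
struct FakeRowSet {
  double width;  // 0 when the min and max keys are equal
  int size_mb;   // weight of the rowset in the knapsack
  // Assumption: plain width/size, with no small-rowset adjustment.
  double density() const { return width / size_mb; }
};

// Mirrors the computation of portion_of_top_to_remove in the quoted
// function and returns the value that would be handed to DCHECK_GT.
double PortionOfTopToRemove(const FakeRowSet& top, int total_weight,
                            int max_weight) {
  const int excess_weight = total_weight - max_weight;
  // The early return in the real function covers excess_weight <= 0,
  // so only the over-budget case is modeled here.
  assert(excess_weight > 0);
  return static_cast<double>(excess_weight) * top.density();
}
```

With a positive excess, a top rowset of width 0 yields exactly 0 here, which is the value {{DCHECK_GT(portion_of_top_to_remove, 0)}} rejects; any positive width yields a positive result.
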
If this is only affecting one tablet server, probably the easiest thing to do
is to wipe out that tablet server completely, start a fresh one on the same
host, and use the rebalancer to rebalance the cluster after it re-replicates.
It is possible to surgically remove the width-zero, single-key rowset that's
causing this, but that requires modifying Kudu's files on disk directly. Also,
since the tablet server is already down, its replicas should have been
re-replicated elsewhere already, so bringing the server back up would just
cause it to delete all its data.

You can try to confirm the existence of the width zero rowset (rowsets?) using 
the following command:

{code:bash}
sudo -u kudu kudu fs list -fs_wal_dir=<the wal dir> -fs_data_dirs=<the data dirs> \
  -columns=tablet_id,column,cfile_min_key,cfile_max_key,rowset_id -format=csv 2> /dev/null \
  | awk -F, '{ if ($3 != "" && $4 != "" && $3 == $4) print $0 }'
{code}

On a test cluster where I forced Kudu to flush a single 1-row rowset with the
single key {{0}}, the output looks like:

{{28918a9441e541dd87e79de963dca46e,key,(0),(0),0}}

YMMV since I only tested with a simple 1-column primary key.
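
For anyone who wants to post-process the CSV somewhere awk isn't handy, the filter in the command above is roughly the following predicate. This is only a sketch: {{SplitOnCommas}} and {{IsSingleKeyRow}} are made-up helper names, and it compares the fields as strings, whereas awk may compare numeric-looking fields numerically. Like {{awk -F,}}, it does no CSV quoting, so keys that themselves contain commas will be split incorrectly.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits one line on commas, with no handling of quoted fields.
std::vector<std::string> SplitOnCommas(const std::string& line) {
  std::vector<std::string> fields;
  std::string field;
  std::istringstream in(line);
  while (std::getline(in, field, ',')) {
    fields.push_back(field);
  }
  return fields;
}

// True if the third and fourth columns (cfile_min_key and cfile_max_key
// in the column order used above) are non-empty and equal.
bool IsSingleKeyRow(const std::string& line) {
  const std::vector<std::string> f = SplitOnCommas(line);
  return f.size() >= 4 && !f[2].empty() && f[2] == f[3];
}
```

The sample output line above satisfies this predicate because both key columns are {{(0)}}; a row whose min and max keys differ, or whose key columns are empty, does not.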

> kudu crash when using impala  insert 
> -------------------------------------
>
>                 Key: KUDU-2626
>                 URL: https://issues.apache.org/jira/browse/KUDU-2626
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>         Environment: centos 6.4    
> kudu-1.8.0 release 
>            Reporter: yyzzjj
>            Assignee: Will Berkeley
>            Priority: Critical
>         Attachments: image-2018-11-21-11-08-33-019.png, 
> image-2018-11-21-11-09-59-088.png
>
>
> error msg like : !image-2018-11-21-11-09-59-088.png!  
> coredump stack : !image-2018-11-21-11-08-33-019.png! 
> can't restart kudu-tserver  with the same msg   F1121 10:48:33.988695 25505 
> compaction_policy.cc:215] Check failed: portion_of_top_to_remove > 0 (0 vs. 
> 0) 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
