Will Berkeley created KUDU-2376:
-----------------------------------
Summary: SIGSEGV while adding and dropping the same range partition and concurrently writing
Key: KUDU-2376
URL: https://issues.apache.org/jira/browse/KUDU-2376
Project: Kudu
Issue Type: Bug
Affects Versions: 1.7.0
Reporter: Will Berkeley
Attachments: alter_table-test.patch
While adding a test to https://gerrit.cloudera.org/#/c/9393/, I ran into the
problem that writing while doing a replace tablet operation caused the client
to segfault. After inspecting the client code, it looked like the same problem
could occur if the same range partition was added and dropped with concurrent
writes.
Attached is a patch that adds a test to alter_table-test that reliably
reproduces the segmentation fault.
I don't totally understand what's happening, but here's what I think I have
figured out:
Suppose the range partition P=[0, 100) is dropped and re-added in a single
alter while a batch is outstanding. This causes the tablet X for hash bucket 0
and range partition P to be dropped, and a new tablet Y to be created for the
same partition.
There is a batch pending to X which the client attempts to send to each of the
replicas of X in turn. Once the replicas are exhausted, the client attempts to
find a new leader with MetaCacheServerPicker::PickLeader, which triggers a
master lookup to get the latest consensus info for X (#5 in the big comment in
PickLeader). This calls LookupTabletByKey, which attempts a fast path lookup.
Assuming other metadata operations have already cached a tablet entry for Y,
the entry for X will have been removed from the by-table-and-by-key map, and
the fast-path lookup will return the entry for Y. The client code can't tell
the difference because the code paths only compare partition boundaries, which
match for X and Y. So the master lookup never actually happens, and the client
ends up in a pretty tight loop repeating the above process until it segfaults.
I'm not sure exactly what causes the segmentation fault. I looked at it a bit
in gdb: the segfault was a few calls deep into STL maps in release mode, and
inside a refcount increment in debug mode. I'll try to attach some gdb output
showing this later.
The problem is also hinted at in a TODO in PickLeader:
{noformat}
// TODO: When we support tablet splits, we should let the lookup shift
// the write to another tablet (i.e. if it's since been split).
{noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)