[jira] [Commented] (KUDU-1692) Deleting large tablets causes a lot of tcmalloc contention

Todd Lipcon (JIRA) Mon, 10 Oct 2016 17:13:44 -0700

    [ https://issues.apache.org/jira/browse/KUDU-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563979#comment-15563979 ]


Todd Lipcon commented on KUDU-1692:
-----------------------------------

The deletion seems to hold some tcmalloc locks: after it, I saw many log 
messages like:

{code}
W1010 17:01:42.257256  4710 kernel_stack_watchdog.cc:144] Thread 5078 stuck at 
../../src/kudu/rpc/outbound_call.cc:185 for 154ms:
Kernel stack:
[<ffffffff810a40c4>] hrtimer_nanosleep+0xc4/0x180
[<ffffffff810a41ee>] sys_nanosleep+0x6e/0x80
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

User stack:
    @           0x8655e8  base::internal::SpinLockDelay()
    @           0x854ac7  SpinLock::SlowLock()
    @           0x855777  tcmalloc::ThreadCache::IncreaseCacheLimit()
    @          0x1b936d6  operator delete()
{code}

as well as many long reactor freezes (8+ seconds):
{code}
W1010 17:02:58.113458  5078 connection.cc:199] RPC call timeout handler was 
delayed by 8.05891s! This may be due to a process-wide pause such as swapping, 
logging-related delays, or allocator lock contention. Will allow an additional 
1.5608s for a response.
{code}
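The second warning comes from a timeout handler noticing that it ran much later than it was scheduled to. Below is a minimal sketch of that kind of check, assuming the handler simply compares its scheduled and actual run times. The names `CheckHandlerDelay` and `TimeoutCheck` and the pause threshold are hypothetical and are not Kudu's connection.cc code; the extra grace period mentioned in the log message is not modeled.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

// Hypothetical result type for the check below.
struct TimeoutCheck {
  double delay_s;      // how late the handler ran, in seconds (never negative)
  bool process_pause;  // true if the delay exceeded pause_threshold
};

// Hypothetical sketch, not Kudu's connection.cc: compare when the handler was
// scheduled to run with when it actually ran. A large gap points to a stall on
// this side (the log suggests swapping, logging, or allocator lock contention)
// rather than a slow peer, so emit a warning.
TimeoutCheck CheckHandlerDelay(Clock::time_point scheduled,
                               Clock::time_point actual,
                               std::chrono::milliseconds pause_threshold) {
  assert(pause_threshold.count() >= 0);
  std::chrono::duration<double> late = actual - scheduled;
  TimeoutCheck result{late.count() > 0 ? late.count() : 0.0, false};
  if (late > pause_threshold) {
    result.process_pause = true;
    std::fprintf(stderr, "RPC call timeout handler was delayed by %gs!\n",
                 result.delay_s);
  }
  return result;
}
```

With an assumed 1s threshold, a handler that runs 8.059s late (roughly the stall in the log above) is flagged as a pause, while one that runs 5ms late is not.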

The pauses were severe enough to cause many leader elections, write timeouts, 
and similar failures. This lasted about a minute and a half before the servers 
stabilized.
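The kernel_stack_watchdog.cc warning quoted earlier reports a thread that stayed in a watched section longer than expected (154ms there). A simplified, hypothetical sketch of such a watchdog follows, assuming threads record when they enter a watched section and a checker lists the threads that are past a threshold. `StuckThreadWatchdog` and its methods are illustrative names, not Kudu's API; the real watchdog also captures kernel and user stacks, which this sketch does not attempt.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical, simplified watchdog; not Kudu's kernel_stack_watchdog.cc.
class StuckThreadWatchdog {
 public:
  explicit StuckThreadWatchdog(std::chrono::milliseconds threshold)
      : threshold_(threshold) {
    assert(threshold.count() > 0);
  }

  // Called by a worker thread when it enters a watched section.
  void Enter(int64_t tid, Clock::time_point now) {
    std::lock_guard<std::mutex> l(mu_);
    entered_[tid] = now;
  }

  // Called by the worker thread when it leaves the watched section.
  void Leave(int64_t tid) {
    std::lock_guard<std::mutex> l(mu_);
    entered_.erase(tid);
  }

  // Called periodically by the watchdog thread: returns the IDs of threads
  // that have been inside a watched section for longer than the threshold.
  std::vector<int64_t> FindStuck(Clock::time_point now) const {
    std::lock_guard<std::mutex> l(mu_);
    std::vector<int64_t> stuck;
    for (const auto& [tid, start] : entered_) {
      if (now - start > threshold_) stuck.push_back(tid);
    }
    return stuck;
  }

 private:
  const std::chrono::milliseconds threshold_;
  mutable std::mutex mu_;
  std::map<int64_t, Clock::time_point> entered_;  // tid -> entry time
};
```

Passing the current time in explicitly (instead of reading the clock inside) keeps the sketch deterministic; a real checker would call `Clock::now()` on each pass.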

> Deleting large tablets causes a lot of tcmalloc contention
> ----------------------------------------------------------
>
>                 Key: KUDU-1692
>                 URL: https://issues.apache.org/jira/browse/KUDU-1692
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet, util
>    Affects Versions: 1.0.0
>            Reporter: Todd Lipcon
>
> I deleted a large table which contained about 1TB of data per tablet server. 
> The tablet servers then started spending a large amount of time in this stack:
> {code}
>                   855e94 tcmalloc::ThreadCache::GetThreadStats(unsigned long*, unsigned long*) (/opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/lib/kudu/sbin-release/kudu-tserver)
>                   84e9ba ExtractStats(TCMallocStats*, unsigned long*, tcmalloc::PageHeap::SmallSpanStats*, tcmalloc::PageHeap::LargeSpanStats*) (/opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/lib/kudu/sbin-releas
>                   850f8f TCMallocImplementation::GetNumericProperty(char const*, unsigned long*) (/opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/lib/kudu/sbin-release/kudu-tserver)
>                  1a18c50 kudu::GetTCMallocCurrentAllocatedBytes() (/opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/lib/kudu/sbin-release/kudu-tserver)
>                  1a19a50 kudu::MemTracker::UpdateConsumption() (/opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/lib/kudu/sbin-release/kudu-tserver)
>                   980f01 std::_Sp_counted_ptr<kudu::cfile::CFileReader*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (/opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/lib/kudu/sbin-release/kudu-tserver)
>                   99a937 kudu::tablet::CFileSet::~CFileSet() (/opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/lib/kudu/sbin-release/kudu-tserver)
>                   99ad61 kudu::tablet::CFileSet::~CFileSet() (/opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/lib/kudu/sbin-release/kudu-tserver)
>                   948b42 kudu::tablet::DiskRowSet::~DiskRowSet() (/opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/lib/kudu/sbin-release/kudu-tserver)
>                   965f35 kudu::tablet::RowSetTree::~RowSetTree() (/opt/cloudera/parcels/KUDU-1.0.0-1.kudu1.0.0.p0.6/lib/kudu/sbin-release/kudu-tserver)
> {code}
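The quoted stack suggests why deletion is slow: each CFileReader destruction calls MemTracker::UpdateConsumption(), which asks tcmalloc for its current allocated bytes, and ExtractStats() walks every thread cache to answer. Below is a toy model of that cost under those assumptions. `FakeAllocator`, `FakeMemTracker` and `FakeReader` are hypothetical stand-ins, not tcmalloc or Kudu classes, and a `std::mutex` substitutes for tcmalloc's internal spinlock.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Toy model, not tcmalloc: answering "current allocated bytes" takes one
// global lock and visits every per-thread cache, mirroring the quoted
// ExtractStats() -> ThreadCache::GetThreadStats() frames.
class FakeAllocator {
 public:
  explicit FakeAllocator(std::size_t num_thread_caches)
      : per_thread_bytes_(num_thread_caches, 0) {
    assert(num_thread_caches > 0);
  }

  int64_t CurrentAllocatedBytes() {
    std::lock_guard<std::mutex> l(global_lock_);  // serializes every caller
    ++lock_acquisitions_;
    int64_t total = 0;
    for (int64_t bytes : per_thread_bytes_) {  // O(number of thread caches)
      ++caches_visited_;
      total += bytes;
    }
    return total;
  }

  int64_t lock_acquisitions() const {
    std::lock_guard<std::mutex> l(global_lock_);
    return lock_acquisitions_;
  }
  int64_t caches_visited() const {
    std::lock_guard<std::mutex> l(global_lock_);
    return caches_visited_;
  }

 private:
  mutable std::mutex global_lock_;
  std::vector<int64_t> per_thread_bytes_;
  int64_t lock_acquisitions_ = 0;
  int64_t caches_visited_ = 0;
};

// Hypothetical stand-in for a memory tracker that re-reads the allocator's
// total on every update, like MemTracker::UpdateConsumption() in the stack.
struct FakeMemTracker {
  FakeAllocator* alloc;
  int64_t consumption = 0;
  void UpdateConsumption() { consumption = alloc->CurrentAllocatedBytes(); }
};

// Hypothetical stand-in for CFileReader: its destructor refreshes the
// tracker, so freeing N readers costs N global-lock acquisitions.
class FakeReader {
 public:
  explicit FakeReader(FakeMemTracker* tracker) : tracker_(tracker) {}
  FakeReader(const FakeReader&) = delete;
  FakeReader& operator=(const FakeReader&) = delete;
  ~FakeReader() { tracker_->UpdateConsumption(); }

 private:
  FakeMemTracker* tracker_;
};
```

In this model, destroying 1000 readers with 64 thread caches takes the global lock 1000 times and visits 64,000 caches. In the real process, other threads freeing memory also need allocator locks, which is consistent with the SpinLock::SlowLock() frames under operator delete() in the comment's stack.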



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
