I've managed to load about 2-3 TB into a Hypertable cluster with 15
RangeServers.  I'm still having problems with the RangeServers failing to
clean up their old logs.

Server              User commit log in MB
10.10.1.194_38060   51512.8
10.10.1.200_38060   48050.6
10.10.1.199_38060   12464.7
10.10.1.190_38060   90.8244
10.10.1.193_38060   86.9587
10.10.1.188_38060   70.4194
10.10.1.191_38060   58.9111
10.10.1.198_38060   48.7823
10.10.1.192_38060   41.6324
10.10.1.202_38060   34.0847
10.10.1.189_38060   28.5852
10.10.1.197_38060   22.3202
10.10.1.196_38060   17.3054
10.10.1.201_38060   14.2905
10.10.1.195_38060   0

No new updates have been added to this cluster in about 12 hours.  Most of
the log directories look fine but a few have gigabytes of logs that should
be cleaned up.

I've added some debug logging to try to isolate what's happening.  For each
call to CommitLog::purge(), I'm logging the cutoff timestamp used.  I've
also added messages that print each of the remaining log fragments when the
purge loop breaks out:

diff --git a/src/cc/Hypertable/Lib/CommitLog.cc b/src/cc/Hypertable/Lib/CommitLog.cc
index a6b8920..ea0f894 100644
--- a/src/cc/Hypertable/Lib/CommitLog.cc
+++ b/src/cc/Hypertable/Lib/CommitLog.cc
@@ -232,6 +232,8 @@ int CommitLog::purge(uint64_t timestamp) {
   CommitLogFileInfo file_info;
   String fname;
 
+  HT_DEBUGF("purge: %llu", timestamp);
+
   try {
 
     while (!m_fragment_queue.empty()) {
@@ -243,7 +245,16 @@ int CommitLog::purge(uint64_t timestamp) {
         HT_INFOF("Removed log fragment file='%s' timestamp=%lld", fname.c_str(), file_info.timestamp);
       }
       else {
-        //HT_INFOF("LOG FRAGMENT PURGE breaking because %lld >= %lld", file_info.timestamp, timestamp);
+        HT_DEBUGF("LOG FRAGMENT PURGE breaking because %lld >= %lld, queue size = %llu",
+                  file_info.timestamp, timestamp, m_fragment_queue.size());
+        size_t i = 0;
+        foreach(CommitLogFileInfo const & info, m_fragment_queue)
+        {
+            ++i;
+            HT_DEBUGF("  Fragment %llu: time=%lld size=%llu name=%s%u",
+                      i, info.timestamp, info.size, info.log_dir.c_str(),
+                      info.num);
+        }
         break;
       }
     }

What I find is that the purge() function is still being called regularly,
but the timestamp that gets passed in gets stuck at a value earlier than the
oldest fragment's timestamp and never changes in later calls.

Sometimes the timestamp threshold is just a few minutes earlier than the
oldest fragment's timestamp.  Sometimes the passed-in threshold value is 1,
which seems like a suspicious number.  Here are the first few lines of a
recent purge from the 52 GB server:

1221520163 DEBUG Hypertable.RangeServer :
(/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:235) purge: 1
1221520163 DEBUG Hypertable.RangeServer :
(/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:249) LOG FRAGMENT
PURGE breaking because 1221271063903383000 >= 1, queue size = 831
1221520163 DEBUG Hypertable.RangeServer :
(/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 1:
time=1221271063903383000 size=67436344
name=/hypertable/servers/10.10.1.194_38060/log/user/3
1221520163 DEBUG Hypertable.RangeServer :
(/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 2:
time=1221271046265865000 size=896929
name=/hypertable/servers/10.10.1.193_38060/log/9F4D8D5249B24A2C52F06D2B/0
1221520163 DEBUG Hypertable.RangeServer :
(/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 3:
time=1221271374609078000 size=100133731
name=/hypertable/servers/10.10.1.194_38060/log/user/4

Yes, the log is currently holding 831 fragments.  I also see that
compactions are still happening on this server, but they start and finish
within the same second.

Here's a purge from the 48 GB server:

1221520459 DEBUG Hypertable.RangeServer :
(/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:235) purge:
1221327270724907001
1221520459 DEBUG Hypertable.RangeServer :
(/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:249) LOG FRAGMENT
PURGE breaking because 1221327332648604000 >= 1221327270724907001, queue
size = 622
1221520459 DEBUG Hypertable.RangeServer :
(/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 1:
time=1221327332648604000 size=100006176
name=/hypertable/servers/10.10.1.200_38060/log/user/916
1221520459 DEBUG Hypertable.RangeServer :
(/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 2:
time=1221327395726285000 size=100147244
name=/hypertable/servers/10.10.1.200_38060/log/user/917
1221520459 DEBUG Hypertable.RangeServer :
(/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 3:
time=1221327503701781000 size=100179860
name=/hypertable/servers/10.10.1.200_38060/log/user/918

This one has 622 fragments.  The last logged compaction on this server was
25 hours ago, though it apparently did some non-trivial work: at least,
there is some time between its begin and end compaction log messages.

The log fragments appear to be in increasing time order, though a lot of the
fragments seem to be mostly empty.  For example, on the 70 MB server, the
first fragment contains 70 MB of data and is followed by 353 fragments with
size=101.

I've looked around a bit at where these purge thresholds come from.  When
the purge starts, the code gets the oldest cached timestamp from each access
group and then initiates a purge of logs older than that.  This timestamp
seems to be the thing that's getting stuck (or set to 1), so I tried to find
out where it gets set.  I think it's set in each access group after a
compaction, to the timestamp of the compaction request (plus 1).  Where the
compaction timestamp comes from is a bit of a mystery to me.  The range
seems to schedule the compactions from either m_state.timestamp or
m_timestamp, depending on whether the compaction is initiated in
Range::run_compaction or Range::split_compact_and_shrink.  Also, I don't
understand what's going on with Timestamp.real vs. Timestamp.logical.  The
flow of this code is just a bit hard for me to track.  ;)
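
To make that chain a bit more concrete, here is a tiny sketch of the flow as
I currently understand it.  These are not the real Hypertable types or method
names, just my mental model of what I described above:

// Mental model only -- the names below are my paraphrase, not the actual
// Hypertable classes or methods.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Each access group remembers an "oldest cached timestamp"; as I read it,
// it gets bumped after a compaction to the compaction request's timestamp + 1.
struct AccessGroupModel {
  uint64_t oldest_cached_timestamp;
};

// The log-cleanup pass presumably takes the minimum of those timestamps
// across access groups and hands it to CommitLog::purge() as the cutoff.
uint64_t compute_purge_threshold(const std::vector<AccessGroupModel> &groups) {
  uint64_t threshold = UINT64_MAX;
  for (const AccessGroupModel &ag : groups)
    threshold = std::min(threshold, ag.oldest_cached_timestamp);
  return threshold;
}

int main() {
  // Hypothetical values in nanoseconds, shaped like the ones in the logs above.
  std::vector<AccessGroupModel> groups = {
    {1221327270724907001ULL},  // compacted recently: request timestamp + 1
    {1ULL}                     // never advanced past 1 -- would pin every fragment
  };
  // purge() walks the fragment queue in timestamp order and breaks at the first
  // fragment that isn't older than this cutoff, so one stuck group stalls cleanup.
  std::cout << "purge cutoff = " << compute_purge_threshold(groups) << "\n";
  return 0;
}

If that model is right, the real question is why the cached timestamp for
some access groups stops advancing (or never leaves 1) even though
compactions are still being logged.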

Anyway, that's as far as I've managed to get on this problem so far.  I was
wondering if anyone on the list who understands this code better might be
able to look at the problem description and say, "A-ha!  I know what's going
wrong here!"  (And hopefully issue a patch.)  Otherwise, I'll continue
unravelling it a bit at a time.

Josh
