Hi Josh,

Thanks for investigating!  Comments inline ...

On Mon, Sep 15, 2008 at 4:41 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:

> I've managed to load about 2-3 TB into a Hypertable cluster with 15
> RangeServers.  I'm still having problems with the RangeServers failing to
> clean up their old logs.
>
> Server              User commit log in MB
> 10.10.1.194_38060   51512.8
> 10.10.1.200_38060   48050.6
> 10.10.1.199_38060   12464.7
> 10.10.1.190_38060   90.8244
> 10.10.1.193_38060   86.9587
> 10.10.1.188_38060   70.4194
> 10.10.1.191_38060   58.9111
> 10.10.1.198_38060   48.7823
> 10.10.1.192_38060   41.6324
> 10.10.1.202_38060   34.0847
> 10.10.1.189_38060   28.5852
> 10.10.1.197_38060   22.3202
> 10.10.1.196_38060   17.3054
> 10.10.1.201_38060   14.2905
> 10.10.1.195_38060   0
>
> No new updates have been added to this cluster in about 12 hours.  Most of
> the log directories look fine but a few have gigabytes of logs that should
> be cleaned up.
>
> I've added some debug logging to try and isolate what's happening.  For
> each call to CommitLog::purge(), I'm logging the cutoff timestamp used.
> I've also added messages to print out each of the remaining log fragments
> after the purge breaks off:
>
> diff --git a/src/cc/Hypertable/Lib/CommitLog.cc b/src/cc/Hypertable/Lib/CommitLog.cc
> index a6b8920..ea0f894 100644
> --- a/src/cc/Hypertable/Lib/CommitLog.cc
> +++ b/src/cc/Hypertable/Lib/CommitLog.cc
> @@ -232,6 +232,8 @@ int CommitLog::purge(uint64_t timestamp) {
>    CommitLogFileInfo file_info;
>    String fname;
>
> +  HT_DEBUGF("purge: %llu", timestamp);
> +
>    try {
>
>      while (!m_fragment_queue.empty()) {
> @@ -243,7 +245,16 @@ int CommitLog::purge(uint64_t timestamp) {
>          HT_INFOF("Removed log fragment file='%s' timestamp=%lld",
>                   fname.c_str(), file_info.timestamp);
>        }
>        else {
> -        //HT_INFOF("LOG FRAGMENT PURGE breaking because %lld >= %lld", file_info.timestamp, timestamp);
> +        HT_DEBUGF("LOG FRAGMENT PURGE breaking because %lld >= %lld, queue size = %llu",
> +                  file_info.timestamp, timestamp, m_fragment_queue.size());
> +        size_t i = 0;
> +        foreach (CommitLogFileInfo const &info, m_fragment_queue) {
> +          ++i;
> +          HT_DEBUGF("  Fragment %llu: time=%lld size=%llu name=%s%u",
> +                    i, info.timestamp, info.size, info.log_dir.c_str(), info.num);
> +        }
>          break;
>        }
>      }
>
> What I find is that the purge() function is still being called regularly,
> but the timestamp passed in gets stuck at a value earlier than the oldest
> fragment's timestamp and never changes in later calls.
>
> Sometimes the timestamp threshold is just a few minutes earlier than the
> oldest log's timestamp.  Sometimes the passed-in threshold value is 1, which
> seems like a suspicious number.  Here are the first few lines of a recent
> purge from the 52 GB server:
>
> 1221520163 DEBUG Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:235) purge: 1
> 1221520163 DEBUG Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:249) LOG FRAGMENT
> PURGE breaking because 1221271063903383000 >= 1, queue size = 831
> 1221520163 DEBUG Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 1:
> time=1221271063903383000 size=67436344
> name=/hypertable/servers/10.10.1.194_38060/log/user/3
> 1221520163 DEBUG Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 2:
> time=1221271046265865000 size=896929
> name=/hypertable/servers/10.10.1.193_38060/log/9F4D8D5249B24A2C52F06D2B/0
> 1221520163 DEBUG Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 3:
> time=1221271374609078000 size=100133731
> name=/hypertable/servers/10.10.1.194_38060/log/user/4
>
> Yes, the log is currently holding 831 fragments.  Also I see compactions
> are still happening on this server, but they start and finish in the same
> second.
>
> Here's a purge from the 48 GB server:
>
> 1221520459 DEBUG Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:235) purge:
> 1221327270724907001
> 1221520459 DEBUG Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:249) LOG FRAGMENT
> PURGE breaking because 1221327332648604000 >= 1221327270724907001, queue
> size = 622
> 1221520459 DEBUG Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 1:
> time=1221327332648604000 size=100006176
> name=/hypertable/servers/10.10.1.200_38060/log/user/916
> 1221520459 DEBUG Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 2:
> time=1221327395726285000 size=100147244
> name=/hypertable/servers/10.10.1.200_38060/log/user/917
> 1221520459 DEBUG Hypertable.RangeServer :
> (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 3:
> time=1221327503701781000 size=100179860
> name=/hypertable/servers/10.10.1.200_38060/log/user/918
>
> This one has 622 fragments.  The last logged compaction on this server was
> 25 hours ago, though that compaction apparently did some non-trivial work;
> at least, its begin and end log messages are separated by a noticeable
> amount of time.
>

This definitely sheds light on the problem.  I think I've figured out how
the purge timestamp can get set to 1.  It's set by the following line in
AccessGroup::run_compaction():

    m_oldest_cached_timestamp = (m_cell_cache_ptr->size() > 0) ? timestamp.real + 1 : 0;

If, after a split, the split-off portion has a few updates in the split log
but never receives any updates thereafter, then the range could get compacted
with timestamp.real == 0, causing the above line to set
m_oldest_cached_timestamp to 1.  I'll need to investigate a little further
to figure out the other case you point out.
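
To make the effect of that degenerate cutoff concrete, here's a minimal,
self-contained sketch of the purge loop from your diff (the types and values
are stand-ins, not the actual CommitLog code).  Because every fragment's
nanosecond timestamp is >= 1, a cutoff of 1 breaks out on the very first
fragment and nothing is ever removed:

#include <cstddef>
#include <deque>
#include <iostream>
#include <stdint.h>

struct FragmentInfo {
  uint64_t timestamp;   // nanoseconds, e.g. 1221271063903383000
  uint64_t size;
  FragmentInfo(uint64_t t, uint64_t s) : timestamp(t), size(s) { }
};

// Same shape as the purge loop above: drop fragments strictly older than cutoff.
size_t purge(std::deque<FragmentInfo> &queue, uint64_t cutoff) {
  size_t removed = 0;
  while (!queue.empty() && queue.front().timestamp < cutoff) {
    queue.pop_front();
    ++removed;
  }
  return removed;
}

int main() {
  std::deque<FragmentInfo> queue;
  queue.push_back(FragmentInfo(1221271063903383000ULL, 67436344));  // values from the debug output above
  queue.push_back(FragmentInfo(1221271046265865000ULL, 896929));
  queue.push_back(FragmentInfo(1221271374609078000ULL, 100133731));

  std::cout << purge(queue, 1) << " removed with cutoff=1" << std::endl;        // prints 0
  std::cout << purge(queue, 1221520163000000000ULL)
            << " removed with a recent cutoff" << std::endl;                    // prints 3
  return 0;
}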


> The log fragments appear to be in increasing time order, though a lot of
> the fragments seem to be mostly empty.  For example, on the 70 MB server,
> the first fragment contains 70 MB of data followed by 353 fragments with
> size=101.
>

The issue here is that when a range splits, the first thing that happens is
a split log gets installed.  The split log is for capturing any updates to
the portion of the range that is getting split off.  When the split finally
completes, the RangeServer that receives the newly split off portion will be
notified with the name of the split log and will re-play it after it loads
the range.  After it finishes replaying the split log, it will link the
split log into the range's primary commit log.  This is just a commit log
entry that contains the name of the log that is to be linked in.  When it
does this, it rolls the log.  So, in this "degenerate" primary-key-order
insert, you have a bunch of split logs that basically contain nothing being
linked into commit logs that are not receiving any commits.  That's why you
have a bunch of tiny log fragments that contain just a single link entry.
There's a program called 'dumplog' that you can use to dump out a commit
log.  Point it to the log directory and it should print out a summary of the
entries.
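
Here's a rough sketch of that handoff (hypothetical names; this is not the
real RangeServer code), showing why a tiny fragment holding only the single
link entry gets left behind even when the split log itself is empty:

#include <iostream>
#include <string>
#include <vector>

struct Fragment { std::vector<std::string> entries; };

struct CommitLogSketch {
  std::vector<Fragment> fragments;
  CommitLogSketch() : fragments(1) { }                   // start with one open fragment
  void append(const std::string &entry) { fragments.back().entries.push_back(entry); }
  void roll() { fragments.push_back(Fragment()); }       // close the fragment, open a new one
};

// Outline of what the RangeServer receiving the split-off range does.
void receive_split_range(CommitLogSketch &primary_log, const std::string &split_log_dir,
                         const std::vector<std::string> &split_log_updates) {
  for (size_t i = 0; i < split_log_updates.size(); ++i)  // 1. re-play the split log
    std::cout << "replay: " << split_log_updates[i] << std::endl;
  primary_log.append("LINK " + split_log_dir);           // 2. link it into the primary commit log
  primary_log.roll();                                    // 3. roll the log
}

int main() {
  CommitLogSketch primary;
  // A split-off range that never receives any updates: the split log is empty,
  // yet the rolled fragment still holds the single link entry.
  std::vector<std::string> no_updates;
  receive_split_range(primary,
                      "/hypertable/servers/10.10.1.193_38060/log/9F4D8D5249B24A2C52F06D2B",
                      no_updates);
  std::cout << "fragment 0 holds " << primary.fragments[0].entries.size()
            << " entry (the link)" << std::endl;
  return 0;
}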


> I've looked around a bit at where these purge thresholds come from.  When
> the purge starts, the code tries to get the oldest cached timestamp from
> each access group, then initiates a purge of logs older than that.  This
> timestamp seems to be the thing getting stuck (or set to 1), so I tried to
> find out where that gets set.  I think it's set in each access group after a
> compaction to the timestamp of the compaction request (plus 1).  Where the
> compaction timestamp comes from is a bit of a mystery to me.  The range
> seems to schedule the compactions either from m_state.timestamp or
> m_timestamp, depending on whether the compaction is initiated in
> Range::run_compaction or Range::split_compact_and_shrink.  Also, I don't
> understand what's going on with Timestamp.real vs. Timestamp.logical.  The
> flow of this code is just a bit hard for me to track.  ;)
>

This code is definitely questionable and obviously broken.  I've spent the
past week and a half overhauling all of this logic based on two of your
suggestions:

1. Freeze and copy the CellCache during a compaction
2. Separate the timestamp from the revision number

I'm nearly finished with this overhaul, but I have a few more system tests
that I need to get passing.  I hope to have this checked in within two or
three days.  These changes have cleaned up the code tremendously.  Now the
m_oldest_cached_timestamp variable gets set on each call to
AccessGroup::add() and is no longer set by the line above.  Plus, I've
combined add() and replay_add() into just add().
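
For illustration, here's a hedged sketch of that change (hypothetical names
and signature, not the actual code from the overhaul): the oldest cached
timestamp is tracked as updates arrive, so it no longer depends on the
compaction timestamp at all:

#include <stdint.h>
#include <boost/thread/mutex.hpp>

class AccessGroupSketch {
public:
  AccessGroupSketch() : m_oldest_cached_timestamp(0) { }

  // Called for every update added to the cell cache (replayed or live).
  void add(uint64_t timestamp /* nanoseconds */) {
    boost::mutex::scoped_lock lock(m_mutex);
    if (m_oldest_cached_timestamp == 0 || timestamp < m_oldest_cached_timestamp)
      m_oldest_cached_timestamp = timestamp;
    // ... insert the cell into the cache here ...
  }

  // A compaction that freezes, copies, and flushes the cache would reset this
  // to 0 (nothing cached, so no constraint on purging) -- unlike the old code,
  // which could leave it pinned at 1.
  uint64_t oldest_cached_timestamp() {
    boost::mutex::scoped_lock lock(m_mutex);
    return m_oldest_cached_timestamp;
  }

private:
  boost::mutex m_mutex;
  uint64_t m_oldest_cached_timestamp;
};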

> Anyway, that's as far as I've managed to go on this problem so far.  I was
> wondering if anyone on the list who understands this code better might be
> able to look at the problem description and say, "A-ha!  I know what's going
> wrong here!"  (And hopefully issue a patch.)  Otherwise, I'll continue
> unravelling a bit at a time.
>

I should have the new code checked in within a few days, which should take
care of these problems, so I wouldn't bother spending too much time on
this.  Thanks for all of your help.

- Doug
