Thanks, Josh.  This issue should be squared away by the time you get back.
Have a great vacation!

- Doug

On Wed, Sep 17, 2008 at 4:47 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:

> (sorry premature send)
>
> Our current strategy is:
>   1. Store things in Hypertable until something crashes
>   2. Wipe out all the commit logs (losing some data)
>   3. Restart Hypertable
>
> This is ok for our purposes at the moment.  We can treat Hypertable as a
> somewhat lossy large data store.  :)
>
> Josh
>
>
>
> On Wed, Sep 17, 2008 at 4:46 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>
>> Hi Doug, thanks for the reply.  I'm glad to hear of your upcoming update.
>> I'll put my investigation on hold for a while then (well, I have to anyway
>> since I'm going on vacation).
>>
>> Our current strategy is:
>>   1. Store things in Hypertable until something crashes
>>   2. Wipe out all the commit logs (losing some data)
>>
>>
>>
>> On Mon, Sep 15, 2008 at 9:32 PM, Doug Judd <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Josh,
>>>
>>> Thanks for investigating!  Comments inline ...
>>>
>>> On Mon, Sep 15, 2008 at 4:41 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>>
>>>> I've managed to load about 2-3 TB into a Hypertable cluster with 15
>>>> RangeServers.  I'm still having problems with the RangeServers failing to
>>>> clean up their old logs.
>>>>
>>>> Server              User commit log in MB
>>>> 10.10.1.194_38060   51512.8
>>>> 10.10.1.200_38060   48050.6
>>>> 10.10.1.199_38060   12464.7
>>>> 10.10.1.190_38060   90.8244
>>>> 10.10.1.193_38060   86.9587
>>>> 10.10.1.188_38060   70.4194
>>>> 10.10.1.191_38060   58.9111
>>>> 10.10.1.198_38060   48.7823
>>>> 10.10.1.192_38060   41.6324
>>>> 10.10.1.202_38060   34.0847
>>>> 10.10.1.189_38060   28.5852
>>>> 10.10.1.197_38060   22.3202
>>>> 10.10.1.196_38060   17.3054
>>>> 10.10.1.201_38060   14.2905
>>>> 10.10.1.195_38060   0
>>>>
>>>> No new updates have been added to this cluster in about 12 hours.  Most
>>>> of the log directories look fine but a few have gigabytes of logs that
>>>> should be cleaned up.
>>>>
>>>> I've added some debug logging to try and isolate what's happening.  For
>>>> each call to CommitLog::purge(), I'm logging the cutoff timestamp used.
>>>> I've also added messages to print out each of the remaining log fragments
>>>> after the purge breaks off:
>>>>
>>>> diff --git a/src/cc/Hypertable/Lib/CommitLog.cc b/src/cc/Hypertable/Lib/CommitLog.cc
>>>> index a6b8920..ea0f894 100644
>>>> --- a/src/cc/Hypertable/Lib/CommitLog.cc
>>>> +++ b/src/cc/Hypertable/Lib/CommitLog.cc
>>>> @@ -232,6 +232,8 @@ int CommitLog::purge(uint64_t timestamp) {
>>>>    CommitLogFileInfo file_info;
>>>>    String fname;
>>>>
>>>> +  HT_DEBUGF("purge: %llu", timestamp);
>>>> +
>>>>    try {
>>>>
>>>>      while (!m_fragment_queue.empty()) {
>>>> @@ -243,7 +245,16 @@ int CommitLog::purge(uint64_t timestamp) {
>>>>          HT_INFOF("Removed log fragment file='%s' timestamp=%lld", fname.c_str(), file_info.timestamp);
>>>>        }
>>>>        else {
>>>> -        //HT_INFOF("LOG FRAGMENT PURGE breaking because %lld >= %lld", file_info.timestamp, timestamp);
>>>> +        HT_DEBUGF("LOG FRAGMENT PURGE breaking because %lld >= %lld, queue size = %llu",
>>>> +                 file_info.timestamp, timestamp, m_fragment_queue.size());
>>>> +        size_t i = 0;
>>>> +        foreach(CommitLogFileInfo const & info, m_fragment_queue)
>>>> +        {
>>>> +            ++i;
>>>> +            HT_DEBUGF("  Fragment %llu: time=%lld size=%llu name=%s%u",
>>>> +                      i, info.timestamp, info.size, info.log_dir.c_str(),
>>>> +                      info.num);
>>>> +        }
>>>>          break;
>>>>        }
>>>>      }
>>>>
>>>> What I find is that the purge() function is still being called
>>>> regularly, but the timestamp that gets passed in becomes stuck before the
>>>> oldest fragment's timestamp and never changes in later calls.
>>>>
>>>> Sometimes the timestamp threshold is just a few minutes earlier than the
>>>> oldest log's timestamp.  Sometimes the passed in threshold value is 1, 
>>>> which
>>>> seems like a suspicious number.  Here's the first few lines of a recent
>>>> purge from the 52 GB server:
>>>>
>>>> 1221520163 DEBUG Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:235) purge: 1
>>>> 1221520163 DEBUG Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:249) LOG FRAGMENT PURGE breaking because 1221271063903383000 >= 1, queue size = 831
>>>> 1221520163 DEBUG Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 1: time=1221271063903383000 size=67436344 name=/hypertable/servers/10.10.1.194_38060/log/user/3
>>>> 1221520163 DEBUG Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 2: time=1221271046265865000 size=896929 name=/hypertable/servers/10.10.1.193_38060/log/9F4D8D5249B24A2C52F06D2B/0
>>>> 1221520163 DEBUG Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 3: time=1221271374609078000 size=100133731 name=/hypertable/servers/10.10.1.194_38060/log/user/4
>>>>
>>>> Yes, the log is currently holding 831 fragments.  Also I see compactions
>>>> are still happening on this server, but they start and finish in the same
>>>> second.
>>>>
>>>> Here's a purge from the 48 GB server:
>>>>
>>>> 1221520459 DEBUG Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:235) purge: 1221327270724907001
>>>> 1221520459 DEBUG Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:249) LOG FRAGMENT PURGE breaking because 1221327332648604000 >= 1221327270724907001, queue size = 622
>>>> 1221520459 DEBUG Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 1: time=1221327332648604000 size=100006176 name=/hypertable/servers/10.10.1.200_38060/log/user/916
>>>> 1221520459 DEBUG Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 2: time=1221327395726285000 size=100147244 name=/hypertable/servers/10.10.1.200_38060/log/user/917
>>>> 1221520459 DEBUG Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:256)   Fragment 3: time=1221327503701781000 size=100179860 name=/hypertable/servers/10.10.1.200_38060/log/user/918
>>>>
>>>> This one has 622 fragments.  The last logged compaction on this server
>>>> was 25 hours ago, though it apparently did some non-trivial work in that
>>>> time.  At least, the begin and end compaction log messages have some time
>>>> between them.
>>>>
>>>
>>> This definitely sheds light on the problem.  I think I've figured out how
>>> the purge timestamp can get set to 1.  It's set by the following line in
>>> AccessGroup::run_compaction() :
>>>
>>>     m_oldest_cached_timestamp = (m_cell_cache_ptr->size() > 0) ? timestamp.real + 1 : 0;
>>>
>>> If after a split, the split-off portion has a few updates in the split
>>> log, but never receives any updates thereafter, then the range could get
>>> compacted with timestamp.real == 0 causing the above line to set
>>> "m_oldest_cached_timestamp" to 1.  I'll need to investigate a little further
>>> to figure out the other case you point out.
>>>
>>>
>>>> The log fragments appear to be in increasing time order, though a lot of
>>>> the fragments seem to be mostly empty.  For example, on the 70 MB server,
>>>> the first fragment contains 70 MB of data followed by 353 fragments with
>>>> size=101.
>>>>
>>>
>>> The issue here is that when a range splits, the first thing that happens
>>> is a split log gets installed.  The split log is for capturing any updates
>>> to the portion of the range that is getting split off.  When the split
>>> finally completes, the RangeServer that receives the newly split off portion
>>> will be notified with the name of the split log and will re-play it after it
>>> loads the range.  After it finishes replaying the split log, it will link
>>> the split log into the range's primary commit log.  This is just a commit
>>> log entry that contains the name of the log that is to be linked in.  When
>>> it does this, it rolls the log.  So, in this "degenerate" primary-key order
>>> insert, you have a bunch of split logs that basically contain nothing being
>>> linked into commit logs that are not receiving any commits.  That's why you
>>> have a bunch of tiny log fragments that just contain a single link entry.
>>> There's a program called 'dumplog' that you can use to dump out a commit
>>> log.  Point it to the log directory and it should print out a summary of the
>>> entries.
>>>
>>>
>>>> I've looked around a bit at where these purge thresholds come from.
>>>> When the purge starts, the code tries to get the oldest cached timestamp
>>>> from each access group, then initiates a purge of logs older than that.
>>>> This timestamp seems to be the thing getting stuck (or set to 1), so I tried
>>>> to find out where that gets set.  I think it's set in each access group
>>>> after a compaction to the timestamp of the compaction request (plus 1).
>>>> Where the compaction timestamp comes from is a bit of a mystery to me.  The
>>>> range seems to schedule the compactions either from m_state.timestamp or
>>>> m_timestamp, depending on whether the compaction is initiated in
>>>> Range::run_compaction or Range::split_compact_and_shrink.  Also, I don't
>>>> understand what's going on with Timestamp.real vs. Timestamp.logical.  The
>>>> flow of this code is just a bit hard for me to track.  ;)
>>>>
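>>>> To make the failure mode concrete, here is a minimal, self-contained sketch of
>>>> how a single access group whose oldest cached timestamp is stuck at 1 pins the
>>>> purge threshold for the whole commit log.  The values and names are
>>>> illustrative only; the real logic lives in the AccessGroup/CommitLog code:
>>>>
>>>> #include <algorithm>
>>>> #include <cstdint>
>>>> #include <iostream>
>>>> #include <vector>
>>>>
>>>> int main() {
>>>>   // Hypothetical oldest-cached timestamps reported by three access groups.
>>>>   std::vector<uint64_t> oldest_cached = {
>>>>     1221327332648604000ULL,  // healthy groups with recent compactions
>>>>     1221327395726285000ULL,
>>>>     1ULL                     // group compacted with timestamp.real == 0
>>>>   };
>>>>   // The purge cutoff is presumably the oldest (minimum) of these, so one
>>>>   // stuck group caps it at 1 forever.
>>>>   uint64_t threshold =
>>>>       *std::min_element(oldest_cached.begin(), oldest_cached.end());
>>>>   std::cout << "purge: " << threshold << std::endl;  // prints "purge: 1"
>>>>   // CommitLog::purge(threshold) then breaks on the very first fragment,
>>>>   // since every fragment timestamp is >= 1, and nothing is ever removed.
>>>>   return 0;
>>>> }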
>>>
>>> This code is definitely questionable and obviously broken.  I've spent
>>> the past 1 1/2 weeks overhauling all of this logic based on two of your
>>> suggestions:
>>>
>>> 1. Freeze and copy the CellCache during a compaction (see the sketch below)
>>> 2. Separate the timestamp from the revision number
>>>
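>>> Here is a rough, self-contained illustration of what #1 amounts to: freeze the
>>> current cache under the lock, swap in an empty one, and run the compaction
>>> from the frozen copy while new updates keep flowing.  The class and method
>>> names are made up for the example; this is not the actual Hypertable
>>> CellCache code:
>>>
>>> #include <map>
>>> #include <memory>
>>> #include <mutex>
>>> #include <string>
>>>
>>> class CellCacheSketch {
>>> public:
>>>   using Cache = std::map<std::string, std::string>;
>>>
>>>   void add(const std::string &key, const std::string &value) {
>>>     std::lock_guard<std::mutex> lock(m_mutex);
>>>     (*m_active)[key] = value;
>>>   }
>>>
>>>   // Called at the start of a compaction: subsequent adds go to a fresh
>>>   // cache while the returned snapshot is written out to a CellStore.
>>>   std::shared_ptr<const Cache> freeze() {
>>>     std::lock_guard<std::mutex> lock(m_mutex);
>>>     std::shared_ptr<const Cache> frozen = m_active;
>>>     m_active = std::make_shared<Cache>();
>>>     return frozen;
>>>   }
>>>
>>> private:
>>>   std::mutex m_mutex;
>>>   std::shared_ptr<Cache> m_active {std::make_shared<Cache>()};
>>> };
>>>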
>>> I'm nearly finished with this overhaul, but have a few more systems tests
>>> that I need to get passing.  I hope to have this checked in within two or
>>> three days.  These changes have cleaned up the code tremendously.  Now, the
>>> m_oldest_cached_timestamp variable gets set with each call to
>>> AccessGroup::add() and does not get set with the line above.  Plus, I've
>>> combined add() and replay_add() into just add().
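>>>
>>> As a minimal sketch of that new bookkeeping (the names mirror the ones in
>>> this thread, but this is illustrative, not the actual Hypertable code), the
>>> oldest cached timestamp is now recorded as cells arrive instead of being
>>> derived from the compaction timestamp:
>>>
>>> #include <cstdint>
>>>
>>> class AccessGroupSketch {
>>> public:
>>>   // Called for every update; the first cell added to an empty cache
>>>   // establishes the oldest cached timestamp, and later cells can only
>>>   // lower it.
>>>   void add(uint64_t cell_timestamp) {
>>>     if (m_oldest_cached_timestamp == 0 ||
>>>         cell_timestamp < m_oldest_cached_timestamp)
>>>       m_oldest_cached_timestamp = cell_timestamp;
>>>     // ... insert the cell into the cell cache here ...
>>>   }
>>>
>>>   // Hypothetical reset once the frozen cache has been written out; the
>>>   // real code presumably accounts for cells added while the compaction ran.
>>>   void post_compaction() { m_oldest_cached_timestamp = 0; }
>>>
>>>   uint64_t oldest_cached_timestamp() const { return m_oldest_cached_timestamp; }
>>>
>>> private:
>>>   uint64_t m_oldest_cached_timestamp {0};
>>> };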
>>>
>>>> Anyway, that's as far as I've managed to go on this problem so far.  I
>>>> was wondering if anyone on the list who understands this code better might
>>>> be able to look at the problem description and say, "A-ha!  I know what's
>>>> going wrong here!"  (And hopefully issue a patch.)  Otherwise, I'll 
>>>> continue
>>>> unravelling a bit at a time.
>>>>
>>>
>>> I should have the new code checked in within a few days, which should
>>> take care of these problems, so I wouldn't bother spending too much time on
>>> this.  Thanks for all of your help.
>>>
>>> - Doug
>>>
>>>
>>>
>>>
>>
>
> >
>
