Hi Doug,

I did pull in your fix, although I'm not using the newest 0.9.1.0 code:

commit fb817f12954991d58212fccd5e1fdb1564da123c
Author: Doug Judd <[email protected]>
Date:   Wed Dec 3 17:11:49 2008 -0800

    Fixed crash with > 1 maintenance threads

Donald


On Sun, Dec 21, 2008 at 12:40 PM, Doug Judd <[email protected]> wrote:
> Hi Donald,
>
> This stack trace is from an old version of the code that does not have the
> fix.  Try running with the latest code from the git repository.  As far as
> issue 84 goes, I've pushed it into the beta release.
>
> - Doug
>
> On Sat, Dec 20, 2008 at 6:38 PM, Liu Kejia (Donald) <[email protected]>
> wrote:
>>
>> Hi Doug,
>>
>> The only related core file I can find was generated at 01:11 Dec 14,
>> Beijing time. The stack trace shows a problem similar to phenomenon 3
>> in my first post in this thread:
>>
>> #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction
>> (this=0x2a962da7b0, major=false)
>>    at
>> /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
>> 280         HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION);
>> (gdb) where
>> #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction
>> (this=0x2a962da7b0, major=false)
>>    at
>> /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
>> #1  0x00000000005633e9 in Hypertable::Range::run_compaction
>> (this=0x2a962d9190, major=false)
>>    at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:628
>> #2  0x00000000005632af in Hypertable::Range::compact
>> (this=0x2a962d9190, major=false)
>>    at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:611
>> #3  0x000000000055ce7d in
>> Hypertable::MaintenanceTaskCompaction::execute (this=0x2a962d7c80)
>>    at
>> /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceTaskCompaction.cc:38
>> #4  0x0000000000546aef in
>> Hypertable::MaintenanceQueue::Worker::operator() (this=0x5221c1a8)
>>    at
>> /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceQueue.h:108
>> #5  0x0000000000546925 in
>>
>> boost::detail::function::void_function_obj_invoker0<Hypertable::MaintenanceQueue::Worker,
>> void>::invoke (
>>    function_obj_ptr=@0x5221c1a8) at
>>
>> /home/pp/src/hypertable/src/cc/boost-1_34-fix/boost/function/function_template.hpp:158
>> #6  0x0000002a95a16dc7 in boost::function0<void,
>> std::allocator<boost::function_base> >::operator() ()
>>   from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
>> #7  0x0000002a95a16407 in boost::thread_group::join_all () from
>> /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
>> #8  0x000000302b80610a in ?? ()
>> #9  0x0000000000000000 in ?? ()
>>
>> Maybe Phoenix can provide you with more information.
>>
>> About issue 84, I notice it isn't planned to be implemented until
>> release 1.1. Can you give some hints on how you plan to do it, so that
>> I can throw in a quick hack? This problem has really been annoying me
>> these days :(
>>
>> Donald
>>
>>
>> On Sun, Dec 21, 2008 at 5:58 AM, Doug Judd <[email protected]> wrote:
>> > Hi Donald,
>> >
>> > Comments inline ...
>> >
>> > On Sat, Dec 20, 2008 at 12:12 PM, donald <[email protected]> wrote:
>> >>
>> >> Hi Doug,
>> >>
>> >> I'm afraid there are still deeper causes of this bug. With your fix
>> >> applied, it doesn't happen as frequently as before, but it still
>> >> happens after inserting some hundreds of gigabytes of data. We need to
>> >> fix this because the maintenance task is currently the bottleneck of
>> >> the Range Server.
>> >
>> > Can you post a stack trace?
>> >
>> >>
>> >> Actually, Range Server workers can accept updates much faster than the
>> >> maintenance tasks can compact them, and this makes range servers
>> >> unreliable. Consider what happens if we feed Hypertable from MapReduce
>> >> tasks: very soon all range servers are filled with over-sized ranges
>> >> waiting for compaction. The situation gets worse and worse over time,
>> >> because the workers keep accepting updates without knowing that the
>> >> maintenance tasks are seriously lagging and that memory will soon run
>> >> out. In fact, in our application range servers die many times per week
>> >> from out-of-memory errors, which makes maintenance a heavy burden
>> >> because Hypertable doesn't have usable auto-recovery functionality yet.
>> >> To make range servers more reliable, we need a mechanism to slow
>> >> updates down.
>> >
>> > Issue 84 has to do with request throttling.  Once it's finished, the
>> > requests will get held up until the RangeServer is ready to service
>> > them.  This will add backpressure to the application generating the
>> > updates, so you should no longer have any out-of-memory errors.
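
Just so I understand the throttling plan, here is a minimal sketch of
the kind of admission gate I imagine (all names, e.g. UpdateThrottle
and wait_for_room, are made up and not the actual Hypertable API):

  #include <stdint.h>
  #include <boost/thread/mutex.hpp>
  #include <boost/thread/condition.hpp>

  // Workers call wait_for_room() before applying an update; the
  // maintenance code calls release() after a compaction frees memory.
  class UpdateThrottle {
  public:
    explicit UpdateThrottle(uint64_t limit) : m_limit(limit), m_used(0) { }

    // Block the calling worker while memory usage is over the limit
    void wait_for_room() {
      boost::mutex::scoped_lock lock(m_mutex);
      while (m_used >= m_limit)
        m_cond.wait(lock);
    }

    // Called on the update path after data is added to the cell cache
    void add_usage(uint64_t bytes) {
      boost::mutex::scoped_lock lock(m_mutex);
      m_used += bytes;
    }

    // Called after a compaction releases memory
    void release(uint64_t bytes) {
      boost::mutex::scoped_lock lock(m_mutex);
      m_used = (bytes > m_used) ? 0 : m_used - bytes;
      m_cond.notify_all();
    }

  private:
    boost::mutex     m_mutex;
    boost::condition m_cond;
    uint64_t         m_limit, m_used;
  };

The idea being that a worker handling an update blocks in
wait_for_room(), which automatically backs the client off until a
compaction calls release(). Is that roughly the plan?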
>> >
>> >> On the other hand, why should compactions be handled by background
>> >> maintenance tasks? IMHO, a lot of trouble could be saved if we did
>> >> compactions directly in RangeServer::update(). It wouldn't block the
>> >> client initiating the current update, as long as the response message
>> >> is sent before the compaction starts. Upcoming updates wouldn't block
>> >> either, because no lock is needed during compaction and other workers
>> >> can handle those updates. The only situation that could block client
>> >> updates is when all workers are busy doing compactions, which is
>> >> exactly the situation in which clients should slow down.
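
To make the proposal above concrete, here is a self-contained sketch
with stub types and hypothetical names (needs_compaction() and so on);
it is not a patch against the real RangeServer::update(), the point is
only the ordering of respond-then-compact:

  #include <iostream>

  struct ResponseCallback {
    void response_ok() { std::cout << "response sent\n"; }
  };

  struct Range {
    bool needs_compaction() const { return true; }   // stub threshold check
    void compact(bool major) {
      std::cout << (major ? "major" : "minor") << " compaction\n";
    }
  };

  // The worker sends the response first, then does the compaction
  // itself, so the client that issued this update never waits on it.
  void handle_update(ResponseCallback &cb, Range &range) {
    // ... apply the update to the range's cell cache, as today ...
    cb.response_ok();                // client is released here
    if (range.needs_compaction())    // then compact in this worker thread
      range.compact(false);
  }

  int main() {
    ResponseCallback cb;
    Range range;
    handle_update(cb, range);
    return 0;
  }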
>> >
>> > I think our approach to issue 84 will effectively do the same thing.
>> > The nice thing about having a maintenance queue with maintenance
>> > threads is that the compaction and split tasks can get added to the
>> > queue and carried out in priority order.
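
If I understand the priority ordering correctly, it boils down to
something like this (a simplified sketch with made-up names, not the
real MaintenanceQueue):

  #include <queue>
  #include <vector>
  #include <boost/thread/mutex.hpp>
  #include <boost/thread/condition.hpp>

  struct Task {
    int priority;                    // e.g. splits could rank above compactions
    // virtual void execute() = 0;   // omitted in this sketch
  };

  struct TaskCompare {
    bool operator()(const Task *a, const Task *b) const {
      return a->priority < b->priority;   // highest priority comes out first
    }
  };

  class PriorityMaintenanceQueue {
  public:
    void add(Task *task) {
      boost::mutex::scoped_lock lock(m_mutex);
      m_queue.push(task);
      m_cond.notify_one();
    }
    Task *next() {                   // each worker thread loops on this
      boost::mutex::scoped_lock lock(m_mutex);
      while (m_queue.empty())
        m_cond.wait(lock);
      Task *task = m_queue.top();
      m_queue.pop();
      return task;
    }
  private:
    boost::mutex     m_mutex;
    boost::condition m_cond;
    std::priority_queue<Task *, std::vector<Task *>, TaskCompare> m_queue;
  };

That is, whichever worker is free just pops the highest-priority task
next, which does sound like it covers the same ground.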
>> >
>> >> What do you think?
>> >>
>> >> Donald
>> >>
>> >> On Dec 4, 9:32 am, "Doug Judd" <[email protected]> wrote:
>> >> > Hi Donald,
>> >> >
>> >> > I've reproduced this problem and have checked in a fix to the 'next'
>> >> > branch.  This was introduced with the major overhaul.  I have added a
>> >> > multiple maintenance thread system test to prevent this from happening
>> >> > in the future.
>> >> >
>> >> > BTW, if you do pull the 'next' branch, it has a number of changes that
>> >> > make it incompatible with the previous versions.  You'll have to start
>> >> > with a clean database.  The 'next' branch will be compatible with
>> >> > 0.9.1.0 which should get released tomorrow.
>> >> >
>> >> > - Doug
>> >> >
>> >> > On Tue, Dec 2, 2008 at 7:10 PM, donald <[email protected]> wrote:
>> >> >
>> >> > > Hi Doug,
>> >> >
>> >> > > I think it's better to open a new thread on this topic :)
>> >> >
>> >> > > The multiple maintenance thread crash is easy to reproduce: just set
>> >> > > Hypertable.RangeServer.MaintenanceThreads=2, start all the servers
>> >> > > locally on a single node, and run random_write_test 10000000000. The
>> >> > > range server will crash within a minute, but the reason is rather
>> >> > > hard to track down.
>> >> >
>> >> > > What we know so far:
>> >> > > 1. The bug was introduced in version 0.9.0.11; earlier versions
>> >> > > don't have this problem.
>> >> > > 2. According to RangeServer.log, the crash usually happens when two
>> >> > > adjacent ranges are being split concurrently by two maintenance
>> >> > > threads. If we forbid this behavior by modifying the
>> >> > > MaintenanceTaskQueue code, the crash goes away, but we don't know
>> >> > > why. (Phoenix discovered this.)
>> >> > > 3. Sometimes the Range Server fails at HT_EXPECT
>> >> > > (m_immutable_cache_ptr, Error::FAILED_EXPECTATION); in
>> >> > > AccessGroup::run_compaction(). m_immutable_cache_ptr is set to 0 in
>> >> > > multiple places with m_mutex locked, but it is not always checked
>> >> > > under the lock, which looks suspicious.
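
For point 3, the kind of change I would try, purely as a sketch of the
locking pattern (assuming the member is a reference-counted smart
pointer along the lines of CellCachePtr; this is not a tested patch):

  // Inside AccessGroup::run_compaction(): take a copy of the pointer
  // under m_mutex, so a concurrent thread that clears
  // m_immutable_cache_ptr cannot invalidate it between the check and
  // the use.
  CellCachePtr immutable_cache;
  {
    boost::mutex::scoped_lock lock(m_mutex);
    immutable_cache = m_immutable_cache_ptr;
  }
  HT_EXPECT(immutable_cache, Error::FAILED_EXPECTATION);
  // ... use immutable_cache for the rest of the compaction ...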
>> >> >
>> >> > > Do you have any idea based on these facts?
>> >> >
>> >> > > Donald
>> >>
>> >
>> >
>> > >
>> >
>>
>>
>
>
> >
>
