Hi Donald,

This stack trace is from an old version of the code that does not have the fix. Try running with the latest code from the git repository. As far as issue 84 goes, I've pushed it into the beta release.
- Doug

On Sat, Dec 20, 2008 at 6:38 PM, Liu Kejia (Donald) <[email protected]> wrote:
>
> Hi Doug,
>
> The only related core file I can find was generated at 01:11 on Dec 14,
> Beijing time. The stack trace shows a problem similar to phenomenon 3
> in my first post in this thread:
>
> #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction (this=0x2a962da7b0, major=false)
>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
> 280       HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION);
> (gdb) where
> #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction (this=0x2a962da7b0, major=false)
>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
> #1  0x00000000005633e9 in Hypertable::Range::run_compaction (this=0x2a962d9190, major=false)
>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:628
> #2  0x00000000005632af in Hypertable::Range::compact (this=0x2a962d9190, major=false)
>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:611
> #3  0x000000000055ce7d in Hypertable::MaintenanceTaskCompaction::execute (this=0x2a962d7c80)
>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceTaskCompaction.cc:38
> #4  0x0000000000546aef in Hypertable::MaintenanceQueue::Worker::operator() (this=0x5221c1a8)
>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceQueue.h:108
> #5  0x0000000000546925 in boost::detail::function::void_function_obj_invoker0<Hypertable::MaintenanceQueue::Worker, void>::invoke (function_obj_p...@0x5221c1a8)
>     at /home/pp/src/hypertable/src/cc/boost-1_34-fix/boost/function/function_template.hpp:158
> #6  0x0000002a95a16dc7 in boost::function0<void, std::allocator<boost::function_base> >::operator() ()
>     from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
> #7  0x0000002a95a16407 in boost::thread_group::join_all ()
>     from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
> #8  0x000000302b80610a in ?? ()
> #9  0x0000000000000000 in ?? ()
>
> Maybe Phoenix can provide you with more information.
>
> About issue 84, I notice it is not planned to be implemented before
> release 1.1. Can you give some hints on how you plan to do it, so that
> I can throw in a quick hack? This problem really annoys me these days :(
>
> Donald
>
> On Sun, Dec 21, 2008 at 5:58 AM, Doug Judd <[email protected]> wrote:
> > Hi Donald,
> >
> > Comments inline ...
> >
> > On Sat, Dec 20, 2008 at 12:12 PM, donald <[email protected]> wrote:
> >>
> >> Hi Doug,
> >>
> >> I'm afraid there are still deeper causes of this bug. With your fix
> >> applied, it doesn't happen as frequently as before, but it still
> >> happens after inserting some hundreds of gigabytes of data. We need to
> >> fix this because the maintenance task is currently the bottleneck of
> >> the Range Server.
> >
> > Can you post a stack trace?
> >
> >> Actually, Range Server workers can accept updates much faster than
> >> the maintenance tasks can compact them. This makes range servers
> >> unreliable. Consider what happens if we feed Hypertable from MapReduce
> >> tasks: very soon the range servers are all filled with over-sized
> >> ranges waiting for compaction. The situation gets worse and worse as
> >> time goes on, because workers keep accepting updates without knowing
> >> that the maintenance tasks are seriously lagging and memory will soon
> >> run out.
> >> In fact, in our application the range servers die many times per week
> >> due to out-of-memory errors, which makes maintenance a heavy burden
> >> because Hypertable doesn't have usable auto-recovery functionality yet.
> >> To make range servers more reliable, we need a mechanism to slow down.
> >
> > Issue 84 has to do with request throttling. Once it's finished, requests
> > will get held up until the RangeServer is ready to service them. This
> > will add backpressure to the application generating the updates, so you
> > should no longer see any out-of-memory errors.
> >
> >> On the other hand, why should compactions be handled by background
> >> maintenance tasks? IMHO, if we did compactions directly in
> >> RangeServer::update(), a lot of trouble could be saved. It wouldn't
> >> block the client initiating the current update, as long as the response
> >> message is sent before the compaction starts. Upcoming updates wouldn't
> >> block either, because no lock is needed while compacting; other workers
> >> can handle those updates. The only situation that may block client
> >> updates is when all workers are busy doing compactions, which is
> >> exactly when clients should slow down.
> >
> > I think our approach to issue 84 will effectively do the same thing. The
> > nice thing about having a maintenance queue with maintenance threads is
> > that the compaction and split tasks can be added to the queue and carried
> > out in priority order.
> >
> >> What do you think?
> >>
> >> Donald
> >>
> >> On Dec 4, 9:32 am, "Doug Judd" <[email protected]> wrote:
> >> > Hi Donald,
> >> >
> >> > I've reproduced this problem and have checked in a fix to the 'next'
> >> > branch. This was introduced with the major overhaul. I have added a
> >> > multiple-maintenance-thread system test to prevent this from happening
> >> > in the future.
> >> >
> >> > BTW, if you do pull the 'next' branch, it has a number of changes that
> >> > make it incompatible with previous versions. You'll have to start with
> >> > a clean database. The 'next' branch will be compatible with 0.9.1.0,
> >> > which should be released tomorrow.
> >> >
> >> > - Doug
> >> >
> >> > On Tue, Dec 2, 2008 at 7:10 PM, donald <[email protected]> wrote:
> >> >
> >> > > Hi Doug,
> >> > >
> >> > > I think it's better to open a new thread on this topic :)
> >> > >
> >> > > The multiple maintenance thread crash is easy to reproduce: just set
> >> > > Hypertable.RangeServer.MaintenanceThreads=2, start all servers
> >> > > locally on a single node, and run random_write_test 10000000000. The
> >> > > range server will crash within a minute, but the reason is rather
> >> > > hard to track down.
> >> > >
> >> > > What we know so far:
> >> > > 1. The bug was introduced in version 0.9.0.11. Earlier versions
> >> > > don't have this problem.
> >> > > 2. According to RangeServer.log, the crash usually happens when two
> >> > > adjacent ranges are both splitting in two maintenance threads
> >> > > concurrently. If we forbid this behavior by modifying the
> >> > > MaintenanceTaskQueue code, the crash goes away, but the reason is
> >> > > unknown. (Phoenix discovered this.)
> >> > > 3. Sometimes the Range Server fails at
> >> > > HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION); in
> >> > > AccessGroup::run_compaction(). m_immutable_cache_ptr is set to 0 in
> >> > > multiple places with m_mutex locked, but it is not always checked
> >> > > with the lock held, which is suspicious.
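To make point 3 concrete, here is a minimal, purely hypothetical C++ sketch of the pattern being described. Only the names m_immutable_cache_ptr, m_mutex, and the idea of the HT_EXPECT check come from the thread; the surrounding class, the CellCache stand-in, and the use of std::mutex/std::shared_ptr are assumptions for illustration, not the actual Hypertable code.

    #include <cassert>
    #include <memory>
    #include <mutex>

    // Hypothetical stand-in for the real cell cache type.
    struct CellCache {};

    class AccessGroupSketch {
    public:
      // Another thread clears the immutable cache while holding m_mutex.
      void clear_immutable_cache() {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_immutable_cache_ptr.reset();
      }

      // Racy pattern resembling the concern in point 3: the pointer is
      // examined without the lock, so it can be cleared between the check
      // and the use.
      void run_compaction_racy() {
        assert(m_immutable_cache_ptr);        // stand-in for HT_EXPECT, unlocked
        do_compaction(*m_immutable_cache_ptr);
      }

      // Safer pattern: copy the shared pointer while holding the lock, then
      // work on the local copy, which keeps the cache alive for this call.
      void run_compaction_locked() {
        std::shared_ptr<CellCache> cache;
        {
          std::lock_guard<std::mutex> lock(m_mutex);
          cache = m_immutable_cache_ptr;
        }
        assert(cache);                        // stand-in for HT_EXPECT, consistent
        do_compaction(*cache);
      }

    private:
      void do_compaction(CellCache &) { /* compaction work would go here */ }

      std::mutex m_mutex;
      std::shared_ptr<CellCache> m_immutable_cache_ptr{std::make_shared<CellCache>()};
    };

Whether the real code has this exact race is exactly the open question above; the sketch only illustrates why checking the pointer outside the lock is suspicious.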
> >> > >
> >> > > Do you have any idea based on these facts?
> >> > >
> >> > > Donald
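A recurring theme in the thread is holding updates back until the maintenance queue catches up, which is the idea behind issue 84. Purely as an illustration of that kind of backpressure, and not the actual issue-84 design, here is a hypothetical admission gate in C++; the class name, methods, and threshold are all assumptions.

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    // Hypothetical sketch: block update workers while too much uncompacted
    // data is outstanding, releasing them as compactions complete.
    class UpdateThrottle {
    public:
      explicit UpdateThrottle(uint64_t high_water_bytes)
        : m_high_water(high_water_bytes), m_outstanding(0) {}

      // Called by a worker before applying an update of `bytes` bytes.
      // Blocks (back-pressuring the client) while the amount of
      // uncompacted data exceeds the high-water mark.
      void admit(uint64_t bytes) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return m_outstanding < m_high_water; });
        m_outstanding += bytes;
      }

      // Called by the compaction path once `bytes` bytes have been
      // compacted, letting any blocked update workers proceed.
      void release(uint64_t bytes) {
        {
          std::lock_guard<std::mutex> lock(m_mutex);
          m_outstanding = (bytes < m_outstanding) ? m_outstanding - bytes : 0;
        }
        m_cond.notify_all();
      }

    private:
      const uint64_t m_high_water;
      uint64_t m_outstanding;
      std::mutex m_mutex;
      std::condition_variable m_cond;
    };

In such a scheme, the update path would call admit() before buffering each update and the compaction path would call release() afterwards, so clients naturally slow down once too much uncompacted data has accumulated.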
