Hi Donald,

This problem is due to a race condition that was introduced with the major
overhaul.  The attached patch should take care of it.
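
In case it helps to see the shape of it: the suspect was the unlocked check
of m_immutable_cache_ptr that Donald describes in point 3 further down this
thread.  The snippet below is NOT the attached patch, just a rough sketch of
the general pattern of grabbing the pointer under the same mutex that clears
it (type names are approximate):

    void AccessGroup::run_compaction(bool major) {
      CellCachePtr immutable_cache;
      {
        // Take the same mutex that is held wherever m_immutable_cache_ptr
        // gets cleared, so another maintenance thread cannot null it out
        // between the check and the use.
        boost::mutex::scoped_lock lock(m_mutex);
        HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION);
        immutable_cache = m_immutable_cache_ptr;
      }
      // ... the rest of the compaction works off the local reference ...
    }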

- Doug

On Sat, Dec 20, 2008 at 8:59 PM, Liu Kejia (Donald) <[email protected]> wrote:

>
> Hi Doug,
>
> I did pull in your fix, though I didn't use the newest 0.9.1.0 code:
>
> commit fb817f12954991d58212fccd5e1fdb1564da123c
> Author: Doug Judd <[email protected]>
> Date:   Wed Dec 3 17:11:49 2008 -0800
>
>    Fixed crash with > 1 maintenance threads
>
> Donald
>
>
> On Sun, Dec 21, 2008 at 12:40 PM, Doug Judd <[email protected]> wrote:
> > Hi Donald,
> >
> > This stack trace is from an old version of the code that does not have
> > the fix.  Try running with the latest code from the git repository.  As
> > far as issue 84 goes, I've pushed it into the beta release.
> >
> > - Doug
> >
> > On Sat, Dec 20, 2008 at 6:38 PM, Liu Kejia (Donald) <[email protected]> wrote:
> >>
> >> Hi Doug,
> >>
> >> The only related core file I can find was generated at 01:11 on Dec 14,
> >> Beijing time. The stack trace shows a problem similar to phenomenon 3
> >> in my first post in this thread:
> >>
> >> #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction (this=0x2a962da7b0, major=false)
> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
> >> 280         HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION);
> >> (gdb) where
> >> #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction (this=0x2a962da7b0, major=false)
> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
> >> #1  0x00000000005633e9 in Hypertable::Range::run_compaction (this=0x2a962d9190, major=false)
> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:628
> >> #2  0x00000000005632af in Hypertable::Range::compact (this=0x2a962d9190, major=false)
> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:611
> >> #3  0x000000000055ce7d in Hypertable::MaintenanceTaskCompaction::execute (this=0x2a962d7c80)
> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceTaskCompaction.cc:38
> >> #4  0x0000000000546aef in Hypertable::MaintenanceQueue::Worker::operator() (this=0x5221c1a8)
> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceQueue.h:108
> >> #5  0x0000000000546925 in boost::detail::function::void_function_obj_invoker0<
> >>     Hypertable::MaintenanceQueue::Worker, void>::invoke (function_obj_p...@0x5221c1a8)
> >>     at /home/pp/src/hypertable/src/cc/boost-1_34-fix/boost/function/function_template.hpp:158
> >> #6  0x0000002a95a16dc7 in boost::function0<void, std::allocator<boost::function_base> >::operator() ()
> >>     from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
> >> #7  0x0000002a95a16407 in boost::thread_group::join_all ()
> >>     from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
> >> #8  0x000000302b80610a in ?? ()
> >> #9  0x0000000000000000 in ?? ()
> >>
> >> Maybe Phoenix can provide you with more information.
> >>
> >> About issue 84, I notice it isn't planned to be implemented before
> >> release 1.1. Can you give some hints on how you plan to do it, so that
> >> I can throw in a quick hack? This problem has really been annoying me
> >> these days :(
> >>
> >> Donald
> >>
> >>
> >> On Sun, Dec 21, 2008 at 5:58 AM, Doug Judd <[email protected]> wrote:
> >> > Hi Donald,
> >> >
> >> > Comments inline ...
> >> >
> >> >> On Sat, Dec 20, 2008 at 12:12 PM, donald <[email protected]> wrote:
> >> >>
> >> >> Hi Doug,
> >> >>
> >> >> I'm afraid there are still deeper causes of this bug. With your fix
> >> >> applied, it doesn't happen as frequently as before, but it still
> >> >> happens after inserting a few hundred gigabytes of data. We need to
> >> >> fix this because the maintenance task is currently the bottleneck of
> >> >> the Range Server.
> >> >
> >> > Can you post a stack trace?
> >> >
> >> >>
> >> >> Actually, Range Server workers can accept updates much faster than
> >> >> the maintenance tasks can compact them. This makes range servers
> >> >> unreliable. Consider what happens if we feed Hypertable with
> >> >> MapReduce tasks: very soon the range servers are all filled with
> >> >> over-sized ranges waiting for compaction. The situation gets worse
> >> >> and worse as time goes on, because the workers keep accepting updates
> >> >> without knowing that the maintenance tasks are seriously lagging and
> >> >> that memory will soon run out. In fact, in our application the range
> >> >> servers die many times per week from running out of memory, which
> >> >> makes manual maintenance a heavy task because Hypertable doesn't have
> >> >> usable auto-recovery functionality yet. To make range servers more
> >> >> reliable, we need a mechanism to slow the clients down.
> >> >
> >> > Issue 84 has to do with request throttling.  Once it's finished, the
> >> > requests will get held up until the RangeServer is ready to service
> >> > them.  This will add backpressure to the application generating the
> >> > updates, so you should no longer have any out-of-memory errors.
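
Roughly, the throttling check would sit at the top of the update path,
something like the sketch below.  None of these class or member names are
the real RangeServer API; they're made up for illustration:

    #include <list>
    #include <utility>
    #include <stdint.h>
    #include <boost/thread/mutex.hpp>

    class ResponseCallback;   // placeholder
    class UpdateRequest;      // placeholder

    class RangeServer {
    public:
      void update(ResponseCallback *cb, UpdateRequest *request) {
        boost::mutex::scoped_lock lock(m_mutex);
        if (m_memory_used > m_memory_soft_limit) {
          // Too much uncompacted data is buffered: park the request until
          // the maintenance threads catch up.  The client sits waiting for
          // its response, which is the backpressure.
          m_deferred.push_back(std::make_pair(cb, request));
          return;
        }
        // ... normal update handling; m_memory_used grows accordingly ...
      }

    private:
      boost::mutex m_mutex;
      uint64_t m_memory_used;
      uint64_t m_memory_soft_limit;
      std::list<std::pair<ResponseCallback *, UpdateRequest *> > m_deferred;
    };
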
> >> >
> >> >> On the other hand, why should compactions be handled by background
> >> >> maintenance tasks? IMHO, if we do compactions directly in
> >> >> RangeServer::update(), a lot of trouble could be saved. It won't
> >> >> block the client initiating the current update, as long as a response
> >> >> message is sent before the compaction starts. Upcoming updates won't
> >> >> block either, because no lock is needed while doing the compaction,
> >> >> so other workers can handle those updates. The only situation that
> >> >> may block client updates is when all workers are busy doing
> >> >> compactions, which is exactly when clients should slow down.
> >> >
> >> > I think our approach to issue 84 will effectively do the same thing.
> >> > The nice thing about having a maintenance queue with maintenance
> >> > threads is that the compaction and split tasks can get added to the
> >> > queue and carried out in priority order.
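
To make that concrete, the basic shape is a priority queue that the
maintenance threads pull tasks from.  This is only an illustration with
made-up names; the real code in MaintenanceQueue.h is more involved:

    #include <queue>
    #include <vector>
    #include <boost/thread/mutex.hpp>
    #include <boost/thread/condition.hpp>

    class MaintenanceTask {
    public:
      virtual ~MaintenanceTask() {}
      virtual void execute() = 0;          // compaction, split, ...
      virtual int  priority() const = 0;   // higher value == more urgent
    };

    struct LowerPriority {
      bool operator()(MaintenanceTask *a, MaintenanceTask *b) const {
        return a->priority() < b->priority();
      }
    };

    class SimpleMaintenanceQueue {
    public:
      void add(MaintenanceTask *task) {
        boost::mutex::scoped_lock lock(m_mutex);
        m_queue.push(task);
        m_cond.notify_one();
      }
      // Each maintenance thread loops on next()->execute(); the highest
      // priority task in the queue is always handed out first.
      MaintenanceTask *next() {
        boost::mutex::scoped_lock lock(m_mutex);
        while (m_queue.empty())
          m_cond.wait(lock);
        MaintenanceTask *task = m_queue.top();
        m_queue.pop();
        return task;
      }
    private:
      boost::mutex m_mutex;
      boost::condition m_cond;
      std::priority_queue<MaintenanceTask *,
                          std::vector<MaintenanceTask *>,
                          LowerPriority> m_queue;
    };
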
> >> >
> >> >> What do you think?
> >> >>
> >> >> Donald
> >> >>
> >> >> On Dec 4, 9:32 am, "Doug Judd" <[email protected]> wrote:
> >> >> > Hi Donald,
> >> >> >
> >> >> > I've reproduced this problem and have checked in a fix to the
> >> >> > 'next' branch.  This was introduced with the major overhaul.  I
> >> >> > have added a multiple maintenance thread system test to prevent
> >> >> > this from happening in the future.
> >> >> >
> >> >> > BTW, if you do pull the 'next' branch, it has a number of changes
> >> >> > that make it incompatible with the previous versions.  You'll have
> >> >> > to start with a clean database.  The 'next' branch will be
> >> >> > compatible with 0.9.1.0, which should get released tomorrow.
> >> >> >
> >> >> > - Doug
> >> >> >
> >> >> > On Tue, Dec 2, 2008 at 7:10 PM, donald <[email protected]> wrote:
> >> >> >
> >> >> > > Hi Doug,
> >> >> >
> >> >> > > I think it's better to open a new thread on this topic :)
> >> >> >
> >> >> > > The multiple maintenance thread crash is easy to reproduce: just
> >> >> > > set Hypertable.RangeServer.MaintenanceThreads=2, start all servers
> >> >> > > locally on a single node, and run random_write_test 10000000000.
> >> >> > > The range server will crash within a minute. But the reason is
> >> >> > > sort of hard to track down.
> >> >> >
> >> >> > > What we know so far:
> >> >> > > 1. The bug was introduced in version 0.9.0.11. Earlier versions
> >> >> > > don't have this problem.
> >> >> > > 2. According to RangeServer.log, the crash usually happens when two
> >> >> > > adjacent ranges are both splitting concurrently in two maintenance
> >> >> > > threads. If we forbid this behavior by modifying the
> >> >> > > MaintenanceTaskQueue code, the crash goes away, but the reason is
> >> >> > > unknown. (Phoenix discovered this.)
> >> >> > > 3. Sometimes the Range Server fails at
> >> >> > > HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION); in
> >> >> > > AccessGroup::run_compaction(). m_immutable_cache_ptr is set to 0 in
> >> >> > > multiple places with m_mutex locked, but it is not always checked
> >> >> > > with the lock held, which looks suspicious.
> >> >> >
> >> >> > > Do you have any idea based on these facts?
> >> >> >
> >> >> > > Donald
> >> >>
> >> >
> >> >
> >> >
> >>
> >>
> >
> >
> >
>
>


Attachment: multi-maintenance.patch
Description: Binary data
