Thanks Doug, I will apply this patch and try again tomorrow.

Donald

On Sun, Dec 21, 2008 at 3:10 PM, Doug Judd <[email protected]> wrote:
> Hi Donald,
>
> This problem is due to a race condition that was introduced with the major
> overhaul.  The attached patch should take care of it.
>
> - Doug
>
> On Sat, Dec 20, 2008 at 8:59 PM, Liu Kejia (Donald) <[email protected]>
> wrote:
>>
>> Hi Doug,
>>
>> I did pull in your fix, although I wasn't using the newest 0.9.1.0 code:
>>
>> commit fb817f12954991d58212fccd5e1fdb1564da123c
>> Author: Doug Judd <[email protected]>
>> Date:   Wed Dec 3 17:11:49 2008 -0800
>>
>>    Fixed crash with > 1 maintenance threads
>>
>> Donald
>>
>>
>> On Sun, Dec 21, 2008 at 12:40 PM, Doug Judd <[email protected]> wrote:
>> > Hi Donald,
>> >
>> > This stack trace is from an old version of the code that does not have the
>> > fix.  Try running with the latest code from the git repository.  As far as
>> > issue 84 goes, I pushed it into the beta release.
>> >
>> > - Doug
>> >
>> > On Sat, Dec 20, 2008 at 6:38 PM, Liu Kejia (Donald)
>> > <[email protected]>
>> > wrote:
>> >>
>> >> Hi Doug,
>> >>
>> >> The only related core file I can find was generated at 01:11 Dec 14,
>> >> Beijing time. The stack trace shows a problem similar to phenomenon 3
>> >> in my first post in this thread:
>> >>
>> >> #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction (this=0x2a962da7b0, major=false)
>> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
>> >> 280         HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION);
>> >> (gdb) where
>> >> #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction (this=0x2a962da7b0, major=false)
>> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
>> >> #1  0x00000000005633e9 in Hypertable::Range::run_compaction (this=0x2a962d9190, major=false)
>> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:628
>> >> #2  0x00000000005632af in Hypertable::Range::compact (this=0x2a962d9190, major=false)
>> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:611
>> >> #3  0x000000000055ce7d in Hypertable::MaintenanceTaskCompaction::execute (this=0x2a962d7c80)
>> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceTaskCompaction.cc:38
>> >> #4  0x0000000000546aef in Hypertable::MaintenanceQueue::Worker::operator() (this=0x5221c1a8)
>> >>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceQueue.h:108
>> >> #5  0x0000000000546925 in boost::detail::function::void_function_obj_invoker0<Hypertable::MaintenanceQueue::Worker, void>::invoke (function_obj_ptr=@0x5221c1a8)
>> >>     at /home/pp/src/hypertable/src/cc/boost-1_34-fix/boost/function/function_template.hpp:158
>> >> #6  0x0000002a95a16dc7 in boost::function0<void, std::allocator<boost::function_base> >::operator() ()
>> >>     from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
>> >> #7  0x0000002a95a16407 in boost::thread_group::join_all ()
>> >>     from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
>> >> #8  0x000000302b80610a in ?? ()
>> >> #9  0x0000000000000000 in ?? ()
>> >>
>> >> Maybe Phoenix can provide you with more information.
>> >>
>> >> About issue 84, I noticed it isn't planned to be implemented before
>> >> release 1.1. Can you give some hints on how you plan to do it, so that
>> >> I can throw in a quick hack? This problem has really been annoying me
>> >> these days :(
>> >>
>> >> Donald
>> >>
>> >>
>> >> On Sun, Dec 21, 2008 at 5:58 AM, Doug Judd <[email protected]>
>> >> wrote:
>> >> > Hi Donald,
>> >> >
>> >> > Comments inline ...
>> >> >
>> >> > On Sat, Dec 20, 2008 at 12:12 PM, donald <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> Hi Doug,
>> >> >>
>> >> >> I'm afraid there are still deeper causes of this bug. With your fix
>> >> >> applied, it doesn't happen as frequently as before, but it still
>> >> >> happens after inserting some hundreds of gigabytes of data. We need to
>> >> >> fix this, because the maintenance task is currently the bottleneck of
>> >> >> the Range Server.
>> >> >
>> >> > Can you post a stack trace?
>> >> >
>> >> >>
>> >> >> Actually, Range Server workers can accept updates much faster than the
>> >> >> maintenance tasks can compact them, which makes range servers
>> >> >> unreliable. If we feed Hypertable from MapReduce tasks, the range
>> >> >> servers very soon fill up with over-sized ranges waiting for
>> >> >> compaction. The situation gets worse and worse over time because the
>> >> >> workers keep accepting updates without knowing that the maintenance
>> >> >> tasks are seriously lagging and that memory will soon run out. In
>> >> >> fact, in our application the range servers die many times per week
>> >> >> from out-of-memory errors, which makes maintenance a heavy burden
>> >> >> because Hypertable doesn't have usable auto-recovery functionality
>> >> >> yet. To make range servers more reliable, we need a mechanism to slow
>> >> >> clients down.
>> >> >
>> >> > Issue 84 has to do with request throttling.  Once it's finished, the
>> >> > requests will get held up until the RangeServer is ready to service
>> >> > them.  This will add backpressure to the application generating the
>> >> > updates, so you should no longer have any out-of-memory errors.
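>> >> >
>> >> > Roughly, the mechanism looks something like the sketch below (the class
>> >> > and member names are only illustrative, not the actual issue 84 code):
>> >> > updates are admitted against a memory budget, and when the budget is
>> >> > exhausted the caller blocks until maintenance has freed some memory.
>> >> >
>> >> > // Illustrative sketch only -- not the real issue 84 implementation.
>> >> > #include <boost/thread/mutex.hpp>
>> >> > #include <boost/thread/condition.hpp>
>> >> > #include <stdint.h>
>> >> >
>> >> > class UpdateThrottle {
>> >> > public:
>> >> >   UpdateThrottle(uint64_t limit) : m_limit(limit), m_outstanding(0) {}
>> >> >
>> >> >   // Called at the top of the update path; blocks while over budget.
>> >> >   void admit(uint64_t bytes) {
>> >> >     boost::mutex::scoped_lock lock(m_mutex);
>> >> >     while (m_outstanding + bytes > m_limit)
>> >> >       m_cond.wait(lock);
>> >> >     m_outstanding += bytes;
>> >> >   }
>> >> >
>> >> >   // Called once a compaction has written the corresponding data out.
>> >> >   void release(uint64_t bytes) {
>> >> >     boost::mutex::scoped_lock lock(m_mutex);
>> >> >     m_outstanding -= bytes;
>> >> >     m_cond.notify_all();
>> >> >   }
>> >> >
>> >> > private:
>> >> >   boost::mutex     m_mutex;
>> >> >   boost::condition m_cond;
>> >> >   uint64_t         m_limit;
>> >> >   uint64_t         m_outstanding;
>> >> > };
>> >> >
>> >> > From the client's point of view, update() simply takes longer to return
>> >> > when the server is behind -- that's the backpressure.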
>> >> >
>> >> >> On the other hand, why should compactions be handled by background
>> >> >> maintenance tasks at all? IMHO, doing compactions directly in
>> >> >> RangeServer::update() would save a lot of trouble. It wouldn't block
>> >> >> the client that initiated the current update, as long as the response
>> >> >> message is sent before the compaction starts. Upcoming updates
>> >> >> wouldn't block either, because no lock is needed during the compaction
>> >> >> and other workers can handle them. The only situation that would block
>> >> >> client updates is when all workers are busy doing compactions, which
>> >> >> is exactly when clients should slow down.
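>> >> >>
>> >> >> To make it concrete, something along these lines (just a sketch with
>> >> >> stand-in types, not the real RangeServer code):
>> >> >>
>> >> >> // Sketch only: stand-in types, to illustrate "respond first, then
>> >> >> // compact inline in the worker thread".
>> >> >> struct ResponseCallback {
>> >> >>   void response_ok() { /* send the OK reply back to the client */ }
>> >> >> };
>> >> >>
>> >> >> struct Range {
>> >> >>   bool needs_compaction() const { return cache_bytes > threshold; }
>> >> >>   void compact(bool /*major*/) { /* merge cell cache into CellStores */ }
>> >> >>   unsigned long cache_bytes;
>> >> >>   unsigned long threshold;
>> >> >> };
>> >> >>
>> >> >> // A worker thread handling one update request.
>> >> >> void handle_update(ResponseCallback &cb, Range &range) {
>> >> >>   // ... apply the update to the in-memory cell cache ...
>> >> >>   cb.response_ok();          // the client is unblocked from here on
>> >> >>   if (range.needs_compaction())
>> >> >>     range.compact(false);    // minor compaction, inline in the worker
>> >> >> }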
>> >> >
>> >> > I think our approach to issue 84 will effectively do the same thing.
>> >> > The nice thing about having a maintenance queue with maintenance
>> >> > threads is that the compaction and split tasks can get added to the
>> >> > queue and carried out in priority order.
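>> >> >
>> >> > In outline it's just a priority queue drained by N worker threads --
>> >> > something like the sketch below (simplified; the real MaintenanceQueue
>> >> > differs in detail, e.g. in how it handles shutdown and priorities):
>> >> >
>> >> > // Simplified sketch of a priority-ordered maintenance queue.
>> >> > #include <queue>
>> >> > #include <vector>
>> >> > #include <boost/thread/mutex.hpp>
>> >> > #include <boost/thread/condition.hpp>
>> >> >
>> >> > struct Task {
>> >> >   int priority;                    // e.g. splits ahead of compactions
>> >> >   virtual void execute() = 0;
>> >> >   virtual ~Task() {}
>> >> > };
>> >> >
>> >> > struct ByPriority {
>> >> >   bool operator()(const Task *a, const Task *b) const {
>> >> >     return a->priority < b->priority;     // highest priority on top
>> >> >   }
>> >> > };
>> >> >
>> >> > class QueueSketch {
>> >> > public:
>> >> >   void add(Task *task) {
>> >> >     boost::mutex::scoped_lock lock(m_mutex);
>> >> >     m_queue.push(task);
>> >> >     m_cond.notify_one();
>> >> >   }
>> >> >   // Each maintenance thread runs this loop (shutdown handling omitted).
>> >> >   void worker() {
>> >> >     while (true) {
>> >> >       Task *task;
>> >> >       {
>> >> >         boost::mutex::scoped_lock lock(m_mutex);
>> >> >         while (m_queue.empty())
>> >> >           m_cond.wait(lock);
>> >> >         task = m_queue.top();
>> >> >         m_queue.pop();
>> >> >       }
>> >> >       task->execute();      // run outside the lock
>> >> >       delete task;
>> >> >     }
>> >> >   }
>> >> > private:
>> >> >   boost::mutex m_mutex;
>> >> >   boost::condition m_cond;
>> >> >   std::priority_queue<Task*, std::vector<Task*>, ByPriority> m_queue;
>> >> > };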
>> >> >
>> >> >> What do you think?
>> >> >>
>> >> >> Donald
>> >> >>
>> >> >> On Dec 4, 9:32 am, "Doug Judd" <[email protected]> wrote:
>> >> >> > Hi Donald,
>> >> >> >
>> >> >> > I've reproduced this problem and have checked in a fix to the 'next'
>> >> >> > branch.  This was introduced with the major overhaul.  I have added a
>> >> >> > multiple maintenance thread system test to prevent this from happening
>> >> >> > in the future.
>> >> >> >
>> >> >> > BTW, if you do pull the 'next' branch, it has a number of changes that
>> >> >> > make it incompatible with the previous versions.  You'll have to start
>> >> >> > with a clean database.  The 'next' branch will be compatible with
>> >> >> > 0.9.1.0, which should get released tomorrow.
>> >> >> >
>> >> >> > - Doug
>> >> >> >
>> >> >> > On Tue, Dec 2, 2008 at 7:10 PM, donald <[email protected]>
>> >> >> > wrote:
>> >> >> >
>> >> >> > > Hi Doug,
>> >> >> >
>> >> >> > > I think it's better to open a new thread on this topic :)
>> >> >> >
>> >> >> > > The multiple maintenance thread crash is easy to reproduce: just set
>> >> >> > > Hypertable.RangeServer.MaintenanceThreads=2, start all servers
>> >> >> > > locally on a single node and run random_write_test 10000000000. The
>> >> >> > > range server will crash in a minute. But the reason is sort of hard
>> >> >> > > to track.
>> >> >> >
>> >> >> > > What we know so far:
>> >> >> > > 1. The bug was introduced in version 0.9.0.11; earlier versions
>> >> >> > > don't have this problem.
>> >> >> > > 2. According to RangeServer.log, the crash usually happens when two
>> >> >> > > adjacent ranges are being split by two maintenance threads
>> >> >> > > concurrently. If we forbid this behavior by modifying the
>> >> >> > > MaintenanceTaskQueue code, the crash goes away, but the reason is
>> >> >> > > unknown. (Phoenix discovered this.)
>> >> >> > > 3. Sometimes the Range Server fails at
>> >> >> > > HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION); in
>> >> >> > > AccessGroup::run_compaction(). m_immutable_cache_ptr is set to 0 in
>> >> >> > > multiple places with m_mutex locked, but it is not always checked
>> >> >> > > while holding the lock, which looks suspicious (see the sketch
>> >> >> > > below).
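>> >> >> > >
>> >> >> > > For example, something like the sketch below (a stand-in class, only
>> >> >> > > to show the kind of locking I mean) would at least rule out an
>> >> >> > > unlocked read of the pointer:
>> >> >> > >
>> >> >> > > // Sketch with a stand-in class, not the real AccessGroup: read
>> >> >> > > // m_immutable_cache_ptr only while holding the same m_mutex that
>> >> >> > > // the writers hold when they clear it.
>> >> >> > > #include <boost/thread/mutex.hpp>
>> >> >> > >
>> >> >> > > struct AccessGroupSketch {
>> >> >> > >   boost::mutex m_mutex;
>> >> >> > >   void *m_immutable_cache_ptr;
>> >> >> > >
>> >> >> > >   void run_compaction() {
>> >> >> > >     boost::mutex::scoped_lock lock(m_mutex);
>> >> >> > >     if (m_immutable_cache_ptr == 0)
>> >> >> > >       return;           // cleared by another thread; nothing to do
>> >> >> > >     // ... do the compaction using the immutable cache ...
>> >> >> > >   }
>> >> >> > > };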
>> >> >> >
>> >> >> > > Do you have any idea based on these facts?
>> >> >> >
>> >> >> > > Donald
>> >> >>
>> >> >
>> >> >
>> >> > >
>> >> >
>> >>
>> >>
>> >
>> >
>> > >
>> >
>>
>>
>
>
> >
>
