Hi Doug,

The only related core file I can find was generated at 01:11 on Dec 14,
Beijing time. The stack trace shows a problem similar to phenomenon 3
in my first post in this thread:

#0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction
(this=0x2a962da7b0, major=false)
    at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
280         HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION);
(gdb) where
#0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction
(this=0x2a962da7b0, major=false)
    at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
#1  0x00000000005633e9 in Hypertable::Range::run_compaction
(this=0x2a962d9190, major=false)
    at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:628
#2  0x00000000005632af in Hypertable::Range::compact
(this=0x2a962d9190, major=false)
    at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:611
#3  0x000000000055ce7d in
Hypertable::MaintenanceTaskCompaction::execute (this=0x2a962d7c80)
    at 
/home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceTaskCompaction.cc:38
#4  0x0000000000546aef in
Hypertable::MaintenanceQueue::Worker::operator() (this=0x5221c1a8)
    at 
/home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceQueue.h:108
#5  0x0000000000546925 in
boost::detail::function::void_function_obj_invoker0<Hypertable::MaintenanceQueue::Worker,
void>::invoke (
    function_obj_ptr=@0x5221c1a8) at
/home/pp/src/hypertable/src/cc/boost-1_34-fix/boost/function/function_template.hpp:158
#6  0x0000002a95a16dc7 in boost::function0<void,
std::allocator<boost::function_base> >::operator() ()
   from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
#7  0x0000002a95a16407 in boost::thread_group::join_all () from
/usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
#8  0x000000302b80610a in ?? ()
#9  0x0000000000000000 in ?? ()

Phoenix may be able to provide you with more information.
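
In case it helps, here is the kind of change I was wondering about for
phenomenon 3: taking m_mutex around the check so it can't race with the
code paths that clear m_immutable_cache_ptr. This is only a sketch from
my reading of AccessGroup.cc (I'm guessing at the smart-pointer type
name and eliding everything else the method does), not a tested patch:

// Sketch only: check m_immutable_cache_ptr under the same mutex that
// protects the places which reset it, and keep a local reference so the
// compaction can keep working after the lock is dropped.
void AccessGroup::run_compaction(bool major) {
  CellCachePtr immutable_cache;
  {
    boost::mutex::scoped_lock lock(m_mutex);
    HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION);
    immutable_cache = m_immutable_cache_ptr;
  }
  // ... the rest of the compaction would work off the local copy ...
}

I don't know whether holding a reference like that is safe with the rest
of the maintenance logic, so please treat it as a question rather than a
proposal.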

About issue 84, I notice it isn't planned for implementation until
release 1.1. Can you give some hints on how you plan to do it, so that
I can throw in a quick hack in the meantime? This problem has really
been annoying me lately :(
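
To make the question concrete, this is roughly the shape of the quick
hack I had in mind. None of these names exist in the tree -- it is purely
an illustration of blocking workers when too many un-compacted bytes are
outstanding:

#include <stdint.h>
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition.hpp>

// Hypothetical throttle: RangeServer::update() would call wait_for_room()
// before applying an update, and the maintenance task would call
// release() after a compaction frees memory.
class UpdateThrottle {
public:
  UpdateThrottle(uint64_t limit) : m_limit(limit), m_outstanding(0) { }

  void wait_for_room(uint64_t incoming) {
    boost::mutex::scoped_lock lock(m_mutex);
    while (m_outstanding + incoming > m_limit)
      m_cond.wait(lock);   // block this worker until compactions catch up
    m_outstanding += incoming;
  }

  void release(uint64_t freed) {
    boost::mutex::scoped_lock lock(m_mutex);
    m_outstanding = (freed < m_outstanding) ? m_outstanding - freed : 0;
    m_cond.notify_all();
  }

private:
  boost::mutex     m_mutex;
  boost::condition m_cond;
  uint64_t         m_limit;
  uint64_t         m_outstanding;
};

If your plan for issue 84 holds the requests somewhere else instead, that
is fine with me -- I mainly want something that applies backpressure
before the server runs out of memory.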

Donald


On Sun, Dec 21, 2008 at 5:58 AM, Doug Judd <[email protected]> wrote:
> Hi Donald,
>
> Comments inline ...
>
> On Sat, Dec 20, 2008 at 12:12 PM, donald <[email protected]> wrote:
>>
>> Hi Doug,
>>
>> I'm afraid there are still deeper causes of this bug. With your fix
>> applied, it doesn't happen as frequently as before, but it still
>> happens after inserting a few hundred gigabytes of data. We need to
>> fix this because the maintenance task is currently the bottleneck of
>> the Range Server.
>
> Can you post a stack trace?
>
>>
>> Actually, Range Server workers can accept updates much faster than
>> the maintenance tasks can compact them, and this makes range servers
>> unreliable. If we feed Hypertable from MapReduce tasks, the range
>> servers are very soon filled with over-sized ranges waiting for
>> compaction. The situation gets worse and worse over time because
>> workers keep accepting updates without knowing that the maintenance
>> tasks are seriously lagging and that memory will soon run out. In
>> fact, in our application the range servers die many times per week
>> from out-of-memory errors, which makes maintenance a heavy burden
>> because Hypertable doesn't have usable auto-recovery functionality
>> yet. To make range servers more reliable, we need a mechanism to
>> slow updates down.
>
> Issue 84 has to do with request throttling.  Once it's finished, the
> requests will get held up until the RangeServer is ready to service them.
> This will add backpressure to the application generating the updates, so you
> should no longer have any out-of-memory errors.
>
>> On the other hand, why should compactions be handled by background
>> maintenance tasks? IMHO, if we did compactions directly in
>> RangeServer::update(), a lot of trouble could be saved. It wouldn't
>> block the client initiating the current update, as long as the
>> response message is sent before the compaction starts. Upcoming
>> updates wouldn't block either, because no lock is needed during
>> compaction and other workers can handle those updates. The only
>> situation that could block client updates is when all workers are
>> busy doing compactions, which is exactly when clients should slow
>> down.
>
> I think our approach to issue 84 will effectively do the same thing.  The
> nice thing about having a maintenance queue with maintenance threads is that
> the compaction and split tasks can get added to the queue and carried out in
> priority order.
>
>> What do you think?
>>
>> Donald
>>
>> On Dec 4, 9:32 am, "Doug Judd" <[email protected]> wrote:
>> > Hi Donald,
>> >
>> > I've reproduced this problem and have checked in a fix to the 'next'
>> > branch.  This was introduced with the major overhaul.  I have added a
>> > multiple maintenance thread system test to prevent this from happening
>> > in
>> > the future.
>> >
>> > BTW, if you do pull the 'next' branch, it has a number of changes that
>> > make
>> > it incompatible with the previous versions.  You'll have to start with a
>> > clean database.  The 'next' branch will be compatible with 0.9.1.0 which
>> > should get released tomorrow.
>> >
>> > - Doug
>> >
>> > On Tue, Dec 2, 2008 at 7:10 PM, donald <[email protected]> wrote:
>> >
>> > > Hi Doug,
>> >
>> > > I think it's better to open a new thread on this topic :)
>> >
>> > > The multiple maintenance thread crash is easy to reproduce: just set
>> > > Hypertable.RangeServer.MaintenanceThreads=2, start all servers locally
>> > > on a single node and run random_write_test 10000000000. The range
>> > > server will crash in a minute. But the reason is sort of hard to
>> > > track.
>> >
>> > > What we know so far:
>> > > 1. The bug was introduced in version 0.9.0.11; earlier versions don't
>> > > have this problem.
>> > > 2. According to RangeServer.log, the crash usually happens when two
>> > > adjacent ranges are both splitting in two maintenance threads
>> > > concurrently. If we forbid this behavior by modifying the
>> > > MaintenanceTaskQueue code, the crash goes away, but the reason
>> > > is unknown. (Phoenix discovered this.)
>> > > 3. Sometimes the Range Server fails at HT_EXPECT
>> > > (m_immutable_cache_ptr, Error::FAILED_EXPECTATION); in
>> > > AccessGroup::run_compaction(). m_immutable_cache_ptr is set to 0 in
>> > > multiple places with m_mutex locked, but it isn't always checked
>> > > with the lock held, which looks suspicious.
>> >
>> > > Do you have any idea based on these facts?
>> >
>> > > Donald
>>
>
>
> >
>
