Ok, I'll look into it.

- Doug

On Sat, Dec 20, 2008 at 8:59 PM, Liu Kejia (Donald) <[email protected]> wrote:
> Hi Doug,
>
> I did pull in your fix, although I didn't use the newest 0.9.1.0 code:
>
> commit fb817f12954991d58212fccd5e1fdb1564da123c
> Author: Doug Judd <[email protected]>
> Date: Wed Dec 3 17:11:49 2008 -0800
>
>     Fixed crash with > 1 maintenance threads
>
> Donald
>
> On Sun, Dec 21, 2008 at 12:40 PM, Doug Judd <[email protected]> wrote:
> > Hi Donald,
> >
> > This stack trace is from an old version of the code that does not have the fix. Try running with the latest code from the git repository. As far as issue 84 goes, I pushed it into the beta release.
> >
> > - Doug
> >
> > On Sat, Dec 20, 2008 at 6:38 PM, Liu Kejia (Donald) <[email protected]> wrote:
> > > Hi Doug,
> > >
> > > The only related core file I can find was generated at 01:11 Dec 14 Beijing time. The stack trace shows a problem similar to phenomenon 3 in my first post in this thread:
> > >
> > > #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction (this=0x2a962da7b0, major=false)
> > >     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
> > > 280       HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION);
> > > (gdb) where
> > > #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction (this=0x2a962da7b0, major=false)
> > >     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
> > > #1  0x00000000005633e9 in Hypertable::Range::run_compaction (this=0x2a962d9190, major=false)
> > >     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:628
> > > #2  0x00000000005632af in Hypertable::Range::compact (this=0x2a962d9190, major=false)
> > >     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:611
> > > #3  0x000000000055ce7d in Hypertable::MaintenanceTaskCompaction::execute (this=0x2a962d7c80)
> > >     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceTaskCompaction.cc:38
> > > #4  0x0000000000546aef in Hypertable::MaintenanceQueue::Worker::operator() (this=0x5221c1a8)
> > >     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceQueue.h:108
> > > #5  0x0000000000546925 in boost::detail::function::void_function_obj_invoker0<Hypertable::MaintenanceQueue::Worker, void>::invoke (function_obj_p...@0x5221c1a8)
> > >     at /home/pp/src/hypertable/src/cc/boost-1_34-fix/boost/function/function_template.hpp:158
> > > #6  0x0000002a95a16dc7 in boost::function0<void, std::allocator<boost::function_base> >::operator() () from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
> > > #7  0x0000002a95a16407 in boost::thread_group::join_all () from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
> > > #8  0x000000302b80610a in ?? ()
> > > #9  0x0000000000000000 in ?? ()
> > >
> > > Maybe Phoenix can provide you with more information.
> > >
> > > About issue 84, I notice it is not planned to be implemented before release 1.1. Can you give some hints on how you plan to do it, so that I may throw in a quick hack? This problem really annoys me these days :(
> > >
> > > Donald
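Issue 84 (request throttling) comes up again in Doug's reply below. As a rough illustration of the kind of "quick hack" being asked about here (it is not code from the Hypertable tree), the sketch below blocks incoming updates while buffered cell-cache data is above a high-water mark and lets them through again once compactions have caught up. The class name, method names and the byte accounting are invented for the example, and it uses std:: threading primitives for brevity rather than the boost ones the code base is built on.

    #include <algorithm>
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    // Hypothetical admission gate (not part of Hypertable): update workers
    // call admit() before buffering data, maintenance threads call
    // compacted() after flushing a cell cache to disk.
    class UpdateThrottle {
    public:
      explicit UpdateThrottle(uint64_t high_water) : m_high_water(high_water) {}

      // Update path: block while buffered data is over the high-water mark,
      // then account for the bytes about to be added to the cell cache.
      void admit(uint64_t bytes) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return m_buffered < m_high_water; });
        m_buffered += bytes;
      }

      // Maintenance path: a compaction has written this many bytes to disk.
      void compacted(uint64_t bytes) {
        {
          std::lock_guard<std::mutex> lock(m_mutex);
          m_buffered -= std::min(bytes, m_buffered);
        }
        m_cond.notify_all();
      }

    private:
      std::mutex m_mutex;
      std::condition_variable m_cond;
      const uint64_t m_high_water;
      uint64_t m_buffered = 0;
    };

In this sketch an update worker would call admit() before copying the client's buffer into the cell cache, and compacted() would be called at the end of a compaction; the delayed responses then push back naturally on whatever is generating the updates.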
> > > On Sun, Dec 21, 2008 at 5:58 AM, Doug Judd <[email protected]> wrote:
> > > > Hi Donald,
> > > >
> > > > Comments inline ...
> > > >
> > > > On Sat, Dec 20, 2008 at 12:12 PM, donald <[email protected]> wrote:
> > > > > Hi Doug,
> > > > >
> > > > > I'm afraid there are still deeper causes of this bug. With your fix applied, it doesn't happen as frequently as before, but it still happens after inserting some hundreds of gigabytes of data. We need to fix this because the maintenance task is currently the bottleneck of the Range Server.
> > > >
> > > > Can you post a stack trace?
> > > >
> > > > > Actually, Range Server workers can accept updates much faster than the maintenance task can compact them. This makes range servers unreliable. Consider feeding Hypertable from MapReduce tasks: very soon the range servers are all filled with over-sized ranges waiting for compaction. The situation gets worse and worse over time, because the workers keep accepting updates without knowing that the maintenance tasks are seriously lagging and that memory will soon run out. In fact, in our application range servers die many times per week from running out of memory, and that makes maintenance a heavy chore because Hypertable doesn't have usable auto-recovery functionality yet. To make range servers more reliable, we need a mechanism to slow down.
> > > >
> > > > Issue 84 has to do with request throttling. Once it's finished, requests will get held up until the RangeServer is ready to service them. This will add backpressure to the application generating the updates, so you should no longer have any out-of-memory errors.
> > > >
> > > > > On the other hand, why should compactions be handled by background maintenance tasks? IMHO, if we did compactions directly in RangeServer::update(), a lot of trouble could be saved. It wouldn't block the client that initiated the current update, as long as a response message is sent before the compaction starts. Upcoming updates wouldn't block either, because no lock is needed while doing a compaction; other workers can handle those updates. The only situation that could block client updates is when all workers are busy doing compactions, which is exactly the situation in which clients should slow down.
> > > >
> > > > I think our approach to issue 84 will effectively do the same thing. The nice thing about having a maintenance queue with maintenance threads is that the compaction and split tasks can get added to the queue and carried out in priority order.
> > > >
> > > > > What do you think?
> > > > >
> > > > > Donald
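Doug's point above about the maintenance queue has roughly the following shape. This is a sketch with invented names, not the real MaintenanceQueue: tasks carry a priority, worker threads always pop the highest-priority task first, so a split can jump ahead of a backlog of compactions.

    #include <condition_variable>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // A maintenance task is a priority plus the work to run; higher
    // priority runs first, so splits can overtake queued compactions.
    struct MaintenanceTask {
      int priority;                      // e.g. split = 2, compaction = 1
      std::function<void()> execute;
    };

    struct ByPriority {
      bool operator()(const MaintenanceTask &a, const MaintenanceTask &b) const {
        return a.priority < b.priority;  // max-heap on priority
      }
    };

    class PriorityMaintenanceQueue {
    public:
      explicit PriorityMaintenanceQueue(std::size_t nthreads) {
        for (std::size_t i = 0; i < nthreads; ++i)
          m_workers.emplace_back([this] { worker(); });
      }

      ~PriorityMaintenanceQueue() {
        {
          std::lock_guard<std::mutex> lock(m_mutex);
          m_shutdown = true;
        }
        m_cond.notify_all();
        for (auto &t : m_workers)
          t.join();
      }

      void add(MaintenanceTask task) {
        {
          std::lock_guard<std::mutex> lock(m_mutex);
          m_queue.push(std::move(task));
        }
        m_cond.notify_one();
      }

    private:
      void worker() {
        for (;;) {
          MaintenanceTask task;
          {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_cond.wait(lock, [this] { return m_shutdown || !m_queue.empty(); });
            if (m_queue.empty())
              return;                    // shutdown requested and nothing left
            task = m_queue.top();
            m_queue.pop();
          }
          task.execute();                // run the task outside the lock
        }
      }

      std::mutex m_mutex;
      std::condition_variable m_cond;
      std::priority_queue<MaintenanceTask, std::vector<MaintenanceTask>, ByPriority> m_queue;
      std::vector<std::thread> m_workers;
      bool m_shutdown = false;
    };

Note that nothing in this sketch stops two workers from picking up tasks that touch adjacent ranges at the same time, which is exactly the situation Phoenix had to work around further down in the thread; with more than one maintenance thread, some serialization of overlapping tasks (or per-range locking) is still needed on top of the priority ordering.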
> > > > > On Dec 4, 9:32 am, "Doug Judd" <[email protected]> wrote:
> > > > > > Hi Donald,
> > > > > >
> > > > > > I've reproduced this problem and have checked in a fix to the 'next' branch. This was introduced with the major overhaul. I have added a multiple-maintenance-thread system test to prevent this from happening in the future.
> > > > > >
> > > > > > BTW, if you do pull the 'next' branch, it has a number of changes that make it incompatible with previous versions. You'll have to start with a clean database. The 'next' branch will be compatible with 0.9.1.0, which should get released tomorrow.
> > > > > >
> > > > > > - Doug
> > > > > >
> > > > > > On Tue, Dec 2, 2008 at 7:10 PM, donald <[email protected]> wrote:
> > > > > > > Hi Doug,
> > > > > > >
> > > > > > > I think it's better to open a new thread on this topic :)
> > > > > > >
> > > > > > > The multiple maintenance thread crash is easy to reproduce: just set Hypertable.RangeServer.MaintenanceThreads=2, start all the servers locally on a single node, and run random_write_test 10000000000. The range server will crash within a minute. But the reason is rather hard to track down.
> > > > > > >
> > > > > > > What we know so far:
> > > > > > > 1. The bug was introduced in version 0.9.0.11. Earlier versions don't have this problem.
> > > > > > > 2. According to RangeServer.log, the crash usually happens when two adjacent ranges are being split by two maintenance threads concurrently. If we forbid this behavior by modifying the MaintenanceTaskQueue code, the crash goes away, but the reason is unknown. (Phoenix discovered this.)
> > > > > > > 3. Sometimes the Range Server fails at HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION); in AccessGroup::run_compaction(). m_immutable_cache_ptr is set to 0 in multiple places with m_mutex locked, but it is not always checked while holding the lock, which looks suspicious.
> > > > > > >
> > > > > > > Do you have any idea based on these facts?
> > > > > > >
> > > > > > > Donald
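To make point 3 above concrete, here is the general shape of that kind of race, using simplified stand-in types rather than the real AccessGroup code. If one thread clears a shared pointer while holding the mutex, but another thread tests and then dereferences the member without holding the same mutex, the check can pass and the pointer can still be gone by the time it is used; taking a counted copy while the lock is held closes that window. Whether this is what actually bites run_compaction() would need checking against the real code; the sketch only shows why an unlocked check of a pointer that other threads clear under m_mutex looks suspicious.

    #include <memory>
    #include <mutex>

    struct CellCache { /* stand-in for the real immutable cache */ };

    class AccessGroupSketch {
    public:
      // Another thread (e.g. after a split or compaction) drops the
      // immutable cache while holding the mutex.
      void clear_immutable_cache() {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_immutable_cache.reset();
      }

      // Racy shape: the member is tested without m_mutex held, so it can
      // be cleared between the check and the use.
      void run_compaction_racy() {
        if (!m_immutable_cache)          // the check may pass...
          return;
        use(*m_immutable_cache);         // ...yet this can still hit a cleared pointer
      }

      // Safer shape: take a reference-counted copy while the lock is held
      // and work on the copy; a concurrent reset() can no longer pull the
      // object out from under this thread.
      void run_compaction_locked_copy() {
        std::shared_ptr<CellCache> cache;
        {
          std::lock_guard<std::mutex> lock(m_mutex);
          cache = m_immutable_cache;
        }
        if (!cache)
          return;                        // nothing to compact; not a crash
        use(*cache);
      }

    private:
      void use(CellCache &) { /* compaction work would go here */ }

      std::mutex m_mutex;
      std::shared_ptr<CellCache> m_immutable_cache;
    };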
