Hi Doug, I did pull in your fix, although I didn't use the newest 0.9.1.0 code:
commit fb817f12954991d58212fccd5e1fdb1564da123c
Author: Doug Judd <[email protected]>
Date:   Wed Dec 3 17:11:49 2008 -0800

    Fixed crash with > 1 maintenance threads

Donald

On Sun, Dec 21, 2008 at 12:40 PM, Doug Judd <[email protected]> wrote:
> Hi Donald,
>
> This stack trace is from an old version of the code that does not have the
> fix. Try running with the latest code from the git repository. As far as
> issue 84 goes, I pushed it into the beta release.
>
> - Doug
>
> On Sat, Dec 20, 2008 at 6:38 PM, Liu Kejia (Donald) <[email protected]> wrote:
>>
>> Hi Doug,
>>
>> The only related core file I can find was generated at 01:11 on Dec 14,
>> Beijing time. The stack trace shows a problem similar to phenomenon 3
>> in my first post in this thread:
>>
>> #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction (this=0x2a962da7b0, major=false)
>>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
>> 280       HT_EXPECT(m_immutable_cache_ptr, Error::FAILED_EXPECTATION);
>> (gdb) where
>> #0  0x00000000005763e6 in Hypertable::AccessGroup::run_compaction (this=0x2a962da7b0, major=false)
>>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:280
>> #1  0x00000000005633e9 in Hypertable::Range::run_compaction (this=0x2a962d9190, major=false)
>>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:628
>> #2  0x00000000005632af in Hypertable::Range::compact (this=0x2a962d9190, major=false)
>>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:611
>> #3  0x000000000055ce7d in Hypertable::MaintenanceTaskCompaction::execute (this=0x2a962d7c80)
>>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceTaskCompaction.cc:38
>> #4  0x0000000000546aef in Hypertable::MaintenanceQueue::Worker::operator() (this=0x5221c1a8)
>>     at /home/pp/src/hypertable/src/cc/Hypertable/RangeServer/MaintenanceQueue.h:108
>> #5  0x0000000000546925 in boost::detail::function::void_function_obj_invoker0<Hypertable::MaintenanceQueue::Worker, void>::invoke (function_obj_p...@0x5221c1a8)
>>     at /home/pp/src/hypertable/src/cc/boost-1_34-fix/boost/function/function_template.hpp:158
>> #6  0x0000002a95a16dc7 in boost::function0<void, std::allocator<boost::function_base> >::operator() ()
>>     from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
>> #7  0x0000002a95a16407 in boost::thread_group::join_all ()
>>     from /usr/local/lib/libboost_thread-gcc34-mt-1_34_1.so.1.34.1
>> #8  0x000000302b80610a in ?? ()
>> #9  0x0000000000000000 in ?? ()
>>
>> Maybe Phoenix can provide you with more information.
>>
>> About issue 84, I notice it is not planned to be implemented before
>> release 1.1. Can you give some hints on how you plan to do it, so that
>> I can throw in a quick hack? This problem has really been annoying me
>> these days :(
>>
>> Donald
>>
>> On Sun, Dec 21, 2008 at 5:58 AM, Doug Judd <[email protected]> wrote:
>> > Hi Donald,
>> >
>> > Comments inline ...
>> >
>> > On Sat, Dec 20, 2008 at 12:12 PM, donald <[email protected]> wrote:
>> >>
>> >> Hi Doug,
>> >>
>> >> I'm afraid there are still deeper causes of this bug. With your fix
>> >> applied, it doesn't happen as frequently as before, but it still
>> >> happens after inserting some hundreds of gigabytes of data. We need to
>> >> fix this because the maintenance task is currently the bottleneck of
>> >> the Range Server.
>> >
>> > Can you post a stack trace?
>> >
>> >> Actually, Range Server workers can accept updates much faster than
>> >> the maintenance tasks can compact them. This makes range servers
>> >> unreliable. If we feed Hypertable from MapReduce tasks, the range
>> >> servers are very soon filled with over-sized ranges waiting for
>> >> compaction. The situation gets worse and worse over time because the
>> >> workers keep accepting updates without knowing that the maintenance
>> >> tasks are seriously lagging and memory will soon run out. In fact, in
>> >> our application the range servers die many times per week from running
>> >> out of memory, which makes maintenance a heavy burden because
>> >> Hypertable doesn't have usable auto-recovery functionality yet. To
>> >> make range servers more reliable, we need a mechanism to slow the
>> >> clients down.
>> >
>> > Issue 84 has to do with request throttling. Once it's finished,
>> > requests will get held up until the RangeServer is ready to service
>> > them. This will add backpressure to the application generating the
>> > updates, so you should no longer have any out-of-memory errors.
>> >
>> >> On the other hand, why should compactions be handled by background
>> >> maintenance tasks? IMHO, if we did compactions directly in
>> >> RangeServer::update(), a lot of trouble could be saved. It wouldn't
>> >> block the client initiating the current update, as long as the
>> >> response message is sent before the compaction starts. Upcoming
>> >> updates wouldn't block either, because no lock is needed while doing
>> >> the compaction; other workers could handle those updates. The only
>> >> situation that might block client updates is when all workers are busy
>> >> doing compactions, which is exactly when clients should slow down.
>> >
>> > I think our approach to issue 84 will effectively do the same thing.
>> > The nice thing about having a maintenance queue with maintenance
>> > threads is that the compaction and split tasks can be added to the
>> > queue and carried out in priority order.
>> >
>> >> What do you think?
>> >>
>> >> Donald
>> >>
>> >> On Dec 4, 9:32 am, "Doug Judd" <[email protected]> wrote:
>> >> > Hi Donald,
>> >> >
>> >> > I've reproduced this problem and have checked in a fix to the 'next'
>> >> > branch. This was introduced with the major overhaul. I have added a
>> >> > multiple maintenance thread system test to prevent this from
>> >> > happening in the future.
>> >> >
>> >> > BTW, if you do pull the 'next' branch, it has a number of changes
>> >> > that make it incompatible with the previous versions. You'll have to
>> >> > start with a clean database. The 'next' branch will be compatible
>> >> > with 0.9.1.0, which should get released tomorrow.
>> >> >
>> >> > - Doug
>> >> >
>> >> > On Tue, Dec 2, 2008 at 7:10 PM, donald <[email protected]> wrote:
>> >> >
>> >> > > Hi Doug,
>> >> >
>> >> > > I think it's better to open a new thread on this topic :)
>> >> >
>> >> > > The multiple maintenance thread crash is easy to reproduce: just
>> >> > > set Hypertable.RangeServer.MaintenanceThreads=2, start all servers
>> >> > > locally on a single node and run random_write_test 10000000000.
>> >> > > The range server will crash within a minute, but the reason is
>> >> > > rather hard to track down.
>> >> >
>> >> > > What we know so far:
>> >> > > 1. The bug was introduced in version 0.9.0.11; earlier versions
>> >> > > don't have this problem.
>> >> > > 2. According to RangeServer.log, the crash usually happens when
>> >> > > two adjacent ranges are both splitting in two maintenance threads
>> >> > > concurrently. If we forbid this behavior by modifying the
>> >> > > MaintenanceTaskQueue code, the crash goes away, but the reason is
>> >> > > unknown. (Phoenix discovered this.)
>> >> > > 3. Sometimes the Range Server fails at HT_EXPECT
>> >> > > (m_immutable_cache_ptr, Error::FAILED_EXPECTATION); in
>> >> > > AccessGroup::run_compaction(). m_immutable_cache_ptr is set to 0 in
>> >> > > multiple places with m_mutex locked, but it is not always checked
>> >> > > with the lock held, which looks suspicious.
>> >> >
>> >> > > Do you have any ideas based on these facts?
>> >> >
>> >> > > Donald
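For reference, the kind of lock-guarded check suggested in point 3 might look like the sketch below. The member names (m_mutex, m_immutable_cache_ptr) mirror the stack trace earlier in this thread, but the class, types, and logic here are invented for illustration and are not the actual Hypertable AccessGroup implementation:

// Minimal, self-contained sketch of a lock-guarded check around the
// immutable cache, as suggested in point 3 above.  The member names
// (m_mutex, m_immutable_cache_ptr) mirror the stack trace; everything
// else is hypothetical and not Hypertable code.
#include <memory>
#include <mutex>

struct CellCache {
  // stand-in for the frozen (immutable) cell cache that gets compacted
};

class AccessGroupSketch {
public:
  void run_compaction(bool /*major*/) {
    std::shared_ptr<CellCache> immutable_cache;
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      if (!m_immutable_cache_ptr)
        return;  // another maintenance thread already handled this cache;
                 // bail out instead of asserting on an unsynchronized read
      immutable_cache = m_immutable_cache_ptr;  // take a reference under the lock
    }

    // ... merge 'immutable_cache' into a new CellStore here, using only the
    // local reference so that a concurrent thread clearing
    // m_immutable_cache_ptr cannot pull the cache out from under us ...

    std::lock_guard<std::mutex> lock(m_mutex);
    if (m_immutable_cache_ptr == immutable_cache)
      m_immutable_cache_ptr.reset();  // clear it only if it is still ours
  }

  void freeze_cache(std::shared_ptr<CellCache> cache) {
    std::lock_guard<std::mutex> lock(m_mutex);
    m_immutable_cache_ptr = std::move(cache);  // always set under the lock
  }

private:
  std::mutex m_mutex;
  std::shared_ptr<CellCache> m_immutable_cache_ptr;
};

The point of the pattern is simply that every read and write of m_immutable_cache_ptr happens under m_mutex, and a null pointer is treated as "nothing to compact" rather than as a fatal assertion.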

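Along the same lines, a very rough sketch of the backpressure idea discussed for issue 84 (hold incoming updates while too much uncompacted data is buffered) could look like this. The class name, threshold, and API are invented for illustration and say nothing about how Hypertable will actually implement request throttling:

// Hypothetical update-throttling gate: update handlers wait while too much
// uncompacted data is buffered, and maintenance threads release them as
// compactions retire memory.  None of these names come from Hypertable.
#include <condition_variable>
#include <cstdint>
#include <mutex>

class UpdateThrottle {
public:
  explicit UpdateThrottle(uint64_t high_water_bytes)
    : m_high_water(high_water_bytes), m_buffered(0) {}

  // Called by an update handler before buffering an update.  Blocks while
  // buffered-but-uncompacted data exceeds the high-water mark, which pushes
  // backpressure onto the clients generating the updates.
  void admit(uint64_t update_bytes) {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cond.wait(lock, [&] { return m_buffered < m_high_water; });
    m_buffered += update_bytes;
  }

  // Called by a maintenance thread after a compaction frees buffered memory.
  void release(uint64_t compacted_bytes) {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_buffered = (compacted_bytes < m_buffered) ? m_buffered - compacted_bytes : 0;
    }
    m_cond.notify_all();  // wake any update handlers waiting in admit()
  }

private:
  const uint64_t m_high_water;
  uint64_t m_buffered;
  std::mutex m_mutex;
  std::condition_variable m_cond;
};

A gate like this slows writing clients down without giving up the maintenance queue: compactions and splits still run from the queue in priority order, but update handlers calling admit() block whenever maintenance falls too far behind.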