P.S. The memory exhaustion problem will be fixed in the 0.9.2.4 release.

On Fri, Mar 20, 2009 at 10:37 AM, Doug Judd <[email protected]> wrote:
> With the help of Earle Ady, we've found and fixed the large load corruption
> problem with the 0.9.2.2 release. To get the fixed version, please pull the
> latest code from the git repository
> <http://code.google.com/p/hypertable/wiki/SourceCode?tm=4>.
> We'll be releasing 0.9.2.3 soon.
>
> Here's a summary of the problem:
>
> With the fix of issue 246
> <http://code.google.com/p/hypertable/issues/detail?id=246>, compactions
> are now happening regularly, as they should. However, this has added
> substantial load to the system. When a range split and the Master was
> notified of the newly split-off range, the Master selected (round-robin) a
> new RangeServer to own the range. However, due to the increased load on the
> system and a 30-second hardcoded timeout in the Master, the
> RangeServer::load_range() command was timing out (it was taking 32 to 37
> seconds). This timeout was reported back to the originating RangeServer,
> which paused fifteen seconds and tried again. But on the second attempt to
> notify the Master of the newly split-off range, the Master would
> (round-robin) select another RangeServer and invoke
> RangeServer::load_range() on that (different) server. This had the effect
> of the same range being loaded by three different RangeServers, which was
> wreaking havoc with the system. There were two fixes for this problem:
>
> 1. The hardcoded timeout was removed, and (almost) all timeouts in the
> system are based on the "Hypertable.Request.Timeout" property, which now
> has a default value of 180 seconds.
>
> 2. An interim fix was put in place in the Master: upon a
> RangeServer::load_range() failure, the Master remembers which RangeServer
> it attempted the load on. The next time it gets notified and attempts to
> load the same range, it chooses the same RangeServer. If it gets the error
> RANGE_ALREADY_LOADED back, it interprets that as success.
> The reason this fix is interim is that it does not persist the
> Range-to-RangeServer mapping information, so if the Master were to fail at
> an inopportune time and come back up, we'd be subject to the same failure.
> This will get fixed with Issue 74 - Master directed RangeServer recovery
> <http://code.google.com/p/hypertable/issues/detail?id=79>, since the Master
> will have a meta-log and will be able to persist this mapping as
> reconstructible state information.
>
> After we fixed this problem, the next problem that Earle ran into was that
> the RangeServer was exhausting memory and crashing. To fix this, we added
> the following property to the hypertable.cfg file on the machine that was
> doing the LOAD DATA INFILE:
>
> Hypertable.Lib.Mutator.FlushDelay=100
>
> Keep this in mind if you encounter the same problem.
>
> - Doug
