Hi folks, I'm pretty sure this question has been discussed a few times before and addressed to some degree. I am wondering whether there is an active JIRA or a best practice to improve this. I would appreciate a few pointers.
Currently, if a Region Server runs out of memory, checkOOM() is called and that Region Server is killed to protect the Master. For example, assume each row of 'usertable' is ~1K and HBASE_HEAPSIZE is 1GB (the default). Then:

    hbase shell> count 'usertable', INTERVAL => 2000000, CACHE => 1000000

This count will bring down one of the Region Servers, because a cache of 1M rows at ~1K each is roughly the size of the whole heap.

This problem can be fixed by either using a smaller CACHE or increasing the heap size. A 1GB heap is small, and a 1M-row cache is rather large anyway, so this particular example doesn't concern me too much, and the Region Server can be restarted within a minute.

What worries me is this scenario:
1) A production system with 20 Region Servers, each with a reasonable heap size (8~16GB), where increasing the heap dynamically is not a good idea without new physical memory.
2) A few hundred client threads, each running a reasonable application, but together requesting a large amount of memory.

At some point the heap limit is reached on one of the Region Servers, which brings it down. That alone is not too bad, as we still have 19 up. However, the clients can (and most likely will) resubmit their jobs, just as I can resubmit the count command with two keystrokes, and that brings down the next Region Server. In this case I can't stop the client requests, and I can't add new hardware immediately (at least not within minutes). The only thing I can do is watch the whole cluster be brought down by the domino effect.

With that, I am wondering:
1) Is there an active item to prevent the first Region Server from going down? For example, using 90% of the heap size as a threshold?
2) Or is there a way to prevent clients from resubmitting jobs if the system is unhealthy? For example, queueing the jobs if a few Region Servers are down?

I was able to find some discussions from 2009 and 2011 in the email archive. Is there anything active or new? I am new to this community and would really appreciate any input.

Thanks,
Demai
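
P.S. To make the arithmetic behind the count example concrete, here is a rough Java sketch. It is only back-of-the-envelope math, not HBase code: the class and method names (ScanCacheEstimate, estimateBatchBytes, exceedsThreshold) are made up for illustration, and it assumes one cached scanner batch costs about (rows x average row size), ignoring per-cell and object overhead, so real usage would be higher.

```java
// Hypothetical helper, not part of HBase: estimates how much heap one
// cached scanner batch would need and compares it to a heap threshold.
class ScanCacheEstimate {

    // Lower-bound estimate: cached rows times average row size in bytes.
    // Throws ArithmeticException if the product overflows a long.
    static long estimateBatchBytes(long cachedRows, long avgRowBytes) {
        return Math.multiplyExact(cachedRows, avgRowBytes);
    }

    // True if the batch alone is larger than the given fraction of the heap
    // (for example 0.90 for the 90% threshold idea from the question).
    static boolean exceedsThreshold(long batchBytes, long heapBytes, double fraction) {
        return batchBytes > (long) (heapBytes * fraction);
    }

    public static void main(String[] args) {
        long heapBytes = 1024L * 1024 * 1024;                 // 1GB heap, as in the example
        long batch = estimateBatchBytes(1_000_000L, 1024L);   // CACHE => 1000000, ~1K rows

        System.out.printf("batch = %d bytes, %.0f%% of heap%n",
                batch, 100.0 * batch / heapBytes);
        System.out.println("over 90% threshold: "
                + exceedsThreshold(batch, heapBytes, 0.90));
    }
}
```

With CACHE => 1000 the same estimate is about 1MB, which is why lowering CACHE avoids the problem in the small example. It does not help with the many-clients case, where the sum across clients is what matters.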
