Hey Sriram, thanks for the quick response! Looks like I'm seeing this
late, since you'd already caught me on IM to get the logs.
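In the meantime, for anyone else chasing the chunkserver bloat, below
is a rough, untested sketch of the kind of watchdog I have in mind for
flagging a chunkserver whose vsize is ballooning. The process name
"chunkserver" and the ~50GB threshold are assumptions, so adjust both
for your install.

    import os
    import time

    PROC_NAME = "chunkserver"        # assumed process name, adjust for your install
    THRESHOLD_KB = 50 * 1024 * 1024  # ~50 GB in kB, the low end of the bloat
                                     # reported on kosmosfs-users

    def read_status(pid):
        # Parse /proc/<pid>/status into a dict of "Key" -> "value" strings.
        fields = {}
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                key, _, value = line.partition(":")
                fields[key] = value.strip()
        return fields

    while True:
        for entry in os.listdir("/proc"):
            if not entry.isdigit():
                continue
            try:
                status = read_status(int(entry))
            except OSError:
                continue  # process exited between listdir() and open()
            if status.get("Name") != PROC_NAME:
                continue
            # kernel threads have no VmSize line, so default to "0 kB"
            vmsize_kb = int(status.get("VmSize", "0 kB").split()[0])
            if vmsize_kb > THRESHOLD_KB:
                print("pid %s vsize %d kB, looks like the runaway"
                      % (entry, vmsize_kb))
        time.sleep(60)

Nothing fancy, it just polls /proc once a minute (Linux only).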
Josh

On Sun, Apr 19, 2009 at 12:18 PM, Sriram Rao <[email protected]> wrote:
> Josh,
>
> I'd like to help you out. What'd be good is if you can mail me the
> chunkserver logs (the one that has the problem). The kfs-broker logs
> attached here are empty.
>
> Sriram
>
> On Sun, Apr 19, 2009 at 10:59 AM, Josh Adams <[email protected]> wrote:
>> Hi Doug,
>>
>> This morning something happened which caused the root RangeServer to
>> go down for good (even after multiple attempts to start it with
>> Hypertable.CommitLog.SkipErrors=true). There was no excessive load on
>> the system or memory exhaustion this time, because I was not
>> performing heavy updates; it was just rolling along with realtime and
>> all of a sudden croaked. I've narrowed it down to a likely culprit,
>> though...
>>
>> When I approached the wreckage, I found at least one KFS chunkserver
>> exhibiting signs similar to those of a bug recently reported to the
>> kosmosfs-users list, which causes the chunkserver's vsize to bloat to
>> 50-100GB and the server to lock up at 100% CPU. Since the error in
>> the root RangeServer log points to a DFS I/O error, I feel confident
>> that these two occurrences are probably not a coincidence.
>>
>> This, however, makes my life a little more difficult, since now I
>> have to find a way to re-index a large amount of data to prepare for
>> a meeting early this week with the founders, which is supposed to be
>> the big show-and-tell session to prove Hypertable's worthiness to the
>> company. I can accept that this is a reasonable setback considering
>> the risk I took with my decision to go with the lesser-tested
>> kosmosBroker here, but I'm frustrated with how things are going
>> nevertheless.
>>
>> I'm now going to fire up the next iteration on HDFS. Let me know if
>> you can think of any suggestions.
>>
>> Cheers,
>> Josh
>>
>> On Wed, Apr 15, 2009 at 9:52 PM, Josh Adams <[email protected]> wrote:
>>> Hey Doug,
>>>
>>> Yes, that's exactly what was happening. I've since rebuilt
>>> everything with tcmalloc/google-perftools according to the docs, and
>>> the memory usage has become more manageable, but I still see high
>>> consumption and eventual memory exhaustion during heavy updates.
>>>
>>> A new problem I've encountered with the tcmalloc-built binaries is
>>> that the ThriftBroker hangs soon after it completes some random
>>> number of reads or updates, usually within a minute or two of
>>> activity. I tried using the non-tcmalloc ThriftBroker binary with
>>> the currently running tcmalloc master/rangeservers/kosmosbrokers and
>>> it still hung. I'm going to try going back and starting a fresh
>>> Hypertable instance with the non-tcmalloc binaries for everything to
>>> see if the problem goes away. It could be some changes to our app
>>> code causing the ThriftBroker hangs; we'll see.
>>>
>>> Thanks for the update, btw! :-)
>>>
>>> Josh
>>>
>>> On Wed, Apr 15, 2009 at 9:31 PM, Doug Judd <[email protected]> wrote:
>>>> Hi Josh,
>>>>
>>>> Is it possible that the system underwent heavy update activity
>>>> during that time period? We don't have request throttling in place
>>>> yet (should be out next week), so it is possible for the
>>>> RangeServer to exhaust memory under heavy update workloads. It
>>>> looks like the commit log got truncated/corrupted when the machine
>>>> died.
>>>> You can tell the RangeServer to skip commit log errors with the
>>>> following property:
>>>>
>>>> Hypertable.CommitLog.SkipErrors=true
>>>>
>>>> The data in the commit log that gets skipped will most likely be
>>>> lost.
>>>>
>>>> - Doug
>>>>
>>>> On Mon, Apr 13, 2009 at 1:10 PM, Josh Adams <[email protected]> wrote:
>>>>>
>>>>> On Mon, Apr 13, 2009 at 9:58 AM, Doug Judd <[email protected]> wrote:
>>>>> > No, it shouldn't. One thing that might help is to install
>>>>> > tcmalloc (google-perftools) and then re-build. You'll need to
>>>>> > have tcmalloc installed in all your runtime environments.
>>>>>
>>>>> Ok, thanks. I'll try that out, hopefully this week, and let you
>>>>> know.
>>>>>
>>>>> > 157 on it a while back. It would be interesting to know if the
>>>>> > disk subsystems on any of your machines are getting saturated
>>>>> > during this low throughput condition. If so, then there probably
>>>>> > is not much we can do
>>>>>
>>>>> Good point, I'll keep an eye on that.
>>>>>
>>>>> I was out of town on a short trip over the weekend and wasn't
>>>>> watching our Hypertable instance very closely. During the early
>>>>> morning hours on Saturday, it looks like each of the four machines
>>>>> running RangeServer/kosmosBroker/ThriftBroker had its memory spike
>>>>> heavily for about an hour. The root RangeServer started swapping
>>>>> and the machine went down later that day. I can't start the
>>>>> instance back up at the moment because the root RangeServer
>>>>> complains about this error and dies when I try starting it:
>>>>>
>>>>> 1239651998 ERROR Hypertable.RangeServer : load_next_valid_header
>>>>> (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/Hypertable/Lib/CommitLogBlockStream.cc:148):
>>>>> Hypertable::Exception: Error reading 34 bytes from DFS fd 1057 -
>>>>> HYPERTABLE failed expectation
>>>>>   at virtual size_t Hypertable::DfsBroker::Client::read(int32_t, void*, size_t)
>>>>> (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/DfsBroker/Lib/Client.cc:258)
>>>>>   at size_t Hypertable::ClientBufferedReaderHandler::read(void*, size_t)
>>>>> (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/DfsBroker/Lib/ClientBufferedReaderHandler.cc:161):
>>>>> empty queue
>>>>>
>>>>> I've attached a file containing the relevant errors at the end of
>>>>> its log, and also the whole kosmosBroker log file for that startup
>>>>> attempt.
>>>>>
>>>>> Cheers,
>>>>> Josh
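P.S. Doug, re: your earlier question about disk saturation: here's a
rough, untested sketch of how I might approximate an iostat-style
%util figure straight from /proc/diskstats on the KFS boxes. The
device names are placeholders for whatever actually backs the chunk
directories.

    import time

    DEVICES = ["sda", "sdb"]  # placeholder device names
    INTERVAL = 5              # seconds between samples

    def io_ms(device):
        # Field 10 after the device name in /proc/diskstats is the total
        # milliseconds the device has spent doing I/O.
        with open("/proc/diskstats") as f:
            for line in f:
                parts = line.split()
                if parts[2] == device:
                    return int(parts[12])
        raise ValueError("no such device: %s" % device)

    prev = {d: io_ms(d) for d in DEVICES}
    while True:
        time.sleep(INTERVAL)
        for d in DEVICES:
            now = io_ms(d)
            busy = 100.0 * (now - prev[d]) / (INTERVAL * 1000)
            print("%s ~%.0f%% busy" % (d, min(busy, 100.0)))
            prev[d] = now

If that busy number sits near 100% during the low-throughput stretches,
that would point at the disks.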
