Hey Sriram,

To follow up on our IM conversation a few minutes ago, I'll be adding the
debugging line to the chunkserver configs:

  chunkServer.loglevel = DEBUG

And I'll send you the debug log if/when I can repro.

Thanks!
Josh

On Sun, Apr 19, 2009 at 4:01 PM, Sriram Rao <[email protected]> wrote:
> Hey Josh,
>
> Thanks for the logs. Any chance the chunkserver dropped a core file
> when it died? If you can load that into gdb and get me a backtrace,
> that'd be great.
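>
> Something like this should do it, assuming the core file landed in the
> chunkserver's working directory and the binary was built with symbols
> (the paths below are placeholders):
>
>   $ gdb /path/to/chunkserver /path/to/core   # paths are placeholders
>   (gdb) bt
>   (gdb) thread apply all bt
>
> "bt" gives the crashing thread's backtrace; "thread apply all bt" dumps
> every thread, which would also help if it was spinning at 100% CPU.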
>
> Sriram
>
> On Sun, Apr 19, 2009 at 1:03 PM, Josh Adams <[email protected]> wrote:
>> Hey Sriram, thanks for the quick response! Looks like I saw this late,
>> since you caught me on IM to get the logs.
>>
>> Josh
>>
>> On Sun, Apr 19, 2009 at 12:18 PM, Sriram Rao <[email protected]> wrote:
>>> Josh,
>>>
>>> I'd like to help you out. What'd be good is if you could mail me the
>>> chunkserver logs (from the one that has the problem). The kfs-broker
>>> logs attached here are empty.
>>>
>>> Sriram
>>>
>>> On Sun, Apr 19, 2009 at 10:59 AM, Josh Adams <[email protected]> wrote:
>>>> Hi Doug,
>>>>
>>>> This morning something happened that caused the root RangeServer to
>>>> go down for good (even after multiple attempts to start it with
>>>> Hypertable.CommitLog.SkipErrors=true). There was no excessive load
>>>> on the system or memory exhaustion this time, because I was not
>>>> performing heavy updates; it was just rolling along with realtime
>>>> and all of a sudden croaked. I've narrowed it down to a likely
>>>> culprit though...
>>>>
>>>> When I approached the wreckage I found at least one KFS chunkserver
>>>> exhibiting signs similar to those of a bug recently reported on the
>>>> kosmosfs-users list, which causes the chunkserver's vsize to bloat
>>>> to 50-100GB and the server to lock up at 100% CPU. Since the error
>>>> in the root RangeServer log points to a DFS I/O error, I feel
>>>> confident that these two occurrences are probably not a coincidence.
>>>>
>>>> This, however, makes my life a little more difficult, since now I
>>>> have to find a way to re-index a large amount of data to prepare for
>>>> a meeting early this week with the founders, which is supposed to be
>>>> the big show-and-tell session to prove Hypertable's worthiness to
>>>> the company. I can accept that this is a reasonable setback
>>>> considering the risk I took in deciding to go with the lesser-tested
>>>> kosmosBroker here, but I'm frustrated with how things are going
>>>> nevertheless.
>>>>
>>>> I'm now going to fire up the next iteration on HDFS. Let me know if
>>>> you can think of any suggestions.
>>>>
>>>> Cheers,
>>>> Josh
>>>>
>>>> On Wed, Apr 15, 2009 at 9:52 PM, Josh Adams <[email protected]> wrote:
>>>>> Hey Doug,
>>>>>
>>>>> Yes, that's exactly what was happening. I've since rebuilt
>>>>> everything with tcmalloc/google-perftools according to the docs,
>>>>> and the memory usage has become more manageable, but I still see
>>>>> high consumption and eventual memory exhaustion during heavy
>>>>> updates.
>>>>>
>>>>> A new problem I've encountered with the tcmalloc-built binaries is
>>>>> that the ThriftBroker hangs soon after it completes some random
>>>>> number of reads or updates, usually within a minute or two of
>>>>> activity. I tried using the non-tcmalloc ThriftBroker binary with
>>>>> the currently running tcmalloc master/rangeservers/kosmosbrokers
>>>>> and it still hung. I'm going to try going back and starting a fresh
>>>>> Hypertable instance with the non-tcmalloc binaries for everything
>>>>> to see if the problem goes away. It could be some changes to our
>>>>> app code causing the ThriftBroker hangs; we'll see.
>>>>>
>>>>> Thanks for the update btw! :-)
>>>>>
>>>>> Josh
>>>>>
>>>>> On Wed, Apr 15, 2009 at 9:31 PM, Doug Judd <[email protected]> wrote:
>>>>>> Hi Josh,
>>>>>>
>>>>>> Is it possible that the system underwent heavy update activity
>>>>>> during that time period? We don't have request throttling in place
>>>>>> yet (should be out next week), so it is possible for the
>>>>>> RangeServer to exhaust memory under heavy update workloads. It
>>>>>> looks like the commit log got truncated/corrupted when the machine
>>>>>> died. You can tell the RangeServer to skip commit log errors with
>>>>>> the following property:
>>>>>>
>>>>>>   Hypertable.CommitLog.SkipErrors=true
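>>>>>>
>>>>>> For example (a sketch; it assumes your build accepts properties as
>>>>>> command-line arguments, otherwise just add the line above to your
>>>>>> hypertable.cfg):
>>>>>>
>>>>>>   # assumes command-line property passing works in your build
>>>>>>   $ bin/Hypertable.RangeServer --Hypertable.CommitLog.SkipErrors=true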
>>>>>>
>>>>>> The data in the commit log that gets skipped will most likely be
>>>>>> lost.
>>>>>>
>>>>>> - Doug
>>>>>>
>>>>>> On Mon, Apr 13, 2009 at 1:10 PM, Josh Adams <[email protected]> wrote:
>>>>>>>
>>>>>>> On Mon, Apr 13, 2009 at 9:58 AM, Doug Judd <[email protected]> wrote:
>>>>>>> > No, it shouldn't. One thing that might help is to install
>>>>>>> > tcmalloc (google-perftools) and then re-build. You'll need to
>>>>>>> > have tcmalloc installed in all your runtime environments.
>>>>>>>
>>>>>>> OK, thanks. I'll try that out, hopefully this week, and let you
>>>>>>> know.
>>>>>>>
>>>>>>> > 157 on it a while back. It would be interesting to know if the
>>>>>>> > disk subsystems on any of your machines are getting saturated
>>>>>>> > during this low throughput condition. If so, then there probably
>>>>>>> > is not much we can do
>>>>>>>
>>>>>>> Good point, I'll keep an eye on that.
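>>>>>>>
>>>>>>> Something like iostat (from the sysstat package) is probably the
>>>>>>> simplest way to check; if %util sits near 100 on the data disks
>>>>>>> while throughput stays low, the disks are the bottleneck:
>>>>>>>
>>>>>>>   # extended per-device stats every 5 seconds; watch %util
>>>>>>>   $ iostat -x 5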
>>>>>>>
>>>>>>> I was out of town on a short trip over the weekend, so I wasn't
>>>>>>> watching our Hypertable instance very closely. During the early
>>>>>>> morning hours on Saturday it looks like each of the four machines
>>>>>>> running RangeServer/kosmosBroker/ThriftBroker had its memory
>>>>>>> spike heavily for about an hour. The root RangeServer started
>>>>>>> swapping, and the machine went down later that day. I can't start
>>>>>>> the instance back up at the moment because the root RangeServer
>>>>>>> complains about this error and dies when I try starting it:
>>>>>>>
>>>>>>> 1239651998 ERROR Hypertable.RangeServer : load_next_valid_header
>>>>>>> (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/Hypertable/Lib/CommitLogBlockStream.cc:148):
>>>>>>> Hypertable::Exception: Error reading 34 bytes from DFS fd 1057 -
>>>>>>> HYPERTABLE failed expectation
>>>>>>>   at virtual size_t Hypertable::DfsBroker::Client::read(int32_t, void*, size_t)
>>>>>>>   (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/DfsBroker/Lib/Client.cc:258)
>>>>>>>   at size_t Hypertable::ClientBufferedReaderHandler::read(void*, size_t)
>>>>>>>   (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/DfsBroker/Lib/ClientBufferedReaderHandler.cc:161):
>>>>>>>   empty queue
>>>>>>>
>>>>>>> I've attached a file containing the relevant errors at the end of
>>>>>>> its log, and also the whole kosmosBroker log file for that startup
>>>>>>> attempt.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Josh