Hey Josh,

Thanks for the logs. Any chance the chunkserver dropped a core file when it
died? If you can load that into gdb and get me a backtrace, that'd be great.
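
If there is a core, something like this should get me what I need (the binary
and core paths below are just placeholders -- adjust them for your install):

  gdb /path/to/chunkserver /path/to/core
  (gdb) bt
  (gdb) thread apply all bt

The full-thread backtrace is the most useful part, since the problem may not
be on the thread that actually crashed.
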
Sriram

On Sun, Apr 19, 2009 at 1:03 PM, Josh Adams <[email protected]> wrote:
> Hey Sriram, thanks for the quick response! Looks like I saw this late
> since you caught me on IM to get the logs.
>
> Josh
>
> On Sun, Apr 19, 2009 at 12:18 PM, Sriram Rao <[email protected]> wrote:
>> Josh,
>>
>> I'd like to help you out. What'd be good is if you can mail me the
>> chunkserver logs (the one that has the problem). The kfs-broker logs
>> attached here are empty.
>>
>> Sriram
>>
>> On Sun, Apr 19, 2009 at 10:59 AM, Josh Adams <[email protected]> wrote:
>>> Hi Doug,
>>>
>>> This morning something happened which caused the root RangeServer to
>>> go down for good (even after multiple attempts to start it with
>>> Hypertable.CommitLog.SkipErrors=true). There was no excessive load on
>>> the system or memory exhaustion this time because I was not performing
>>> heavy updates; it was just rolling along with realtime and all of a
>>> sudden croaked. I've narrowed it down to a likely culprit though...
>>>
>>> When I approached the wreckage I found at least one KFS chunkserver
>>> exhibiting signs similar to those of a bug recently reported to the
>>> kosmosfs-users list, which results in the chunkserver's vsize bloating
>>> to 50-100GB and the server locking up at 100% CPU. Since the error in
>>> the root RangeServer log points to a DFS I/O error, I feel confident
>>> that these two occurrences are probably not coincidence.
>>>
>>> This, however, makes my life a little more difficult, since now I have
>>> to find a way to re-index a large amount of data to prepare for a
>>> meeting early this week with the founders, which is supposed to be the
>>> big show-and-tell session to prove Hypertable's worthiness to the
>>> company. I can accept that this is a reasonable setback considering the
>>> risk I took with my decision to go with the lesser-tested kosmosBroker
>>> here, but I'm frustrated with how things are going nevertheless.
>>>
>>> I'm now going to fire up the next iteration on HDFS. Let me know if
>>> you can think of any suggestions.
>>>
>>> Cheers,
>>> Josh
>>>
>>> On Wed, Apr 15, 2009 at 9:52 PM, Josh Adams <[email protected]> wrote:
>>>> Hey Doug,
>>>>
>>>> Yes, that's exactly what was happening. I've since rebuilt everything
>>>> with tcmalloc/google-perftools according to the docs and the memory
>>>> usage has become more manageable, but I still see high consumption and
>>>> eventual memory exhaustion during heavy updates.
>>>>
>>>> A new problem I've encountered with the tcmalloc-built binaries is
>>>> that the ThriftBroker hangs soon after it completes some random number
>>>> of reads or updates, usually within a minute or two of activity. I
>>>> tried using the non-tcmalloc ThriftBroker binary with the currently
>>>> running tcmalloc master/rangeservers/kosmosbrokers and it still hung.
>>>> I'm going to try going back and starting a fresh Hypertable instance
>>>> with the non-tcmalloc binaries for everything to see if the problem
>>>> goes away. It could be some changes to our app code causing the
>>>> ThriftBroker hangs; we'll see.
>>>>
>>>> Thanks for the update btw! :-)
>>>>
>>>> Josh
>>>>
>>>> On Wed, Apr 15, 2009 at 9:31 PM, Doug Judd <[email protected]> wrote:
>>>>> Hi Josh,
>>>>>
>>>>> Is it possible that the system underwent heavy update activity during
>>>>> that time period? We don't have request throttling in place yet
>>>>> (should be out next week), so it is possible for the RangeServer to
>>>>> exhaust memory under heavy update workloads. It looks like the commit
>>>>> log got truncated/corrupted when the machine died. You can tell the
>>>>> RangeServer to skip commit log errors with the following property:
>>>>>
>>>>> Hypertable.CommitLog.SkipErrors=true
>>>>>
>>>>> The data in the commit log that is being skipped will most likely be
>>>>> lost.
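>>>>>
>>>>> For example (assuming a default layout -- the exact path depends on
>>>>> your install), add the line to the properties file you start the
>>>>> servers with and restart the RangeServer:
>>>>>
>>>>>   # conf/hypertable.cfg
>>>>>   Hypertable.CommitLog.SkipErrors=true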
>>>>>
>>>>> - Doug
>>>>>
>>>>> On Mon, Apr 13, 2009 at 1:10 PM, Josh Adams <[email protected]> wrote:
>>>>>>
>>>>>> On Mon, Apr 13, 2009 at 9:58 AM, Doug Judd <[email protected]> wrote:
>>>>>> > No, it shouldn't. One thing that might help is to install tcmalloc
>>>>>> > (google-perftools) and then re-build. You'll need to have tcmalloc
>>>>>> > installed in all your runtime environments.
>>>>>>
>>>>>> Ok, thanks. I'll try that out hopefully this week and let you know.
>>>>>>
>>>>>> > 157 on it a while back. It would be interesting to know if the disk
>>>>>> > subsystems on any of your machines are getting saturated during this
>>>>>> > low throughput condition. If so, then there probably is not much we
>>>>>> > can do
>>>>>>
>>>>>> Good point, I'll keep an eye on that.
>>>>>>
>>>>>> I was out of town on a short trip over the weekend and I wasn't
>>>>>> watching our Hypertable instance very closely. During the early
>>>>>> morning hours on Saturday it looks like each of the four machines
>>>>>> running RangeServer/kosmosBroker/ThriftBroker had its memory spike
>>>>>> heavily for about an hour. The root RangeServer started swapping and
>>>>>> the machine went down later that day. I can't start the instance back
>>>>>> up at the moment because the root RangeServer is complaining about
>>>>>> this error and dies when I try starting it:
>>>>>>
>>>>>> 1239651998 ERROR Hypertable.RangeServer : load_next_valid_header
>>>>>> (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/Hypertable/Lib/CommitLogBlockStream.cc:148):
>>>>>> Hypertable::Exception: Error reading 34 bytes from DFS fd 1057 -
>>>>>> HYPERTABLE failed expectation
>>>>>>     at virtual size_t Hypertable::DfsBroker::Client::read(int32_t, void*, size_t)
>>>>>>     (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/DfsBroker/Lib/Client.cc:258)
>>>>>>     at size_t Hypertable::ClientBufferedReaderHandler::read(void*, size_t)
>>>>>>     (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/DfsBroker/Lib/ClientBufferedReaderHandler.cc:161):
>>>>>>     empty queue
>>>>>>
>>>>>> I've attached a file containing the relevant errors at the end of its
>>>>>> log and also the whole kosmosBroker log file for that startup attempt.
>>>>>>
>>>>>> Cheers,
>>>>>> Josh
