Hey Sriram,

To follow up on our IM conversation a few minutes ago, I'll be adding
the debugging line to the chunkserver configs:

chunkServer.loglevel = DEBUG

And I'll send you the debug log if/when I can repro.
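
If it drops a core when it dies this time I'll also pull a backtrace
before sending things over.  Roughly what I have in mind, with the
paths below just placeholders for wherever the chunkserver binary and
core file actually land on our boxes:

  gdb /path/to/chunkserver /path/to/core
  (gdb) bt
  (gdb) thread apply all bt

That also assumes core dumps are enabled on those nodes (ulimit -c
unlimited), which I'll double-check.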

Thanks!

Josh

On Sun, Apr 19, 2009 at 4:01 PM, Sriram Rao <[email protected]> wrote:
> Hey Josh,
>
> Thanks for the logs. Any chance the chunkserver dropped a core file
> when it died?  If you can load that into gdb and get me backtrace,
> that'd be great.
>
> Sriram
>
> On Sun, Apr 19, 2009 at 1:03 PM, Josh Adams <[email protected]> wrote:
>> Hey Sriram, thanks for the quick response!  Looks like I saw this late
>> since you caught me on IM to get the logs.
>>
>> Josh
>>
>> On Sun, Apr 19, 2009 at 12:18 PM, Sriram Rao <[email protected]> wrote:
>>> Josh,
>>>
>>> I'd like to help you out.  What'd be good is if you can mail me the
>>> chunkserver logs (the one that has the problem).  The kfs-broker logs
>>> attached here are empty.
>>>
>>> Sriram
>>>
>>> On Sun, Apr 19, 2009 at 10:59 AM, Josh Adams <[email protected]> wrote:
>>>> Hi Doug,
>>>>
>>>> This morning something happened which caused the root RangeServer to
>>>> go down for good (even after multiple attempts to start it with
>>>> Hypertable.CommitLog.SkipErrors=true).  There was no excessive load on
>>>> the system or memory exhaustion this time because I was not performing
>>>> heavy updates; it was just rolling along with realtime and all of a
>>>> sudden croaked.  I've narrowed it down to a likely culprit though...
>>>>
>>>> When I approached the wreckage I found at least one KFS chunkserver
>>>> exhibiting signs similar to those of a bug recently reported to the
>>>> kosmosfs-users list, which results in the chunkserver's vsize bloating
>>>> to 50-100GB and the server locking up at 100% CPU.  Since the error in
>>>> the root RangeServer log points to a DFS I/O error, I feel confident
>>>> that these two occurrences are probably not a coincidence.
>>>>
>>>> This, however, makes my life a little more difficult since now I have
>>>> to find a way to re-index a large amount of data to prepare for a
>>>> meeting early this week with the founders, which is supposed to be the
>>>> big show-and-tell session to prove Hypertable's worthiness to the
>>>> company.  I could agree that this is a reasonable setback considering
>>>> the risk I took with my decision to go with the lesser-tested
>>>> kosmosBroker here, but I'm frustrated with how things are going
>>>> nevertheless.
>>>>
>>>> I'm now going to fire up the next iteration on HDFS.  Let me know if
>>>> you can think of any suggestions.
>>>>
>>>> Cheers,
>>>> Josh
>>>>
>>>> On Wed, Apr 15, 2009 at 9:52 PM, Josh Adams <[email protected]> wrote:
>>>>> Hey Doug,
>>>>>
>>>>> Yes, that's exactly what was happening.  I've since rebuilt everything
>>>>> with tcmalloc/google-perftools according to the docs and the memory
>>>>> usage has become more manageable but I still see high consumption and
>>>>> eventual memory exhaustion during heavy updates.
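
(For anyone else needing to do this, the rebuild itself is nothing
exotic; on a Debian-style box it's roughly the following, with the
package name and paths adjusted to wherever your source and build
trees live:

  sudo apt-get install libgoogle-perftools-dev
  cd ~/build/hypertable
  cmake ~/src/hypertable
  make -j4 && make install

The Hypertable build should pick up tcmalloc on its own once the
library is installed; check the docs if it doesn't.)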
>>>>>
>>>>> A new problem I've encountered with the tcmalloc-built binaries is
>>>>> that the ThriftBroker hangs soon after it completes some random number
>>>>> of reads or updates, usually within a minute or two of activity.  I
>>>>> tried using the non-tcmalloc ThriftBroker binary with the currently
>>>>> running tcmalloc master/rangeservers/kosmosbrokers and it still hung.
>>>>> I'm going to go back and start a fresh Hypertable instance with the
>>>>> non-tcmalloc binaries for everything to see if the problem goes away.
>>>>> It could be some change to our app code causing the ThriftBroker
>>>>> hangs; we'll see.
>>>>>
>>>>> Thanks for the update btw! :-)
>>>>>
>>>>> Josh
>>>>>
>>>>> On Wed, Apr 15, 2009 at 9:31 PM, Doug Judd <[email protected]> wrote:
>>>>>> Hi Josh,
>>>>>>
>>>>>> Is it possible that the system underwent heavy update activity during
>>>>>> that time period?  We don't have request throttling in place yet
>>>>>> (should be out next week), so it is possible for the RangeServer to
>>>>>> exhaust memory under heavy update workloads.  It looks like the commit
>>>>>> log got truncated/corrupted when the machine died.  You can tell the
>>>>>> RangeServer to skip commit log errors with the following property:
>>>>>>
>>>>>> Hypertable.CommitLog.SkipErrors=true
>>>>>>
>>>>>> This data in the commit log that is being skipped will most likely be
>>>>>> lost.
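
(Side note from me: the way I've been setting this is just adding that
line to the config file the servers read, conf/hypertable.cfg in a
default layout I believe, and restarting the RangeServer; it can come
back out once the server starts cleanly, since the skipped commit log
data is gone either way.)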
>>>>>>
>>>>>> - Doug
>>>>>>
>>>>>> On Mon, Apr 13, 2009 at 1:10 PM, Josh Adams <[email protected]> wrote:
>>>>>>>
>>>>>>> On Mon, Apr 13, 2009 at 9:58 AM, Doug Judd <[email protected]> wrote:
>>>>>>> > No, it shouldn't.  One thing that might help is to install tcmalloc
>>>>>>> > (google-perftools) and then re-build.  You'll need to have tcmalloc
>>>>>>> > installed in all your runtime environments.
>>>>>>>
>>>>>>> Ok thanks, I'll try that out hopefully this week and let you know.
>>>>>>>
>>>>>>> > 157 on it a while back.  It would be interesting to know if the disk
>>>>>>> > subsystems on any of your machines are getting saturated during this
>>>>>>> > low throughput condition.  If so, then there probably is not much we
>>>>>>> > can do
>>>>>>>
>>>>>>> Good point, I'll keep an eye on that.
>>>>>>>
>>>>>>> I was out of town on a short trip over the weekend and I wasn't
>>>>>>> watching our Hypertable instance very closely.  During the early
>>>>>>> morning hours on Saturday it looks like each of the four machines
>>>>>>> running RangeServer/kosmosBroker/ThriftBroker had their memory spike
>>>>>>> heavily for about an hour.  The root RangeServer started swapping and
>>>>>>> the machine went down later that day.  I can't start the instance back
>>>>>>> up at the moment because the root RangeServer complains about this
>>>>>>> error and dies when I try starting it:
>>>>>>>
>>>>>>> 1239651998 ERROR Hypertable.RangeServer : load_next_valid_header
>>>>>>> (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/Hypertable/Lib/CommitLogBlockStream.cc:148):
>>>>>>> Hypertable::Exception: Error reading 34 bytes from DFS fd 1057 -
>>>>>>> HYPERTABLE failed expectation
>>>>>>>        at virtual size_t Hypertable::DfsBroker::Client::read(int32_t, void*, size_t)
>>>>>>> (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/DfsBroker/Lib/Client.cc:258)
>>>>>>>        at size_t Hypertable::ClientBufferedReaderHandler::read(void*, size_t)
>>>>>>> (/data/tmp/dev/src/hypertable/6d5fdd1/src/cc/DfsBroker/Lib/ClientBufferedReaderHandler.cc:161): empty queue
>>>>>>>
>>>>>>> I've attached a file containing the relevant errors at the end of its
>>>>>>> log and also the whole kosmosBroker log file for that startup attempt.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Josh
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

