Thanks Doug, I'll give it a try.

On Wed, Sep 10, 2008 at 4:53 PM, Doug Judd <[EMAIL PROTECTED]> wrote:
> Hi Josh,
>
> If you're just trying to get the system up and running and don't mind if you potentially lose some data, you could try this. Do a directory listing in the /hypertable/tables/X/default/AB2A0D28DE6B77FFDD6C72AF directory and find the newest CellStore file csNNNN and remember the creation time t. Then, in the log/user/ directory of the server that is handling all of the load, delete all of the log fragments that have a creation time that is less than t. I think that should actually work without data loss.
>
> - Doug
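For anyone else who needs to follow this recipe, here is a minimal C++ sketch of the selection step only (flag every log fragment created before the newest CellStore). The FragmentInfo struct, the hard-coded listing, and the timestamps are illustrative assumptions, not Hypertable API; the real listing and the deletions would still be done against the DFS by hand.

    // Sketch of the fragment-selection rule described above: given the
    // creation time t of the newest CellStore csNNNN, every commit-log
    // fragment in log/user/ created before t is a candidate for deletion.
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    struct FragmentInfo {
      std::string path;        // e.g. ".../log/user/1234"
      int64_t create_time;     // creation time reported by the DFS
    };

    // Fragments created strictly before the newest CellStore are deletable.
    std::vector<FragmentInfo>
    deletable_fragments(const std::vector<FragmentInfo> &fragments,
                        int64_t newest_cellstore_ctime) {
      std::vector<FragmentInfo> result;
      for (const FragmentInfo &frag : fragments)
        if (frag.create_time < newest_cellstore_ctime)
          result.push_back(frag);
      return result;
    }

    int main() {
      // Made-up listing of log/user/; in reality this comes from the DFS.
      std::vector<FragmentInfo> fragments = {
        {"log/user/0", 100}, {"log/user/1", 200}, {"log/user/2", 300}};
      int64_t newest_cellstore_ctime = 250;   // creation time t of csNNNN
      for (const FragmentInfo &f :
           deletable_fragments(fragments, newest_cellstore_ctime))
        std::cout << "candidate for deletion: " << f.path << "\n";
      return 0;
    }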
> On Wed, Sep 10, 2008 at 4:37 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>
>> I'm still trying to get my Hypertable cluster running again. After seeing half the RangeServers die because they lost their Hyperspace session when loading with 5 concurrent clients, I decided to take Donald's advice and give the Master processes (Hypertable+Hyperspace) a dedicated node. Then I tried restarting the failed RangeServers. This time the one with the 350+ GB of commit logs spent 6 hours trying to recover before I noticed it had grown to 15 GB of memory (7 GB RSS). I shot it down since it was just thrashing at that point.
>>
>> So now I seem to have two problems:
>>
>> 1) Log cleanup doesn't seem to be working, so I have to replay 350+ GB when I restart.
>>
>> 2) When replaying the logs, I run out of memory.
>>
>> I've been trying to figure out #2, since I can no longer keep the servers running long enough to address #1. It looks like all compactions are deferred until the recovery is done. Commits get loaded into memory until the machine runs out, then boom. I don't have the best understanding of the recovery strategy, but I'd guess that fixing this problem would require some major surgery.
>>
>> One argument is that #2 isn't worth fixing. If #1 were working properly, the system wouldn't get itself into such a bad state. The recovery can just assume there's enough memory most of the time.
>>
>> Most of the time is not all of the time, though. I can imagine some normal use cases where this problem would pop up:
>>
>> A) One server is falling behind on compactions due to hardware issues or resource contention and it eventually crashes for lack of memory. When another server comes up to recover, it has to recover the same memory load that just caused the last process to crash.
>>
>> B) Cluster management software decides to take a RangeServer machine out of service. Say it's a machine with 8 GB of RAM and Hypertable has buffered up 5 GB in memory. It doesn't get a chance to compact before being taken down. The machine chosen as a replacement server only has 4 GB of available RAM. It will somehow have to recover the 5 GB memory state of the old server.
>>
>> Maybe these are post-1.0 concerns. I'm wondering what I can do now. The "solution" I'm looking at is to wipe out my entire Hypertable installation and try to isolate #1 from a clean slate. Any suggestions for a less drastic fix?
>>
>> Josh
>>
>> On Mon, Sep 8, 2008 at 11:38 AM, Luke <[EMAIL PROTECTED]> wrote:
>>
>>> Maybe we should consider an option to split off (moving to another range server) the lower or higher half of a range, depending on the loading pattern of the data. The range server can dynamically detect whether the row keys are arriving in ascending order and split off the higher half of the range, or vice versa, to balance the data better (this is better than rebalancing the data later, which involves extra copies).
>>>
>>> __Luke
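Luke's heuristic could look something like the sketch below: sample the recently inserted keys, and if they are mostly ascending, move the upper (hot) half to another RangeServer at split time so the write load follows the split; if mostly descending, move the lower half. The enum, the function name, and the sampling policy are all invented for illustration and are not existing Hypertable code.

    // Sketch of the split-direction heuristic proposed above.
    #include <iostream>
    #include <string>
    #include <vector>

    enum class SplitOff { LowerHalf, HigherHalf };

    // Decide which half of a splitting range should move to another
    // RangeServer, based on a sample of recently inserted row keys.
    SplitOff choose_half_to_move(const std::vector<std::string> &recent_keys) {
      size_t ascending = 0, descending = 0;
      for (size_t i = 1; i < recent_keys.size(); ++i) {
        if (recent_keys[i] > recent_keys[i - 1]) ++ascending;
        else if (recent_keys[i] < recent_keys[i - 1]) ++descending;
      }
      // Mostly ascending keys: new writes land in the upper half, so ship the
      // upper half elsewhere and keep the now-cold lower half locally.
      return ascending >= descending ? SplitOff::HigherHalf : SplitOff::LowerHalf;
    }

    int main() {
      std::vector<std::string> keys = {"row-0001", "row-0002", "row-0003"};
      std::cout << (choose_half_to_move(keys) == SplitOff::HigherHalf
                        ? "move higher half\n" : "move lower half\n");
      return 0;
    }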
>>> On Sep 8, 9:14 am, "Doug Judd" <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> The problem here is that this particular workload (loading data in ascending order of primary key) is worst-case from Hypertable's perspective. It works optimally with random updates that are uniform across the primary key space.
>>>>
>>>> The way the system works is that a single range server ends up handling all of the load. When a range fills up and splits, the lower half will get re-assigned to another range server. However, since there will be no more updates to that lower half, there will be no activity on that range. When a range splits, it first does a major compaction. After the split, both ranges (lower half and upper half) will share the same CellStore file in the DFS. This is why you see 4313 range directories that are empty (their key/value pairs are inside a CellStore file that is shared with range AB2A0D28DE6B77FFDD6C72AF and is inside this range's directory). So, the ranges are getting round-robin assigned to all of the RangeServers; it's just that the RangeServer that holds range AB2A0D28DE6B77FFDD6C72AF is doing all of the work.
>>>>
>>>> There is probably a bug that is preventing the commit log from getting garbage collected in this scenario. I have a couple of high priority things on my stack right now, so I probably won't get to it until later this week or early next week. If you have any time to investigate, the place to look would be RangeServer::log_cleanup(). This method gets called once per minute to do log fragment garbage collection.
>>>>
>>>> Also, this workload seems like it is more common than we initially expected. In fact, it is the same workload that we here at Zvents see in our production log processing deployment. We should definitely spend some time optimizing Hypertable for this type of workload.
>>>>
>>>> - Doug
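The general shape of that garbage collection is roughly what the sketch below illustrates: a commit-log fragment can only be reclaimed once every range has compacted all of the updates the fragment contains. The types and the function here are invented for the example; the actual logic lives in RangeServer::log_cleanup() and may differ in detail.

    // Rough illustration of the reclamation condition a once-a-minute
    // log-cleanup pass has to enforce: a fragment is safe to delete only when
    // its newest update is older than the oldest update any range still holds
    // exclusively in memory (i.e. not yet compacted into a CellStore).
    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct LogFragment {
      int64_t newest_revision;   // newest update written to this fragment
    };

    struct RangeState {
      // Oldest in-memory update not yet persisted by a compaction; a range
      // with nothing buffered reports INT64_MAX here.
      int64_t earliest_uncompacted_revision;
    };

    // Fragments are ordered oldest first; returns how many of them (from the
    // front) are safe to reclaim.
    size_t reclaimable_prefix(const std::vector<LogFragment> &fragments,
                              const std::vector<RangeState> &ranges) {
      int64_t pin = INT64_MAX;
      for (const RangeState &r : ranges)
        pin = std::min(pin, r.earliest_uncompacted_revision);

      size_t count = 0;
      for (const LogFragment &f : fragments) {
        if (f.newest_revision < pin)
          ++count;
        else
          break;   // the first fragment still needed pins everything after it
      }
      return count;
    }

    int main() {
      std::vector<LogFragment> fragments = {{100}, {200}, {300}}; // oldest first
      std::vector<RangeState> ranges = {{250}, {INT64_MAX}};
      std::cout << reclaimable_prefix(fragments, ranges)          // prints 2
                << " fragment(s) could be reclaimed\n";
      return 0;
    }

A range that never compacts (for example, a cold lower half that still references a shared CellStore) would keep pinning the log under this scheme, which is consistent with the behavior described in this thread.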
>>>> On Sun, Sep 7, 2008 at 1:11 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi Donald,
>>>>>
>>>>> Thanks for the insights! That's interesting that the server has so many ranges loaded on it. Does Hypertable not yet redistribute ranges for balancing?
>>>>>
>>>>> Looking in /hypertable/tables/X/default/, I see 4313 directories, which I guess correspond to the ranges. If what you're saying is true, then that one server has all the ranges. When I was looking at the METADATA table earlier, I seem to remember that ranges seemed to be spread around as far as the METADATA table was concerned. I can't verify that now because half of the RangeServers in the cluster went down after I tried the 15-way load last night. Maybe these log directories indicate that each range was created on this one server, but isn't necessarily still hosted there.
>>>>>
>>>>> Looking in the table range directories, I see that most of them are empty. Of the 4313 table range directories, only 12 have content, with the following size distribution:
>>>>>
>>>>> Name                       Size in bytes
>>>>> 71F33965BA815E48705DB484          772005
>>>>> D611DD0EE66B8CF9FB4AA997        40917711
>>>>> 38D1E3EA8AD2F6D4BA9A4DF8        74199178
>>>>> AB2A0D28DE6B77FFDD6C72AF    659455660576
>>>>> 4F07C111DD9998285C68F405             900
>>>>> F449F89DDE481715AE83F46C        29046097
>>>>> 1A0950A7883F9AC068C6B5FD        54621737
>>>>> 9213BEAADBFF69E633617D98             900
>>>>> 6224D36D9A7D3C5B4AE941B2       131677668
>>>>> 6C33339858EDF470B771637C       132973214
>>>>> 64365528C0D82ED25FC7FFB0       170159530
>>>>> C874EFC44725DB064046A0FF             900
>>>>>
>>>>> It's really skewed, but maybe this isn't a big deal. I'm going to guess that the 650 GB slice corresponds to the end range of the table. Most of the data gets created there. When a split happens, the new range holds a reference to the files in the original range and never has the need to do a compaction into its own data space.
>>>>>
>>>>> As for the log recovery process... when I wrote the last message, the recovery was still happening and had been running for 115 minutes. I let it continue to run to see if it would actually finish, and it did. Looking at the log, it appears that it actually took around 180 minutes to complete and get back to the outstanding scanner request, which had long since timed out. After the recovery, the server is back up to 2.8 GB of memory. The log directory still contains the 4300+ split directories, and the user commit log directory still contains 350+ GB of data.
>>>>>
>>>>> You suggest that the log data is supposed to be cleaned up. I'm using a post-0.9.0.10 build (v0.9.0.10-14-g50e5f71 to be exact). It contains what I think is the patch you're referencing:
>>>>>
>>>>> commit 38bbfd60d1a52aff3230dea80aa4f3c0c07daae4
>>>>> Author: Donald <[EMAIL PROTECTED]>
>>>>>     Fixed a bug in RangeServer::schedule_log_cleanup_compactions that prevents log cleanup com...
>>>>>
>>>>> I'm hoping the maintenance task threads weren't too busy for this workload, as it was pretty light. This is a 15-server cluster with a single active client writing to the table and nobody reading from it. Like I said earlier, I tried a 15-way write after the recovery completed and half the RangeServers died. It looks like they all lost their Hyperspace lease, and the Hyperspace.master machine was 80% in the iowait state with a load average of 20 for a while. That server hosts an HDFS datanode, a RangeServer, and Hyperspace.master. Maybe Hyperspace.master needs a dedicated server? I should probably take that issue to another thread.
>>>>>
>>>>> I'll look into it further, probably tomorrow.
>>>>>
>>>>> Josh
>>>>>
>>>>> On Sat, Sep 6, 2008 at 9:29 PM, Liu Kejia (Donald) <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hi Josh,
>>>>>>
>>>>>> The 4311 directories are for split logs; they are used while a range is splitting into two. This indicates you have 4K+ ranges on that server, which is pretty big (I usually have several hundred per server). The 3670 files are commit log files. I think it's actually quite good performance to take 115 minutes to replay a total of 350 GB of logs; you get 50 MB/s throughput anyway. The problem is that many of these commit log files should be removed over time, after compactions of the ranges take place. Ideally you'll only have 1 or 2 of these files left after all the maintenance tasks are done. If so, the replay process only costs several seconds.
>>>>>>
>>>>>> One reason why the commit log files are not getting reclaimed is a bug in the range server code; I've pushed out a fix for it and it should be included in the latest 0.9.0.10 release. Another reason could be that your maintenance task threads are too busy to get the work done in time; you may try to increase the number of maintenance tasks by setting Hypertable.RangeServer.MaintenanceThreads in your hypertable.cfg file.
>>>>>>
>>>>>> About load balance, I think your guess is right. About HDFS, it seems HDFS always tries to put one copy of the file block on the local datanode. This gives good performance, but certainly bad load balance if you keep writing from one server.
>>>>>>
>>>>>> Donald
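If you do try Donald's suggestion of raising the maintenance thread count, it amounts to a single property in hypertable.cfg; the value shown is only an illustration, not a recommendation, so size it to your hardware:

    # hypertable.cfg: the thread count below is illustrative
    Hypertable.RangeServer.MaintenanceThreads=4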
>>>>>> On Sun, Sep 7, 2008 at 10:20 AM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> I had a RangeServer process that was taking up around 5.8 GB of memory, so I shot it down and restarted it. The RangeServer has spent the last 80 CPU-minutes (>115 minutes on the clock) in local_recover(). Is this normal?
>>>>>>>
>>>>>>> Looking around HDFS, I see around 3670 files in the server's /.../log/user/ directory, most of which are around 100 MB in size (total directory size: 351,031,700,665 bytes). I also see 4311 directories in the parent directory, of which 4309 are named with a 24-character hex string. Spot inspection of these shows that most (all?) of them contain a single 0-byte file named "0".
>>>>>>>
>>>>>>> The RangeServer log file since the restart currently contains over 835,000 lines. The bulk seems to be lines like:
>>>>>>>
>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>>
>>>>>>> The memory usage may be the same issue that Donald was reporting earlier in his discussion of fragmentation. The new RangeServer process has grown to 1.5 GB of memory again, but the max cache size is 200 MB (the default).
>>>>>>>
>>>>>>> I'd been loading into a 15-node Hypertable cluster all week using a single loader process. I'd loaded about 5 billion cells, or around 1.5 TB of data, before I decided to kill the loader because it was taking too long (and that one server was getting huge). The total data set size is around 3.5 TB and it took under a week to generate the original set (using 15-way parallelism, not just a single loader), so I decided to try to load the rest in a distributed manner.
>>>>>>>
>>>>>>> The loading was happening in ascending row order. It seems like all of the loading was happening on the same server. I'm guessing that when splits happened, the low range got moved off, and the same server continued to load the end range. That might explain why one server was getting all the traffic.
>>>>>>>
>>>>>>> Looking at HDFS disk usage, the loaded server has 954 GB of disk used for Hadoop and the other 14 all have around 140 GB of disk usage. This behavior also has me wondering what happens when that one machine fills up (another couple hundred GB). Does the whole system crash, or does HDFS get smarter about balancing?
>>>>>>>
>>>> ...
