Hi Josh, If you're just trying to get the system up and running and don't mind potentially losing some data, you could try this. Do a directory listing of /hypertable/tables/X/default/AB2A0D28DE6B77FFDD6C72AF, find the newest CellStore file (csNNNN), and note its creation time t. Then, in the log/user/ directory of the server that is handling all of the load, delete every log fragment whose creation time is earlier than t. I think that should actually work without data loss.
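For concreteness, here's the kind of script I have in mind. This is an untested sketch, not a supported tool: the log directory path is a placeholder you need to fill in, it assumes `hadoop fs -ls` prints the modification date, time, and path as the last three columns (verify that on your Hadoop version), and it only prints the delete commands so you can inspect them before removing anything.

    import subprocess

    # The range directory is the one named above; the log directory is a
    # placeholder -- point it at the loaded RangeServer's log/user/ directory.
    RANGE_DIR = "/hypertable/tables/X/default/AB2A0D28DE6B77FFDD6C72AF"
    LOG_DIR = "/path/to/that/servers/log/user"   # FILL IN

    def ls(path):
        """Yield (timestamp, path) pairs from 'hadoop fs -ls'.  Assumes the
        date, time and name are the last three columns of each entry."""
        out = subprocess.run(["hadoop", "fs", "-ls", path],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            cols = line.split()
            if len(cols) < 6:           # skip the "Found N items" header line
                continue
            yield cols[-3] + " " + cols[-2], cols[-1]

    # Timestamp t of the newest CellStore (csNNNN) in the big range.
    t = max(ts for ts, name in ls(RANGE_DIR)
            if name.rsplit("/", 1)[-1].startswith("cs"))

    # Print (do not execute) removals for log fragments older than t.
    # "YYYY-MM-DD HH:MM" strings compare correctly as plain strings.
    for ts, frag in ls(LOG_DIR):
        if ts < t:
            print("hadoop fs -rm", frag)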
- Doug

On Wed, Sep 10, 2008 at 4:37 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:

> I'm still trying to get my Hypertable cluster running again. After seeing half the RangeServers die because they lost their Hyperspace session when loading with 5 concurrent clients, I decided to take Donald's advice and give the Master processes (Hypertable+Hyperspace) a dedicated node. Then I tried restarting the failed RangeServers. This time the one with the 350+ GB of commit logs spent 6 hours trying to recover before I noticed it had grown to 15 GB of memory (7 GB RSS). I shot it down since it was just thrashing at that point.
>
> So now I seem to have two problems:
>
> 1) Log cleanup doesn't seem to be working, so I have to replay 350+ GB when I restart.
>
> 2) When replaying the logs, I run out of memory.
>
> I've been trying to figure out #2, since I can no longer keep the servers running long enough to address #1. It looks like all compactions are deferred until the recovery is done. Commits get loaded into memory until the machine runs out, then boom. I don't have the best understanding of the recovery strategy, but I'd guess that fixing this problem would require some major surgery.
>
> One argument is that #2 isn't worth fixing: if #1 were working properly, the system wouldn't get itself into such a bad state, and recovery could just assume there's enough memory most of the time.
>
> Most of the time is not all of the time, though. I can imagine some normal use cases where this problem would pop up:
>
> A) One server is falling behind on compactions due to hardware issues or resource contention, and it eventually crashes for lack of memory. When another server comes up to recover, it has to recover the same memory load that just caused the last process to crash.
>
> B) Cluster management software decides to take a RangeServer machine out of service. Say it's a machine with 8 GB of RAM and Hypertable has buffered up 5 GB in memory. It doesn't get a chance to compact before being taken down. The machine chosen as a replacement only has 4 GB of available RAM. It will somehow have to recover the 5 GB memory state of the old server.
>
> Maybe these are post-1.0 concerns. I'm wondering what I can do now. The "solution" I'm looking at is to wipe out my entire Hypertable installation and try to isolate #1 from a clean slate. Any suggestions for a less drastic fix?
>
> Josh
>
> On Mon, Sep 8, 2008 at 11:38 AM, Luke <[EMAIL PROTECTED]> wrote:
>
>> Maybe we should consider an option to split off (moving to another range server) either the lower or the higher half of a range, depending on the loading pattern of the data. The range server could dynamically detect whether the row keys are arriving in ascending order and split off the higher half of the range, or vice versa, to balance the data better (that beats rebalancing the data later, which involves extra copies).
>>
>> __Luke
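To make Luke's proposal concrete, here is a rough sketch of the kind of heuristic he describes. This is illustrative Python only, not Hypertable code, and every name in it is invented.

    # Toy model of the idea: track whether row keys arrive in ascending
    # order, and at split time relocate the half that will keep taking
    # writes instead of always relocating the lower half.
    class SplitDirectionHeuristic:
        def __init__(self):
            self.last_key = None
            self.ascending = 0
            self.total = 0

        def observe(self, row_key):
            """Call on each update to sample the key ordering."""
            if self.last_key is not None:
                self.total += 1
                if row_key > self.last_key:
                    self.ascending += 1
            self.last_key = row_key

        def half_to_move(self):
            """Which half to hand to another RangeServer at split time."""
            if self.total and self.ascending / self.total > 0.95:
                return "upper"   # keys are ascending: move the hot upper half
            return "lower"       # current behavior: move the cold lower half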
>> On Sep 8, 9:14 am, "Doug Judd" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Josh,
>>>
>>> The problem here is that this particular workload (loading data in ascending order of primary key) is the worst case from Hypertable's perspective. The system works best with random updates distributed uniformly across the primary key space.
>>>
>>> The way the system works is that a single range server ends up handling all of the load. When a range fills up and splits, the lower half gets re-assigned to another range server. However, since there will be no more updates to that lower half, there will be no activity on that range. When a range splits, it first does a major compaction. After the split, both ranges (lower half and upper half) share the same CellStore file in the DFS. This is why you see 4313 range directories that are empty: their key/value pairs are inside a CellStore file that is shared with range AB2A0D28DE6B77FFDD6C72AF and lives inside that range's directory. So the ranges are getting round-robin assigned to all of the RangeServers; it's just that the RangeServer holding range AB2A0D28DE6B77FFDD6C72AF is doing all of the work.
>>>
>>> There is probably a bug that is preventing the commit log from getting garbage collected in this scenario. I have a couple of high-priority things on my stack right now, so I probably won't get to it until later this week or early next week. If you have any time to investigate, the place to look would be RangeServer::log_cleanup(). This method gets called once per minute to do log fragment garbage collection.
>>>
>>> Also, this workload seems to be more common than we initially expected. In fact, it is the same workload that we here at Zvents see in our production log processing deployment. We should definitely spend some time optimizing Hypertable for this type of workload.
>>>
>>> - Doug
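To make that garbage-collection condition concrete: roughly, a commit-log fragment only becomes removable once no range still holds un-compacted updates from it in memory. Below is a toy sketch of that bookkeeping, an illustration of the idea only, not the actual RangeServer::log_cleanup() code.

    # Each log fragment knows the newest update revision it contains; each
    # range knows the oldest revision still sitting in memory (not yet
    # flushed to a CellStore by a compaction).  A fragment is removable
    # once it is older than everything any range still has in memory.
    def removable_fragments(fragments, ranges):
        """fragments: {fragment_name: newest_revision_in_fragment}
           ranges:    {range_name: oldest_in_memory_revision, or None}"""
        pinned = [rev for rev in ranges.values() if rev is not None]
        cutoff = min(pinned) if pinned else float("inf")
        return [name for name, newest in fragments.items() if newest < cutoff]

    # Until the big range compacts, its oldest in-memory revision pins every
    # fragment written since then, so almost nothing gets reclaimed:
    print(removable_fragments({"log/user/0": 100, "log/user/1": 200},
                              {"AB2A0D28...": 150, "71F33965...": None}))
    # -> ['log/user/0']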
>>> On Sun, Sep 7, 2008 at 1:11 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi Donald,
>>>>
>>>> Thanks for the insights! That's interesting that the server has so many ranges loaded on it. Does Hypertable not yet redistribute ranges for balancing?
>>>>
>>>> Looking in /hypertable/tables/X/default/, I see 4313 directories, which I guess correspond to the ranges. If what you're saying is true, then that one server has all the ranges. When I was looking at the METADATA table earlier, I seem to remember that the ranges appeared to be spread around, as far as METADATA was concerned. I can't verify that now because half of the RangeServers in the cluster went down after I tried the 15-way load last night. Maybe these log directories indicate that each range was created on this one server but isn't necessarily still hosted there.
>>>>
>>>> Looking in the table range directories, I see that most of them are empty. Of the 4313 table range directories, only 12 have content, with the following size distribution:
>>>>
>>>> Name                       Size in bytes
>>>> 71F33965BA815E48705DB484          772005
>>>> D611DD0EE66B8CF9FB4AA997        40917711
>>>> 38D1E3EA8AD2F6D4BA9A4DF8        74199178
>>>> AB2A0D28DE6B77FFDD6C72AF    659455660576
>>>> 4F07C111DD9998285C68F405             900
>>>> F449F89DDE481715AE83F46C        29046097
>>>> 1A0950A7883F9AC068C6B5FD        54621737
>>>> 9213BEAADBFF69E633617D98             900
>>>> 6224D36D9A7D3C5B4AE941B2       131677668
>>>> 6C33339858EDF470B771637C       132973214
>>>> 64365528C0D82ED25FC7FFB0       170159530
>>>> C874EFC44725DB064046A0FF             900
>>>>
>>>> It's really skewed, but maybe that isn't a big deal. I'm going to guess that the 650 GB slice corresponds to the end range of the table; most of the data gets created there. When a split happens, the new range holds a reference to the files in the original range and never needs to do a compaction into its own data space.
>>>>
>>>> As for the log recovery process: when I wrote the last message, the recovery was still happening and had been running for 115 minutes. I let it continue to see if it would actually finish, and it did. Looking at the log, it appears that it actually took around 180 minutes to complete and get back to the outstanding scanner request, which had long since timed out. After the recovery, the server is back up to 2.8 GB of memory. The log directory still contains the 4300+ split directories, and the user commit log directory still contains 350+ GB of data.
>>>>
>>>> You suggest that the log data is supposed to be cleaned up. I'm using a post-0.9.0.10 build (v0.9.0.10-14-g50e5f71 to be exact). It contains what I think is the patch you're referencing:
>>>>
>>>> commit 38bbfd60d1a52aff3230dea80aa4f3c0c07daae4
>>>> Author: Donald <[EMAIL PROTECTED]>
>>>>     Fixed a bug in RangeServer::schedule_log_cleanup_compactions that prevents log cleanup com...
>>>>
>>>> I'm hoping the maintenance task threads weren't too busy for this workload, as it was pretty light. This is a 15-server cluster with a single active client writing to the table and nobody reading from it. Like I said earlier, I tried a 15-way write after the recovery completed and half the RangeServers died. It looks like they all lost their Hyperspace lease, and the Hyperspace.master machine was 80% in iowait with a load average of 20 for a while. That server hosts an HDFS datanode, a RangeServer, and Hyperspace.master. Maybe Hyperspace.master needs a dedicated server? I should probably take that issue to another thread.
>>>>
>>>> I'll look into it further, probably tomorrow.
>>>>
>>>> Josh
>>>>
>>>> On Sat, Sep 6, 2008 at 9:29 PM, Liu Kejia(Donald) <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi Josh,
>>>>>
>>>>> The 4311 directories are for split logs; they are used while a range is splitting into two. This indicates you have at least 4K+ ranges on that server, which is pretty big (I usually have several hundred per server). The 3670 files are commit log files. It's actually quite good performance to replay a total of 350 GB of logs in 115 minutes; that works out to about 50 MB/s of throughput. The problem is that many of these commit log files should be removed over time, after compactions of the ranges take place. Ideally you'll only have one or two of these files left after all the maintenance tasks are done, in which case the replay process takes only a few seconds.
>>>>>
>>>>> One reason the commit log files are not getting reclaimed is a bug in the range server code; I've pushed out a fix for it, and it should be included in the latest 0.9.0.10 release. Another reason could be that your maintenance task threads are too busy to get the work done in time; you can try increasing the number of maintenance tasks by setting Hypertable.RangeServer.MaintenanceThreads in your hypertable.cfg file.
>>>>>
>>>>> About load balance, I think your guess is right. As for HDFS, it seems HDFS always tries to put one copy of each file block on the local datanode. This gives good performance, but certainly bad load balance if you keep writing from one server.
>>>>>
>>>>> Donald
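For reference, the setting Donald mentions above is a property in hypertable.cfg. The property name is taken verbatim from his message; the value below is only an example, so check the default for your build before changing it.

    # hypertable.cfg -- more maintenance (compaction) worker threads, so
    # log cleanup compactions can keep up with the write load.
    Hypertable.RangeServer.MaintenanceThreads=4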
>>>>> On Sun, Sep 7, 2008 at 10:20 AM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> I had a RangeServer process that was taking up around 5.8 GB of memory, so I shot it down and restarted it. The RangeServer has spent the last 80 CPU-minutes (>115 minutes on the clock) in local_recover(). Is this normal?
>>>>>>
>>>>>> Looking around HDFS, I see around 3670 files in the server's /.../log/user/ directory, most of which are around 100 MB in size (total directory size: 351,031,700,665 bytes). I also see 4311 directories in the parent directory, of which 4309 are named with a 24-character hex string. Spot inspection shows that most (all?) of these contain a single zero-byte file named "0".
>>>>>>
>>>>>> The RangeServer log file since the restart currently contains over 835,000 lines. The bulk seems to be lines like:
>>>>>>
>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>
>>>>>> The memory usage may be the same issue that Donald was reporting earlier in his discussion of fragmentation. The new RangeServer process has grown to 1.5 GB of memory again, but the max cache size is 200 MB (the default).
>>>>>>
>>>>>> I'd been loading into a 15-node Hypertable cluster all week using a single loader process. I'd loaded about 5 billion cells, or around 1.5 TB of data, before I decided to kill the loader because it was taking too long (and that one server was getting huge). The total data set is around 3.5 TB and it took under a week to generate the original set (using 15-way parallelism, not just a single loader), so I decided to try loading the rest in a distributed manner.
>>>>>>
>>>>>> The loading was happening in ascending row order, and it seems like all of the loading was happening on the same server. I'm guessing that when splits happened, the low range got moved off and the same server continued to load the end range. That might explain why one server was getting all the traffic.
>>>>>>
>>>>>> Looking at HDFS disk usage, the loaded server has 954 GB of disk used for Hadoop and the other 14 all have around 140 GB. This behavior also has me wondering what happens when that one machine fills up (another couple hundred GB). Does the whole system crash, or does HDFS get smarter about balancing?
