Thanks Doug, I'll give it a try.

On Wed, Sep 10, 2008 at 4:53 PM, Doug Judd <[EMAIL PROTECTED]> wrote:
> Hi Josh,
>
> If you're just trying to get the system up and running and don't mind if you potentially lose some data, you could try this. Do a directory listing in the /hypertable/tables/X/default/AB2A0D28DE6B77FFDD6C72AF directory and find the newest CellStore file csNNNN and remember the creation time t. Then, in the log/user/ directory of the server that is handling all of the load, delete all of the log fragments that have a creation time that is less than t. I think that should actually work without data loss.
>
> - Doug
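For anyone else who needs to follow this recipe, here is a minimal C++ sketch of the selection step only (flag every log fragment created before the newest CellStore). The FragmentInfo struct, the hard-coded listing, and the timestamps are illustrative assumptions, not Hypertable API; the real listing and the deletions would still be done against the DFS by hand.

    // Sketch of the fragment-selection rule described above: given the
    // creation time t of the newest CellStore csNNNN, every commit-log
    // fragment in log/user/ created before t is a candidate for deletion.
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    struct FragmentInfo {
      std::string path;        // e.g. ".../log/user/1234"
      int64_t create_time;     // creation time reported by the DFS
    };

    // Fragments created strictly before the newest CellStore are deletable.
    std::vector<FragmentInfo>
    deletable_fragments(const std::vector<FragmentInfo> &fragments,
                        int64_t newest_cellstore_ctime) {
      std::vector<FragmentInfo> result;
      for (const FragmentInfo &frag : fragments)
        if (frag.create_time < newest_cellstore_ctime)
          result.push_back(frag);
      return result;
    }

    int main() {
      // Made-up listing of log/user/; in reality this comes from the DFS.
      std::vector<FragmentInfo> fragments = {
        {"log/user/0", 100}, {"log/user/1", 200}, {"log/user/2", 300}};
      int64_t newest_cellstore_ctime = 250;   // creation time t of csNNNN
      for (const FragmentInfo &f :
           deletable_fragments(fragments, newest_cellstore_ctime))
        std::cout << "candidate for deletion: " << f.path << "\n";
      return 0;
    }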
> On Wed, Sep 10, 2008 at 4:37 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>
>> I'm still trying to get my Hypertable cluster running again. After seeing half the RangeServers die because they lost their Hyperspace session when loading with 5 concurrent clients, I decided to take Donald's advice and give the Master processes (Hypertable+Hyperspace) a dedicated node. Then I tried restarting the failed RangeServers. This time the one with the 350+ GB of commit logs spent 6 hours trying to recover before I noticed it had grown to 15 GB of memory (7 GB RSS). I shot it down since it was just thrashing at that point.
>>
>> So now I seem to have two problems:
>>
>> 1) Log cleanup doesn't seem to be working, so I have to replay 350+ GB when I restart.
>>
>> 2) When replaying the logs, I run out of memory.
>>
>> I've been trying to figure out #2, since I can no longer keep the servers running long enough to address #1. It looks like all compactions are deferred until the recovery is done. Commits get loaded into memory until the machine runs out, then boom. I don't have the best understanding of the recovery strategy, but I'd guess that fixing this problem would require some major surgery.
>>
>> One argument is that #2 isn't worth fixing. If #1 were working properly, the system wouldn't get itself into such a bad state. The recovery can just assume there's enough memory most of the time.
>>
>> Most of the time is not all of the time, though. I can imagine some normal use cases where this problem would pop up:
>>
>> A) One server is falling behind on compactions due to hardware issues or resource contention and it eventually crashes for lack of memory. When another server comes up to recover, it has to recover the same memory load that just caused the last process to crash.
>>
>> B) Cluster management software decides to take a RangeServer machine out of service. Say it's a machine with 8 GB of RAM and Hypertable has buffered up 5 GB in memory. It doesn't get a chance to compact before being taken down. The machine chosen as a replacement server only has 4 GB of available RAM. It will somehow have to recover the 5 GB memory state of the old server.
>>
>> Maybe these are post-1.0 concerns. I'm wondering what I can do now. The "solution" I'm looking at is to wipe out my entire Hypertable installation and try to isolate #1 from a clean slate. Any suggestions for a less drastic fix?
>>
>> Josh
>>
>> On Mon, Sep 8, 2008 at 11:38 AM, Luke <[EMAIL PROTECTED]> wrote:
>>
>>> Maybe we should consider an option to split off (moving to another range server) the lower or higher half of a range, depending on the loading pattern of the data. The range server can dynamically detect whether the row keys are arriving in ascending order and split off the higher half of the range, or vice versa, to balance the data better (this is better than rebalancing the data later, which involves extra copies).
>>>
>>> __Luke
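Luke's heuristic could look something like the sketch below: sample the recently inserted keys, and if they are mostly ascending, move the upper (hot) half to another RangeServer at split time so the write load follows the split; if mostly descending, move the lower half. The enum, the function name, and the sampling policy are all invented for illustration and are not existing Hypertable code.

    // Sketch of the split-direction heuristic proposed above.
    #include <iostream>
    #include <string>
    #include <vector>

    enum class SplitOff { LowerHalf, HigherHalf };

    // Decide which half of a splitting range should move to another
    // RangeServer, based on a sample of recently inserted row keys.
    SplitOff choose_half_to_move(const std::vector<std::string> &recent_keys) {
      size_t ascending = 0, descending = 0;
      for (size_t i = 1; i < recent_keys.size(); ++i) {
        if (recent_keys[i] > recent_keys[i - 1]) ++ascending;
        else if (recent_keys[i] < recent_keys[i - 1]) ++descending;
      }
      // Mostly ascending keys: new writes land in the upper half, so ship the
      // upper half elsewhere and keep the now-cold lower half locally.
      return ascending >= descending ? SplitOff::HigherHalf : SplitOff::LowerHalf;
    }

    int main() {
      std::vector<std::string> keys = {"row-0001", "row-0002", "row-0003"};
      std::cout << (choose_half_to_move(keys) == SplitOff::HigherHalf
                        ? "move higher half\n" : "move lower half\n");
      return 0;
    }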
>>> On Sep 8, 9:14 am, "Doug Judd" <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> The problem here is that this particular workload (loading data in ascending order of primary key) is worst-case from Hypertable's perspective. It works optimally with random updates that are uniform across the primary key space.
>>>>
>>>> The way the system works is that a single range server ends up handling all of the load. When a range fills up and splits, the lower half will get re-assigned to another range server. However, since there will be no more updates to that lower half, there will be no activity on that range. When a range splits, it first does a major compaction. After the split, both ranges (lower half and upper half) will share the same CellStore file in the DFS. This is why you see 4313 range directories that are empty (their key/value pairs are inside a CellStore file that is shared with range AB2A0D28DE6B77FFDD6C72AF and is inside this range's directory). So, the ranges are getting round-robin assigned to all of the RangeServers; it's just that the RangeServer that holds range AB2A0D28DE6B77FFDD6C72AF is doing all of the work.
>>>>
>>>> There is probably a bug that is preventing the commit log from getting garbage collected in this scenario. I have a couple of high priority things on my stack right now, so I probably won't get to it until later this week or early next week. If you have any time to investigate, the place to look would be RangeServer::log_cleanup(). This method gets called once per minute to do log fragment garbage collection.
>>>>
>>>> Also, this workload seems like it is more common than we initially expected. In fact, it is the same workload that we here at Zvents see in our production log processing deployment. We should definitely spend some time optimizing Hypertable for this type of workload.
>>>>
>>>> - Doug
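The general shape of that garbage collection is roughly what the sketch below illustrates: a commit-log fragment can only be reclaimed once every range has compacted all of the updates the fragment contains. The types and the function here are invented for the example; the actual logic lives in RangeServer::log_cleanup() and may differ in detail.

    // Rough illustration of the reclamation condition a once-a-minute
    // log-cleanup pass has to enforce: a fragment is safe to delete only when
    // its newest update is older than the oldest update any range still holds
    // exclusively in memory (i.e. not yet compacted into a CellStore).
    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct LogFragment {
      int64_t newest_revision;   // newest update written to this fragment
    };

    struct RangeState {
      // Oldest in-memory update not yet persisted by a compaction; a range
      // with nothing buffered reports INT64_MAX here.
      int64_t earliest_uncompacted_revision;
    };

    // Fragments are ordered oldest first; returns how many of them (from the
    // front) are safe to reclaim.
    size_t reclaimable_prefix(const std::vector<LogFragment> &fragments,
                              const std::vector<RangeState> &ranges) {
      int64_t pin = INT64_MAX;
      for (const RangeState &r : ranges)
        pin = std::min(pin, r.earliest_uncompacted_revision);

      size_t count = 0;
      for (const LogFragment &f : fragments) {
        if (f.newest_revision < pin)
          ++count;
        else
          break;   // the first fragment still needed pins everything after it
      }
      return count;
    }

    int main() {
      std::vector<LogFragment> fragments = {{100}, {200}, {300}}; // oldest first
      std::vector<RangeState> ranges = {{250}, {INT64_MAX}};
      std::cout << reclaimable_prefix(fragments, ranges)          // prints 2
                << " fragment(s) could be reclaimed\n";
      return 0;
    }

A range that never compacts (for example, a cold lower half that still references a shared CellStore) would keep pinning the log under this scheme, which is consistent with the behavior described in this thread.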
>>>> On Sun, Sep 7, 2008 at 1:11 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi Donald,
>>>>>
>>>>> Thanks for the insights! That's interesting that the server has so many ranges loaded on it. Does Hypertable not yet redistribute ranges for balancing?
>>>>>
>>>>> Looking in /hypertable/tables/X/default/, I see 4313 directories, which I guess correspond to the ranges. If what you're saying is true, then that one server has all the ranges. When I was looking at the METADATA table earlier, I seem to remember that ranges seemed to be spread around as far as the METADATA table was concerned. I can't verify that now because half of the RangeServers in the cluster went down after I tried the 15-way load last night. Maybe these log directories indicate that each range was created on this one server, but isn't necessarily still hosted there.
>>>>>
>>>>> Looking in the table range directories, I see that most of them are empty. Of the 4313 table range directories, only 12 have content, with the following size distribution:
>>>>>
>>>>> Name                       Size in bytes
>>>>> 71F33965BA815E48705DB484          772005
>>>>> D611DD0EE66B8CF9FB4AA997        40917711
>>>>> 38D1E3EA8AD2F6D4BA9A4DF8        74199178
>>>>> AB2A0D28DE6B77FFDD6C72AF    659455660576
>>>>> 4F07C111DD9998285C68F405             900
>>>>> F449F89DDE481715AE83F46C        29046097
>>>>> 1A0950A7883F9AC068C6B5FD        54621737
>>>>> 9213BEAADBFF69E633617D98             900
>>>>> 6224D36D9A7D3C5B4AE941B2       131677668
>>>>> 6C33339858EDF470B771637C       132973214
>>>>> 64365528C0D82ED25FC7FFB0       170159530
>>>>> C874EFC44725DB064046A0FF             900
>>>>>
>>>>> It's really skewed, but maybe this isn't a big deal. I'm going to guess that the 650 GB slice corresponds to the end range of the table. Most of the data gets created there. When a split happens, the new range holds a reference to the files in the original range and never has the need to do a compaction into its own data space.
>>>>>
>>>>> As for the log recovery process... when I wrote the last message, the recovery was still happening and had been running for 115 minutes. I let it continue to run to see if it would actually finish, and it did. Looking at the log, it appears that it actually took around 180 minutes to complete and get back to the outstanding scanner request, which had long since timed out. After the recovery, the server is back up to 2.8 GB of memory. The log directory still contains the 4300+ split directories, and the user commit log directory still contains 350+ GB of data.
>>>>>
>>>>> You suggest that the log data is supposed to be cleaned up. I'm using a post-0.9.0.10 build (v0.9.0.10-14-g50e5f71 to be exact). It contains what I think is the patch you're referencing:
>>>>>
>>>>> commit 38bbfd60d1a52aff3230dea80aa4f3c0c07daae4
>>>>> Author: Donald <[EMAIL PROTECTED]>
>>>>>     Fixed a bug in RangeServer::schedule_log_cleanup_compactions that prevents log cleanup com...
>>>>>
>>>>> I'm hoping the maintenance task threads weren't too busy for this workload, as it was pretty light. This is a 15-server cluster with a single active client writing to the table and nobody reading from it. Like I said earlier, I tried a 15-way write after the recovery completed and half the RangeServers died. It looks like they all lost their Hyperspace lease, and the Hyperspace.master machine was 80% in the iowait state with a load average of 20 for a while. That server hosts an HDFS datanode, a RangeServer, and Hyperspace.master. Maybe Hyperspace.master needs a dedicated server? I should probably take that issue to another thread.
>>>>>
>>>>> I'll look into it further, probably tomorrow.
>>>>>
>>>>> Josh
>>>>>
>>>>> On Sat, Sep 6, 2008 at 9:29 PM, Liu Kejia (Donald) <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hi Josh,
>>>>>>
>>>>>> The 4311 directories are for split logs; they are used while a range is splitting into two. This indicates you have 4K+ ranges on that server, which is pretty big (I usually have several hundred per server). The 3670 files are commit log files. I think it's actually quite good performance to take 115 minutes to replay a total of 350 GB of logs; you get 50 MB/s throughput anyway. The problem is that many of these commit log files should be removed over time, after compactions of the ranges take place. Ideally you'll only have 1 or 2 of these files left after all the maintenance tasks are done. If so, the replay process only costs several seconds.
>>>>>>
>>>>>> One reason why the commit log files are not getting reclaimed is a bug in the range server code; I've pushed out a fix for it and it should be included in the latest 0.9.0.10 release. Another reason could be that your maintenance task threads are too busy to get the work done in time; you may try to increase the number of maintenance tasks by setting Hypertable.RangeServer.MaintenanceThreads in your hypertable.cfg file.
>>>>>>
>>>>>> About load balance, I think your guess is right. About HDFS, it seems HDFS always tries to put one copy of the file block on the local datanode. This gives good performance, but certainly bad load balance if you keep writing from one server.
>>>>>>
>>>>>> Donald
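If you do try Donald's suggestion of raising the maintenance thread count, it amounts to a single property in hypertable.cfg; the value shown is only an illustration, not a recommendation, so size it to your hardware:

    # hypertable.cfg: the thread count below is illustrative
    Hypertable.RangeServer.MaintenanceThreads=4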
>>>>>> On Sun, Sep 7, 2008 at 10:20 AM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> I had a RangeServer process that was taking up around 5.8 GB of memory, so I shot it down and restarted it. The RangeServer has spent the last 80 CPU-minutes (>115 minutes on the clock) in local_recover(). Is this normal?
>>>>>>>
>>>>>>> Looking around HDFS, I see around 3670 files in the server's /.../log/user/ directory, most of which are around 100 MB in size (total directory size: 351,031,700,665 bytes). I also see 4311 directories in the parent directory, of which 4309 are named with a 24-character hex string. Spot inspection of these shows that most (all?) of them contain a single 0-byte file named "0".
>>>>>>>
>>>>>>> The RangeServer log file since the restart currently contains over 835,000 lines. The bulk seems to be lines like:
>>>>>>>
>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>>
>>>>>>> The memory usage may be the same issue that Donald was reporting earlier in his discussion of fragmentation. The new RangeServer process has grown to 1.5 GB of memory again, but the max cache size is 200 MB (the default).
>>>>>>>
>>>>>>> I'd been loading into a 15-node Hypertable cluster all week using a single loader process. I'd loaded about 5 billion cells, or around 1.5 TB of data, before I decided to kill the loader because it was taking too long (and that one server was getting huge). The total data set size is around 3.5 TB and it took under a week to generate the original set (using 15-way parallelism, not just a single loader), so I decided to try to load the rest in a distributed manner.
>>>>>>>
>>>>>>> The loading was happening in ascending row order. It seems like all of the loading was happening on the same server. I'm guessing that when splits happened, the low range got moved off, and the same server continued to load the end range. That might explain why one server was getting all the traffic.
>>>>>>>
>>>>>>> Looking at HDFS disk usage, the loaded server has 954 GB of disk used for Hadoop and the other 14 all have around 140 GB of disk usage. This behavior also has me wondering what happens when that one machine fills up (another couple hundred GB). Does the whole system crash, or does HDFS get smarter about balancing?
>>>>>>>
>>>> ...
