Hi Josh, If you're just trying to get the system up and running and don't mind potentially losing some data, you could try this. Do a directory listing of /hypertable/tables/X/default/AB2A0D28DE6B77FFDD6C72AF, find the newest CellStore file (csNNNN), and note its creation time t. Then, in the log/user/ directory of the server that is handling all of the load, delete every log fragment whose creation time is earlier than t. I think that should actually work without data loss.
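For concreteness, here's the kind of script I have in mind. This is an untested sketch, not a supported tool: the log directory path is a placeholder you need to fill in, it assumes `hadoop fs -ls` prints the modification date, time, and path as the last three columns (verify that on your Hadoop version), and it only prints the delete commands so you can inspect them before removing anything.

    import subprocess

    # The range directory is the one named above; the log directory is a
    # placeholder -- point it at the loaded RangeServer's log/user/ directory.
    RANGE_DIR = "/hypertable/tables/X/default/AB2A0D28DE6B77FFDD6C72AF"
    LOG_DIR = "/path/to/that/servers/log/user"   # FILL IN

    def ls(path):
        """Yield (timestamp, path) pairs from 'hadoop fs -ls'.  Assumes the
        date, time and name are the last three columns of each entry."""
        out = subprocess.run(["hadoop", "fs", "-ls", path],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            cols = line.split()
            if len(cols) < 6:           # skip the "Found N items" header line
                continue
            yield cols[-3] + " " + cols[-2], cols[-1]

    # Timestamp t of the newest CellStore (csNNNN) in the big range.
    t = max(ts for ts, name in ls(RANGE_DIR)
            if name.rsplit("/", 1)[-1].startswith("cs"))

    # Print (do not execute) removals for log fragments older than t.
    # "YYYY-MM-DD HH:MM" strings compare correctly as plain strings.
    for ts, frag in ls(LOG_DIR):
        if ts < t:
            print("hadoop fs -rm", frag)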
- Doug

On Wed, Sep 10, 2008 at 4:37 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:

> I'm still trying to get my Hypertable cluster running again. After seeing half the RangeServers die because they lost their Hyperspace session when loading with 5 concurrent clients, I decided to take Donald's advice and give the Master processes (Hypertable+Hyperspace) a dedicated node. Then I tried restarting the failed RangeServers. This time the one with the 350+ GB of commit logs spent 6 hours trying to recover before I noticed it had grown to 15 GB of memory (7 GB RSS). I shot it down since it was just thrashing at that point.
>
> So now I seem to have two problems:
>
> 1) Log cleanup doesn't seem to be working, so I have to replay 350+ GB when I restart.
>
> 2) When replaying the logs, I run out of memory.
>
> I've been trying to figure out #2, since I can no longer keep the servers running long enough to address #1. It looks like all compactions are deferred until the recovery is done. Commits get loaded into memory until the machine runs out, then boom. I don't have the best understanding of the recovery strategy, but I'd guess that fixing this problem would require some major surgery.
>
> One argument is that #2 isn't worth fixing: if #1 were working properly, the system wouldn't get itself into such a bad state, and recovery could just assume there's enough memory most of the time.
>
> Most of the time is not all of the time, though. I can imagine some normal use cases where this problem would pop up:
>
> A) One server is falling behind on compactions due to hardware issues or resource contention, and it eventually crashes for lack of memory. When another server comes up to recover, it has to recover the same memory load that just caused the last process to crash.
>
> B) Cluster management software decides to take a RangeServer machine out of service. Say it's a machine with 8 GB of RAM and Hypertable has buffered up 5 GB in memory. It doesn't get a chance to compact before being taken down. The machine chosen as a replacement only has 4 GB of available RAM. It will somehow have to recover the 5 GB memory state of the old server.
>
> Maybe these are post-1.0 concerns. I'm wondering what I can do now. The "solution" I'm looking at is to wipe out my entire Hypertable installation and try to isolate #1 from a clean slate. Any suggestions for a less drastic fix?
>
> Josh
>
> On Mon, Sep 8, 2008 at 11:38 AM, Luke <[EMAIL PROTECTED]> wrote:
>
>> Maybe we should consider an option to split off (moving to another range server) either the lower or the higher half of a range, depending on the loading pattern of the data. The range server could dynamically detect whether the row keys are arriving in ascending order and split off the higher half of the range, or vice versa, to balance the data better (that beats rebalancing the data later, which involves extra copies).
>>
>> __Luke
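To make Luke's proposal concrete, here is a rough sketch of the kind of heuristic he describes. This is illustrative Python only, not Hypertable code, and every name in it is invented.

    # Toy model of the idea: track whether row keys arrive in ascending
    # order, and at split time relocate the half that will keep taking
    # writes instead of always relocating the lower half.
    class SplitDirectionHeuristic:
        def __init__(self):
            self.last_key = None
            self.ascending = 0
            self.total = 0

        def observe(self, row_key):
            """Call on each update to sample the key ordering."""
            if self.last_key is not None:
                self.total += 1
                if row_key > self.last_key:
                    self.ascending += 1
            self.last_key = row_key

        def half_to_move(self):
            """Which half to hand to another RangeServer at split time."""
            if self.total and self.ascending / self.total > 0.95:
                return "upper"   # keys are ascending: move the hot upper half
            return "lower"       # current behavior: move the cold lower half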
>> On Sep 8, 9:14 am, "Doug Judd" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Josh,
>>>
>>> The problem here is that this particular workload (loading data in ascending order of primary key) is the worst case from Hypertable's perspective. The system works best with random updates distributed uniformly across the primary key space.
>>>
>>> The way the system works is that a single range server ends up handling all of the load. When a range fills up and splits, the lower half gets re-assigned to another range server. However, since there will be no more updates to that lower half, there will be no activity on that range. When a range splits, it first does a major compaction. After the split, both ranges (lower half and upper half) share the same CellStore file in the DFS. This is why you see 4313 range directories that are empty: their key/value pairs are inside a CellStore file that is shared with range AB2A0D28DE6B77FFDD6C72AF and lives inside that range's directory. So the ranges are getting round-robin assigned to all of the RangeServers; it's just that the RangeServer holding range AB2A0D28DE6B77FFDD6C72AF is doing all of the work.
>>>
>>> There is probably a bug that is preventing the commit log from getting garbage collected in this scenario. I have a couple of high-priority things on my stack right now, so I probably won't get to it until later this week or early next week. If you have any time to investigate, the place to look would be RangeServer::log_cleanup(). This method gets called once per minute to do log fragment garbage collection.
>>>
>>> Also, this workload seems to be more common than we initially expected. In fact, it is the same workload that we here at Zvents see in our production log processing deployment. We should definitely spend some time optimizing Hypertable for this type of workload.
>>>
>>> - Doug
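To make that garbage-collection condition concrete: roughly, a commit-log fragment only becomes removable once no range still holds un-compacted updates from it in memory. Below is a toy sketch of that bookkeeping, an illustration of the idea only, not the actual RangeServer::log_cleanup() code.

    # Each log fragment knows the newest update revision it contains; each
    # range knows the oldest revision still sitting in memory (not yet
    # flushed to a CellStore by a compaction).  A fragment is removable
    # once it is older than everything any range still has in memory.
    def removable_fragments(fragments, ranges):
        """fragments: {fragment_name: newest_revision_in_fragment}
           ranges:    {range_name: oldest_in_memory_revision, or None}"""
        pinned = [rev for rev in ranges.values() if rev is not None]
        cutoff = min(pinned) if pinned else float("inf")
        return [name for name, newest in fragments.items() if newest < cutoff]

    # Until the big range compacts, its oldest in-memory revision pins every
    # fragment written since then, so almost nothing gets reclaimed:
    print(removable_fragments({"log/user/0": 100, "log/user/1": 200},
                              {"AB2A0D28...": 150, "71F33965...": None}))
    # -> ['log/user/0']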
>>> On Sun, Sep 7, 2008 at 1:11 PM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi Donald,
>>>>
>>>> Thanks for the insights! That's interesting that the server has so many ranges loaded on it. Does Hypertable not yet redistribute ranges for balancing?
>>>>
>>>> Looking in /hypertable/tables/X/default/, I see 4313 directories, which I guess correspond to the ranges. If what you're saying is true, then that one server has all the ranges. When I was looking at the METADATA table earlier, I seem to remember that the ranges appeared to be spread around, as far as METADATA was concerned. I can't verify that now because half of the RangeServers in the cluster went down after I tried the 15-way load last night. Maybe these log directories indicate that each range was created on this one server but isn't necessarily still hosted there.
>>>>
>>>> Looking in the table range directories, I see that most of them are empty. Of the 4313 table range directories, only 12 have content, with the following size distribution:
>>>>
>>>> Name                       Size in bytes
>>>> 71F33965BA815E48705DB484          772005
>>>> D611DD0EE66B8CF9FB4AA997        40917711
>>>> 38D1E3EA8AD2F6D4BA9A4DF8        74199178
>>>> AB2A0D28DE6B77FFDD6C72AF    659455660576
>>>> 4F07C111DD9998285C68F405             900
>>>> F449F89DDE481715AE83F46C        29046097
>>>> 1A0950A7883F9AC068C6B5FD        54621737
>>>> 9213BEAADBFF69E633617D98             900
>>>> 6224D36D9A7D3C5B4AE941B2       131677668
>>>> 6C33339858EDF470B771637C       132973214
>>>> 64365528C0D82ED25FC7FFB0       170159530
>>>> C874EFC44725DB064046A0FF             900
>>>>
>>>> It's really skewed, but maybe that isn't a big deal. I'm going to guess that the 650 GB slice corresponds to the end range of the table; most of the data gets created there. When a split happens, the new range holds a reference to the files in the original range and never needs to do a compaction into its own data space.
>>>>
>>>> As for the log recovery process: when I wrote the last message, the recovery was still happening and had been running for 115 minutes. I let it continue to see if it would actually finish, and it did. Looking at the log, it appears that it actually took around 180 minutes to complete and get back to the outstanding scanner request, which had long since timed out. After the recovery, the server is back up to 2.8 GB of memory. The log directory still contains the 4300+ split directories, and the user commit log directory still contains 350+ GB of data.
>>>>
>>>> You suggest that the log data is supposed to be cleaned up. I'm using a post-0.9.0.10 build (v0.9.0.10-14-g50e5f71 to be exact). It contains what I think is the patch you're referencing:
>>>>
>>>> commit 38bbfd60d1a52aff3230dea80aa4f3c0c07daae4
>>>> Author: Donald <[EMAIL PROTECTED]>
>>>>     Fixed a bug in RangeServer::schedule_log_cleanup_compactions that prevents log cleanup com...
>>>>
>>>> I'm hoping the maintenance task threads weren't too busy for this workload, as it was pretty light. This is a 15-server cluster with a single active client writing to the table and nobody reading from it. Like I said earlier, I tried a 15-way write after the recovery completed and half the RangeServers died. It looks like they all lost their Hyperspace lease, and the Hyperspace.master machine was 80% in iowait with a load average of 20 for a while. That server hosts an HDFS datanode, a RangeServer, and Hyperspace.master. Maybe Hyperspace.master needs a dedicated server? I should probably take that issue to another thread.
>>>>
>>>> I'll look into it further, probably tomorrow.
>>>>
>>>> Josh
>>>>
>>>> On Sat, Sep 6, 2008 at 9:29 PM, Liu Kejia(Donald) <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi Josh,
>>>>>
>>>>> The 4311 directories are for split logs; they are used while a range is splitting into two. This indicates you have at least 4K+ ranges on that server, which is pretty big (I usually have several hundred per server). The 3670 files are commit log files. It's actually quite good performance to replay a total of 350 GB of logs in 115 minutes; that works out to about 50 MB/s of throughput. The problem is that many of these commit log files should be removed over time, after compactions of the ranges take place. Ideally you'll only have one or two of these files left after all the maintenance tasks are done, in which case the replay process takes only a few seconds.
>>>>>
>>>>> One reason the commit log files are not getting reclaimed is a bug in the range server code; I've pushed out a fix for it, and it should be included in the latest 0.9.0.10 release. Another reason could be that your maintenance task threads are too busy to get the work done in time; you can try increasing the number of maintenance tasks by setting Hypertable.RangeServer.MaintenanceThreads in your hypertable.cfg file.
>>>>>
>>>>> About load balance, I think your guess is right. As for HDFS, it seems HDFS always tries to put one copy of each file block on the local datanode. This gives good performance, but certainly bad load balance if you keep writing from one server.
>>>>>
>>>>> Donald
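For reference, the setting Donald mentions above is a property in hypertable.cfg. The property name is taken verbatim from his message; the value below is only an example, so check the default for your build before changing it.

    # hypertable.cfg -- more maintenance (compaction) worker threads, so
    # log cleanup compactions can keep up with the write load.
    Hypertable.RangeServer.MaintenanceThreads=4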
>>>>> On Sun, Sep 7, 2008 at 10:20 AM, Joshua Taylor <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> I had a RangeServer process that was taking up around 5.8 GB of memory, so I shot it down and restarted it. The RangeServer has spent the last 80 CPU-minutes (>115 minutes on the clock) in local_recover(). Is this normal?
>>>>>>
>>>>>> Looking around HDFS, I see around 3670 files in the server's /.../log/user/ directory, most of which are around 100 MB in size (total directory size: 351,031,700,665 bytes). I also see 4311 directories in the parent directory, of which 4309 are named with a 24-character hex string. Spot inspection shows that most (all?) of these contain a single zero-byte file named "0".
>>>>>>
>>>>>> The RangeServer log file since the restart currently contains over 835,000 lines. The bulk seems to be lines like:
>>>>>>
>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>> 1220752472 INFO Hypertable.RangeServer : (/home/josh/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:1553) replay_update - length=30
>>>>>>
>>>>>> The memory usage may be the same issue that Donald was reporting earlier in his discussion of fragmentation. The new RangeServer process has grown to 1.5 GB of memory again, but the max cache size is 200 MB (the default).
>>>>>>
>>>>>> I'd been loading into a 15-node Hypertable cluster all week using a single loader process. I'd loaded about 5 billion cells, or around 1.5 TB of data, before I decided to kill the loader because it was taking too long (and that one server was getting huge). The total data set is around 3.5 TB and it took under a week to generate the original set (using 15-way parallelism, not just a single loader), so I decided to try loading the rest in a distributed manner.
>>>>>>
>>>>>> The loading was happening in ascending row order, and it seems like all of the loading was happening on the same server. I'm guessing that when splits happened, the low range got moved off and the same server continued to load the end range. That might explain why one server was getting all the traffic.
>>>>>>
>>>>>> Looking at HDFS disk usage, the loaded server has 954 GB of disk used for Hadoop and the other 14 all have around 140 GB. This behavior also has me wondering what happens when that one machine fills up (another couple hundred GB). Does the whole system crash, or does HDFS get smarter about balancing?
