FYI - changing the subject line puts the email in a different thread. Probably best to avoid that.
--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Mon, Jan 28, 2013 at 11:24 AM, David Medinets <[email protected]> wrote:
> Accumulo fully recovered when I restarted the loggers. Very impressive.
>
> On Mon, Jan 28, 2013 at 9:32 AM, John Vines <[email protected]> wrote:
> > And make sure the loggers didn't fill up their disk.
> >
> > Sent from my phone, please pardon the typos and brevity.
> >
> > On Jan 28, 2013 8:54 AM, "Eric Newton" <[email protected]> wrote:
> >
> >> What version of Accumulo was this?
> >>
> >> So, you have evidence (such as a message in a log) that the tablet server ran out of memory? Can you post that information?
> >>
> >> The ingested data should have been captured in the write-ahead log and recovered when the server was restarted. There should never be any data loss.
> >>
> >> You should be able to ingest like this without a problem. It is a basic test. "Hold time" is the mechanism by which ingest is pushed back so that the tserver can get the data written to disk. You should not have to manually back off. Also, the tserver dynamically changes the point at which it flushes data from memory, so you should see less and less hold time.
> >>
> >> The garbage collector cannot run if the METADATA table is not online or has an inconsistent state.
> >>
> >> You are probably seeing a lower number of tablets because not all the tablets are online. They are probably offline due to failed recoveries.
> >>
> >> If you are running Accumulo 1.4, make sure you have stopped and restarted all the loggers on the system.
> >>
> >> -Eric
> >>
> >> On Mon, Jan 28, 2013 at 8:28 AM, David Medinets <[email protected]> wrote:
> >>
> >> > I had a plain Java program, single-threaded, that read an HDFS SequenceFile of fairly small Sqoop records (probably under 200 bytes each). As each record was read, a Mutation was created and then written to Accumulo via a BatchWriter. The program was as simple as it gets: read a record, write a mutation. The row id used YYYYMMDD (a date), so the ingest targeted one tablet. The ingest rate was over 150 million entries an hour for about 19 hours and everything seemed fine; over 3.5 billion entries were written. Then the nodes ran out of memory, the Accumulo processes on them died, and 90% of the servers were lost. Data poofed out of existence: only 800M entries are visible now.
> >> >
> >> > We restarted the data node processes and the cluster has been running garbage collection for over 2 days.
> >> >
> >> > I did not expect this simple approach to cause an issue. From looking at the log files, I think at least two compactions were running while we were still ingesting those 176 million entries per hour. The hold times started rising and eventually the system simply ran out of memory. I have no certainty about this explanation, though.
> >> >
> >> > My current thinking is to re-initialize Accumulo, find some way to programmatically monitor the hold time, and add a delay to the ingest process whenever the hold time rises over 30 seconds. Does that sound feasible?
> >> >
> >> > I know there are other approaches to ingest and I might give up this method and use another. I was trying to get some kind of baseline for analysis with this approach.
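
For reference, a minimal sketch of the kind of single-threaded ingest loop described in the original message, assuming the Accumulo 1.4 client API and the Hadoop 1.x SequenceFile reader. The instance name, ZooKeeper hosts, credentials, table name, file path, column family/qualifier, and the way a Sqoop record maps to a value are all placeholders, not taken from the thread:

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SequenceFileIngest {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details -- not from the thread.
        Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
            .getConnector("user", "secret".getBytes());

        // Accumulo 1.4-style BatchWriter: 50 MB buffer, 60 s max latency, 4 write threads.
        BatchWriter writer = conn.createBatchWriter("mytable", 50 * 1024 * 1024, 60 * 1000, 4);

        Configuration hconf = new Configuration();
        FileSystem fs = FileSystem.get(hconf);
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path("/data/sqoop/part-m-00000"), hconf);

        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), hconf);
        Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), hconf);

        while (reader.next(key, val)) {
          // Every record uses the same YYYYMMDD row id, so all mutations land on one tablet.
          Mutation m = new Mutation(new Text("20130128"));
          m.put(new Text("cf"), new Text(key.toString()), new Value(val.toString().getBytes()));
          writer.addMutation(m);
        }

        reader.close();
        writer.close(); // flushes any buffered mutations
      }
    }

As Eric notes in the thread, hold time is the server-side back-pressure that throttles this kind of loop through the BatchWriter, so the client normally should not need its own delay logic.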

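A rough sketch of the back-off idea from the original message, pausing ingest when hold time exceeds 30 seconds. Accumulo 1.4 does not expose a documented client call for hold time, so currentHoldTimeSeconds() below is a hypothetical stand-in for however the value would actually be obtained (for example, scraping the monitor); the 30-second threshold comes from the message, the sleep interval is arbitrary:

    public class HoldTimeThrottle {
      // Pause the ingest loop while the reported hold time stays above 30 seconds.
      // Intended to be called before each addMutation() batch.
      public static void waitForHoldTime() throws InterruptedException {
        while (currentHoldTimeSeconds() > 30) { // 30 s threshold from the original message
          Thread.sleep(5000);                   // arbitrary back-off interval
        }
      }

      // Hypothetical stand-in, not an Accumulo API: this would have to scrape the
      // monitor page or master statistics to get the current hold time.
      private static long currentHoldTimeSeconds() {
        return 0;
      }
    }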