Hello... I'm new to mogilefs and the list, so please feel free to redirect me to other resources that may be out there which I haven't found yet.

I just now finished setting up MogileFS for the first time. I have run into some problems that hopefully others have seen before and can help me with.

First, I had to make some changes to the code (even after checking out the current Subversion trunk) before make would run and the daemons would start. The bugs seemed to have been there a while, but they were also easy to fix. I'll post diffs if folks are interested, but I suspect almost everyone has fixed these on their own or the cluster wouldn't be operational; I'm only mentioning them here in case they indicate *I've* done something wrong. These were:
MogileFS/Worker.pm
  #warn "proc ${self}[$$] read: [$out]\n";
  # fails because @self doesn't exist (should be $self-> ?)
  warn "proc \${self}[\$\$] read: [$out]\n";
Gearman/Client/Async/Connection.pm
  #socket my $sock, PF_INET, SOCK_STREAM, IPPROTO_TCP;
  #missing parens around "my"
  # changed to
  my $sock; socket $sock, PF_INET, SOCK_STREAM, IPPROTO_TCP;
...and there may be one other change I'm now forgetting.

I'm now using mogtool to store the contents of an entire directory, and I'm running into some problems. (I'm starting with a 90 GB directory, but eventually these will be 1.5 TB directories.)

The first problem: when I started injecting files with mogtool, I got a lot of errors (now scrolled off my screen, but something like "could not put file, unknown_fileid"), and I observed that mogstored had stopped running on all 16 nodes. After I killed mogtool, fsck reported a lot of files, but listing the domain showed 0 files. I could not figure out how to hunt down and delete the chunks mogtool had already uploaded, since the target (and only) domain appeared to be empty. Unable to delete the files cleanly, I opted to drop the mogilefs database, nuke dev*/0, and restart everything from scratch.

On the second attempt mogstored stayed up and the upload completed in 52 minutes, but mogtool then proceeded to checksum everything and got stuck re-checksumming the same 6 chunks over and over for 18 more hours before I stopped it. It kept printing something like "retrying checksum for chunks: ... md5sum mismatch". I'm not sure what the correct behavior should be, but if both copies of a chunk fail their checksum and the original file (stream) is no longer available, at some point mogtool should probably declare failure and stop fetching the bad chunks.
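For what it's worth, the behavior I'd expect is a bounded retry rather than an infinite loop; a sketch of what I mean (fetch_and_checksum here is a hypothetical stand-in for whatever mogtool does per chunk, passed in as a coderef):

```perl
#!/usr/bin/perl
# Sketch of bounded retries for a chunk that keeps failing its checksum.
# $verify is a hypothetical stand-in for mogtool's per-chunk
# fetch-and-verify step; it returns true when the md5 matches.
use strict;
use warnings;

sub checksum_with_retries {
    my ($chunk, $verify, $max_tries) = @_;
    for my $try (1 .. $max_tries) {
        return 1 if $verify->($chunk);    # checksum matched
        warn "chunk $chunk: md5 mismatch (try $try of $max_tries)\n";
    }
    return 0;    # give up instead of retrying forever
}
```

That way a chunk whose replicas are both bad surfaces as a hard failure instead of 18 hours of retries.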

My first priority is to figure out why mogstored died and keep it from dying. Has this happened before, or frequently? Is it common practice to put a wrapper or sentinel around mogstored to restart it when it fails? Is there a log file where mogstored records warn or die messages? (I used an /etc/init.d/mogstored start script found in this list's archives, so perhaps I just need to replace the >/dev/null in that script with an actual file.)
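In case it helps to be concrete, this is the sort of wrapper I'm considering; it's only a sketch, and it assumes mogstored runs in the foreground and that /var/log/mogstored.log is writable (both assumptions on my part):

```perl
#!/usr/bin/perl
# Watchdog sketch: rerun a command whenever it exits, appending its
# output to a log file instead of >/dev/null. The command name, log
# path, restart limit, and pause are all my guesses, not anything
# mogstored itself requires.
use strict;
use warnings;

sub supervise {
    my ($cmd, $log, $max_restarts, $pause) = @_;
    my $runs = 0;
    while ($runs < $max_restarts) {
        $runs++;
        my $rc = system("$cmd >> $log 2>&1");
        warn sprintf("%s exited with status %d; restarting\n", $cmd, $rc >> 8);
        sleep $pause if $pause;
    }
    return $runs;    # number of times the command was launched
}

# e.g. supervise('mogstored', '/var/log/mogstored.log', 1_000_000, 5);
```

Even with a wrapper I'd still want to know *why* it dies, of course; the log redirect is mainly so the die message survives.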

My second priority is to figure out how to recover from a failed mogtool injection. The chunks definitely exist on the storage nodes, and I'm pretty sure the tracker knows about them, but if listing the domain with mogtool doesn't show them, how can I find and delete them? (I'll probably try the MogileFS::Client interface directly next.) If I ask mogtool to store the same ID again, also with --bigfile, will it overwrite the chunks it stored the first time, or will I need to invent something that finds orphaned bigfile chunks and removes them after a certain age?
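If nobody has a better answer, my plan with MogileFS::Client is roughly the sketch below: page through keys by prefix and delete them. The tracker host, domain, and especially the key prefix the orphaned chunks live under are guesses on my part — I don't yet know what key names mogtool actually used:

```perl
#!/usr/bin/perl
# Sketch: enumerate keys by prefix and delete them through a
# MogileFS::Client-style object ($client needs list_keys and delete).
use strict;
use warnings;

sub delete_by_prefix {
    my ($client, $prefix) = @_;
    my ($after, $deleted) = (undef, 0);
    while (1) {
        # list_keys returns a continuation token and an arrayref of keys
        (my $next, my $keys) = $client->list_keys($prefix, $after, 1000);
        last unless $keys && @$keys;
        for my $key (@$keys) {
            $client->delete($key);
            $deleted++;
        }
        $after = $next;
    }
    return $deleted;
}

# Intended usage (hosts, domain, and prefix are assumptions):
#   use MogileFS::Client;
#   my $mogc = MogileFS::Client->new(domain => 'mydomain',
#                                    hosts  => ['tracker:7001']);
#   delete_by_prefix($mogc, 'bigfile_');
```

Whether that even works depends on whether list_keys can see the orphaned keys at all, which is part of my question.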

Thinking ahead to a fix: what is the correct/desired behavior when a bigfile fails to inject? Would it be fairly easy to make mogtool aware of the incomplete bigfile and its chunks (possibly under a different master fileid?) so that future invocations of mogtool can delete them as expected? And in the case where we have put the chunks but fetching them back gives a bad checksum, what's the proper behavior? Would it be feasible to have the spawned child process wait a short time and then fetch its own chunk back, so it has a chance to put the data up again on a mismatch? I'm willing to spend some extra memory to have workers hold their chunk and verify the checksum before freeing it.
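The check I have in mind for the child is trivial: keep the chunk's buffer, fetch the stored copy back, and compare digests before freeing the memory. A sketch of just the comparison (how the bytes are re-fetched — presumably get_paths plus an HTTP GET — is left out):

```perl
#!/usr/bin/perl
# Sketch: decide whether a re-fetched chunk matches what we uploaded.
# The fetch itself is omitted; this is only the comparison the child
# process would run before releasing its buffer.
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

sub chunk_ok {
    my ($original, $fetched) = @_;
    return 0 unless defined $fetched;    # fetch failed outright
    return md5_hex($original) eq md5_hex($fetched);
}

# On a mismatch the child still holds $original, so it can re-put the
# chunk instead of leaving a corrupt copy for fsck to find later.
```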

At this point I'm not sure if I'm doing something wrong or if my experience is expected/typical, so any feedback (even if it's not an answer/suggestion) would be helpful. Have people used mogtool as part of a production system for storing huge files? Is it more common for people to implement their own chunking/splitting?

Thanks for any feedback.
gregc
