I wrote a week or two ago and asked for help with my "mogstored dying" problem; thanks to those who responded. Since then, I have upgraded all my nodes (16 storage nodes, 2 of which also act as trackers) to CentOS 5.1, which runs Perl 5.8.8. (The client machine has Perl 5.8.5.) I'm using the current Subversion tree (r1177) for trackers, storage nodes, and clients/utils.

Unfortunately I'm still having a problem with mogstored just dying, and I can't figure out why. Any help or pointers would be appreciated.

I'm currently using mogtool to push a large amount of data: 5 bigfiles with a total size of 2454G. I'm expecting that to be broken up into 39269 chunks of 64M each, and right now I've got about 19000 chunks copied.

My biggest problem right now is that mogstored just plain dies: it stops with no message to syslog or to its own output. Each of my 16 nodes has had mogstored die between 4 and 10 times. To keep the copy going, I check every minute that mogstored is running and restart it if it isn't. The only thing that appears in syslog is after it starts up again: "perlbal[pid]: beginning run".
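For what it's worth, the per-minute check is nothing fancy. A rough sketch of the kind of cron job I mean (the process name, the --daemonize flag, and the logger tag are assumptions from my setup, not anything shipped with MogileFS):

```shell
#!/bin/sh
# Minimal watchdog sketch, meant to be run from cron every minute.
# Restarts mogstored if no process with that exact name is found.

is_running() {
    # succeed if a process with exactly this name exists
    pgrep -x "$1" >/dev/null 2>&1
}

if ! is_running mogstored; then
    logger -t mog-watchdog "mogstored not running; restarting"
    mogstored --daemonize || logger -t mog-watchdog "restart failed"
fi
```

Installed with a crontab line like `* * * * * /usr/local/bin/mog-watchdog`, that gives at most a one-minute outage per crash, which is what I'm living with now.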

The start script I have been using passes --daemonize, so I ran mogstored without that flag and got a bit more output:
        Running.
        Out of memory!
        Out of memory!
        Callback called exit.
        Callback called exit.
        END failed--call queue aborted.
        beginning run
        Running.
There's a bit more information in mogtool's output, but I don't know whether these errors coincide with the mogstored crashes. Here are a few:

WARNING: Unable to save file 'collect-20080516-vol6,280': Close failed at /usr/bin/mogtool line 816, <Sock_minime336:7001> line 283.
MogileFS backend error message: unknown_key unknown_key
System error message: Close failed at /usr/bin/mogtool line 816, <Sock_minime336:7001> line 283.

WARNING: Unable to save file 'collect-20080516-vol6,311': MogileFS::NewHTTPFile: error reading from node for device 337007: Connection reset by peer at (eval 18) line 1
MogileFS backend error message: unknown_key unknown_key
System error message: MogileFS::NewHTTPFile: error reading from node for device 337007: Connection reset by peer at (eval 18) line 1

WARNING: Unable to save file 'collect-20080516-vol6,1341': MogileFS::NewHTTPFile: error writing to node for device 343012: Connection reset by peer at /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399
MogileFS backend error message: unknown_key unknown_key
System error message: MogileFS::NewHTTPFile: error writing to node for device 343012: Connection reset by peer at /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399

WARNING: Unable to save file 'collect-20080516-vol6,1736': Close failed at /usr/bin/mogtool line 816, <Sock_minime336:7001> line 1739.
MogileFS backend error message: unknown_key unknown_key
System error message: Close failed at /usr/bin/mogtool line 816, <Sock_minime336:7001> line 1739.

WARNING: Unable to save file 'collect-20080516-vol6,2373': MogileFS::NewHTTPFile: unable to write to any allocated storage node at /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399
MogileFS backend error message: unknown_key unknown_key
System error message: MogileFS::NewHTTPFile: unable to write to any allocated storage node at /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399

A few times I observed mogstored not responding to the tracker (mogadm check just pauses when listing that host) and in that case, killing and restarting mogstored brings it back. I could probably check for this condition too, but now we're getting beyond a "simple" wrapper/restart/sentinel script.
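Catching the "running but hung" case would mean probing the HTTP side rather than the process table. A sketch of what I have in mind, assuming mogstored is listening on its default HTTP port (7500 here; adjust if your httplisten differs) and that curl is available:

```shell
#!/bin/sh
# Liveness probe sketch: a hung mogstored passes a pgrep check but
# fails this. Any HTTP response (even a 404) counts as alive; only a
# connect failure or timeout counts as hung.

responds() {
    # succeed if an HTTP request to host:port completes within 5 seconds
    curl -s -m 5 -o /dev/null "http://$1:$2/"
}

if ! responds 127.0.0.1 7500; then
    logger -t mog-watchdog "mogstored not answering HTTP; restart needed"
    # kill and relaunch here, as I currently do by hand
fi
```

That turns the manual "mogadm check pauses, so go kill it" routine into something a sentinel script could do, for whatever that's worth.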
Is mogstored just plain dying a common experience, or is it pretty rare? If that were the only thing wrong, I could work around it by wrapping mogstored in a shell script that relaunches it as soon as it quits, but I'd rather not have to do that: I'd rather get at the root of the problem and make it not die in the first place.
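The wrapper I'm picturing would run mogstored in the foreground (no --daemonize, so the wrapper sees it exit) and relaunch it each time it dies. A sketch, with a restart cap of my own invention so a fast crash loop can't spin forever:

```shell
#!/bin/sh
# Respawn wrapper sketch: relaunch a command each time it exits,
# up to a maximum number of restarts.

respawn() {
    # $1 = max restarts, remaining args = command to supervise
    max=$1; shift
    n=0
    while [ "$n" -lt "$max" ]; do
        "$@"                     # runs until the supervised command exits
        n=$((n + 1))
        echo "respawn: exit detected, restart #$n" >&2
        sleep 1                  # brief pause to avoid a tight loop
    done
}

# Usage (foreground, so exits are visible to the wrapper):
# respawn 1000 mogstored
```

Again, I'd much rather not need this at all; it just papers over whatever is actually running mogstored out of memory.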


A more important question I have is: am I trying to do something MogileFS is totally not designed for? Is anyone else out there known to be using MogileFS for really huge files, chunked the way mogtool does it, and if so, were they happy with the results? If these are really minor problems, I could probably fix them myself, but I'm concerned that the lack of documentation about MogileFS's internals would hamper self-support efforts.


Thanks
gregc
