I'm a little amazed you've been having so many problems :( Big files or not, I've never really seen this happen; it doesn't happen to me at all, let alone rarely.

Is your system/OS still acting weird? Are you seeing any serious OS errors, like aborted writes, etc.?

Can you run the mogilefsd tracker with the debug level raised, in screen, in the foreground, with output redirected to a file? Then mail it or upload it somewhere so we can take a look?
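For example, something like this should do it, assuming daemonize is turned off in your mogilefsd.conf (the config and log paths here are guesses; adjust for your install):

```shell
# Start a screen session so the tracker keeps running if you disconnect
screen -S mogile-tracker

# Inside screen: run the tracker in the foreground and capture all output
mogilefsd --config=/etc/mogilefs/mogilefsd.conf 2>&1 | tee /tmp/mogilefsd-debug.log
```

Detach with Ctrl-a d and the log keeps accumulating in /tmp/mogilefsd-debug.log.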

I highly doubt it's a code problem. The error messages admittedly aren't great, but I can't imagine why you'd be getting them unless you have hardware trouble or badly gummed-up OS/Perl libraries.

-Dormando


OK, that definitely helps.  Lighttpd is back on, and it doesn't look
like mogstored/lighttpd is dying.  Actually, it now looks like
something is wrong with the trackers instead.

The various system errors are similar to the ones I've seen before.
mogtool transfers about 70 chunks successfully and then starts giving
this error (over and over).

MogileFS backend error message: unknown_key unknown_key
System error message: Close failed at /usr/bin/mogtool line 816,
<Sock_minime336:7001> line 78.
This was try #1 and it's been 1.06 seconds since we first tried.
Retrying...

The failed chunks are retried over and over until I kill the job. Even after killing the mogtool job, I couldn't push any file, either with mogtool or with a simple script similar to Mark's test.

At this point I don't know if mogtool is at fault, but I don't
actually have any other way of getting large files into the system
short of rolling my own mogtool, which would likely be bad for all
its own reasons and confuse matters further.
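For reference, here is a rough sketch of what such a roll-your-own uploader might look like with MogileFS::Client. The domain, class, tracker host, and 64 MB chunk size are placeholders taken from the errors above, and the "key,N" chunk naming just mimics what mogtool appears to do; this is a sketch, not a replacement for mogtool's retry and metadata handling:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MogileFS::Client;

my ($path, $key) = @ARGV;
my $chunk_size = 64 * 1024 * 1024;   # assumed chunk size; tune to taste

my $mogc = MogileFS::Client->new(
    domain => 'dbbackups',                 # placeholder domain
    hosts  => [ 'minime336:7001' ],        # placeholder tracker
);

open my $in, '<', $path or die "open $path: $!";
binmode $in;

my $i = 0;
while (my $read = read($in, my $buf, $chunk_size)) {
    $i++;
    my $chunk_key = "$key,$i";             # mimic mogtool's "key,N" naming
    $mogc->store_content($chunk_key, 'dbbackups-recent', $buf)
        or die "chunk $i failed: " . ($mogc->errstr || 'unknown error');
    warn "stored $chunk_key (" . length($buf) . " bytes)\n";
}
close $in or die "close $path: $!";
```

Checking errstr after every store_content call at least surfaces the tracker's actual error string instead of dying inside the library.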

I am also seeing a large number of these errors:

System error message: MogileFS::Backend: tracker socket never became readable (minime336:7001) when sending command: [create_open domain=dbbackups&fid=0&class=dbbackups-recent&multi_dest=1&key=dwh-20080519-vol9,99 ] at /usr/lib/perl5/site_perl/5.8.5/MogileFS/Client.pm line 268

Thankfully these seem to be recoverable when mogtool retries, but they are disturbingly frequent. I need to understand what underlying condition is causing this error. I know it's entirely possible that it's a network/machine/storage/whatever problem, but I'm not sure where to look for more information or how to start troubleshooting. The same concern applies to all the errors I've seen, including:
> Close failed at /usr/bin/mogtool line 816
> unable to write to any allocated storage node at /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/IO/Handle.pm line 399
> Connection reset by peer
> tracker socket never became readable
> socket closed on read at /usr/lib/perl5/site_perl/5.8.5/MogileFS/NewHTTPFile.pm line 335
> couldn't connect to mogilefsd backend at /usr/lib/perl5/site_perl/5.8.5/MogileFS/Client.pm line 268

If they happen rarely, and the tool I'm using can recover, it's not a huge concern, but if they happen frequently enough to raise eyebrows (like 1 transaction in 10) or if they cause an endless loop where we can't recover, then it's a showstopper.

The general trend here seems to be that errors happen, and MogileFS simply aborts the current transaction with an unhelpful "die"-style message. I'm reasonably good with Perl, but I'm not familiar enough with this code base to dive into all the errors listed above and get to the root cause. There could be any number of underlying causes, but because the error handling is not great, I'm forced to read Perl code and guess at the root cause instead of going straight to the underlying problem.

Is there some log I should be looking at for more info, or some debugging flag I need to turn on?

Does anyone else have a winning strategy for dealing with very large files other than mogtool --bigfile?

Thanks again for the help and patience.
