Thanks Bart. In your example, what are the names and ports of each of
the servers involved? Are they all on the same node (with different
ports) by any chance?
thanks,
-Phil
On 05/04/2010 09:50 AM, Bart Taylor wrote:
The log file is attached.
I upgraded, let the file system start responding to pvfs2-ping, gave
both servers a SIGHUP to pick up the logging update, cleared the log
file, and gave it the dd command listed below. The second server never
logged anything.
$ dd if=/dev/zero of=/mnt/pvfs2/10M.zeros.3 bs=1M count=10
dd: writing `/mnt/pvfs2/10M.zeros.3': Connection timed out
1+0 records in
0+0 records out
Bart.
On Mon, May 3, 2010 at 11:22 AM, Phil Carns <[email protected]> wrote:
Can you get a server into this state (where everything works
except for files larger than the strip size), turn on verbose
logging, and then try to create a big file?
I'd like to see the log file from the metadata server for the file
in question. That server is the one that has to come up with the
pre-created file handles at that point and must be having a
problem. Even if the pre-create requests had failed up until
then, it is supposed to eventually sort things out.
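For the logging, something like this in the server config,
followed by a SIGHUP to each pvfs2-server so it re-reads the log
level, should do it (the EventLogging keyword and the verbose mask
are my assumption here; adjust to your version):
# in the <Defaults> section of the server config file
EventLogging verbose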
thanks,
-Phil
On 04/29/2010 04:51 PM, Bart Taylor wrote:
Yes, it does finish the Trove Migration and print similar
messages. The file system responds to requests; I just can't
create files larger than one strip size. Once I restart the file
system I can, but on first start, they fail.
Bart.
On Thu, Apr 29, 2010 at 1:50 PM, Kevin Harms <[email protected]> wrote:
Bart,
I think the server should print out when conversion starts
and ends.
examples:
Trove Migration Started: Ver=2.6.3
Trove Migration Complete: Ver=2.6.3
Trove Migration Set: 2.8.1
Does it get that far?
kevin
On Apr 29, 2010, at 1:55 PM, Bart Taylor wrote:
> Thanks for the information and suggestion Phil.
Unfortunately, I didn't get a different result after moving
that BMI init block. I also managed to reproduce this once while
leaving the trove method set to alt-aio, although that doesn't
seem directly related to the direction you were going.
>
> Another thing I noticed is that I can create files
successfully after the upgrade as long as the size is within
64k, which is the value of my strip_size distribution param.
Once the size exceeds that value, I start running into this
problem again.
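>
> For reference, the 64k comes from a Distribution block like this
in the fs config (roughly; the exact block syntax is per the PVFS2
config docs, and the value is just my 64k in bytes):
>
> # 64k strip size, in bytes
> <Distribution>
>     Name simple_stripe
>     Param strip_size
>     Value 65536
> </Distribution>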
>
> Does that help shed any more light on my situation?
>
> Bart.
>
>
> On Fri, Apr 16, 2010 at 1:39 PM, Phil Carns <[email protected]> wrote:
> Sadly none of my test boxes will run 2.6 any more, but I
have a theory about what the problem might be here.
>
> For some background, the pvfs2-server daemon does these
steps in order (among others): initializes BMI (networking),
initializes Trove (storage), and then finally starts
processing requests.
>
> In your case, two extra things are going on:
>
> - the trove initialization may take a while, because it has to
convert the format of all objects from v. 2.6 to 2.8, especially
if it is also switching to the o_direct format at the same time.
>
> - whichever server finishes first will immediately contact the
other servers to precreate handles for new files (a new feature
in 2.8)
>
> I'm guessing that one server finished the trove conversion
before the others and started its pre-create requests. The
other servers can't answer yet (because they are still busy
with trove), but since BMI is already running, the incoming
precreate requests just get queued up on the socket. When
the slow server finally does try to service them, the
requests are way out of date and have since been retried by
the fast server.
>
> I'm not sure exactly what goes wrong from there, but if
that's the cause, the solution might be relatively simple.
If you look in pvfs2-server.c, you can take the block of
code from "BMI_initialize(...)" to "*server_status_flag |=
SERVER_BMI_INIT;" and try moving that whole block to _after_
the "*server_status_flag |= SERVER_TROVE_INIT;" line that
indicates that trove is done.
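>
> To make the ordering concrete, here is a minimal, self-contained
sketch of the idea; the two stub functions are placeholders, not
the real BMI/Trove entry points, and only the flag names echo the
ones quoted above:
>
> #include <stdio.h>
>
> #define SERVER_BMI_INIT   (1 << 0)
> #define SERVER_TROVE_INIT (1 << 1)
>
> /* placeholders for the real BMI/Trove initialization calls */
> static int bmi_init_stub(void)   { puts("BMI up");   return 0; }
> static int trove_init_stub(void) { puts("trove up"); return 0; }
>
> int main(void)
> {
>     int server_status_flag = 0;
>
>     /* proposed order: finish trove (including any 2.6->2.8
>      * migration) before opening the network, so peers cannot
>      * queue stale precreate requests on our sockets meanwhile */
>     if (trove_init_stub() != 0)
>         return 1;
>     server_status_flag |= SERVER_TROVE_INIT;
>
>     /* only now bring up BMI and start accepting connections */
>     if (bmi_init_stub() != 0)
>         return 1;
>     server_status_flag |= SERVER_BMI_INIT;
>
>     printf("init done, status flags 0x%x\n",
>            (unsigned)server_status_flag);
>     return 0;
> }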
>
> -Phil
>
>
> On 03/30/2010 06:23 PM, Bart Taylor wrote:
>>
>> I am having some problems upgrading existing file systems
to 2.8. After I finish the upgrade and start the file system,
I cannot create files. Simple commands like dd and cp stall
until they time out and leave partial dirents like this:
>>
>> [bat...@client t]$ dd if=/dev/zero
of=/mnt/pvfs28/10MB.dat.6 bs=1M count=10
>> dd: writing `/mnt/pvfs28/10MB.dat.6': Connection timed out
>> 1+0 records in
>> 0+0 records out
>> 0 bytes (0 B) copied, 180.839 seconds, 0.0 kB/s
>>
>>
>> [r...@client ~]# ls -alh /mnt/pvfs28/
>> total 31M
>> drwxrwxrwt 1 root root 4.0K Mar 30 11:24 .
>> drwxr-xr-x 4 root root 4.0K Mar 23 13:38 ..
>> -rw-rw-r-- 1 batayl batayl 10M Mar 30 08:44 10MB.dat.1
>> -rw-rw-r-- 1 batayl batayl 10M Mar 30 08:44 10MB.dat.2
>> -rw-rw-r-- 1 batayl batayl 10M Mar 30 08:44 10MB.dat.3
>> ?--------- ? ? ? ? ? 10MB.dat.5
>> drwxrwxrwx 1 root root 4.0K Mar 29 14:06 lost+found
>>
>>
>> This happens both on local disk and on network storage,
but it only happens if the upgraded file system starts up the
first time using directio. If it is started with alt-aio as
the TroveMethod, everything works as expected. It also only
happens the first time the file system is started; if I stop
the server daemons and restart them, everything operates as
expected. I do have to kill -9 the server daemons, since they
will not exit gracefully.
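>>
>> For reference, the TroveMethod is the one set in the server
config; the two variants I tested look like this (illustrative
snippet, one line active at a time):
>>
>> # in the <Defaults> section of the server config file;
>> # directio fails on the first post-upgrade start,
>> # alt-aio works as expected
>> TroveMethod directio
>> # TroveMethod alt-aio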
>>
>> My test is running on RHEL4 U8 i386 with kernel version
2.6.9-89.ELsmp with two server nodes and one client. I was
unable to recreate the problem with a single server.
>>
>> I attached verbose server logs from the time the daemon was
started after the upgrade until the client failed, as well as
client logs from mount until the error was returned. The short
version is that one of the servers logs as many unstuff requests
as we have client retries configured, and the client fails at the
end of the allotted retries. The other server doesn't log
anything after starting.
>>
>> Has anyone seen anything similar or know what might be
going on?
>>
>> Bart.
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers