After more testing, I was able to reproduce the problem. As far as
I can tell, it is a minor bug (or you could call it a quirk) in
Lustre with DIRECT_IO writes when the file system is full.

With DIRECT_IO, doing a write() after a seek() to a location within an
existing file gives this error. The identical write() without DIRECT_IO
works just fine.

The Lustre error (-28) is ENOSPC, "No space left on device", but the
file is already much bigger than the location the code is trying to
write to, so there is enough disk space for the write itself. The same
test works on top of ext3.
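A minimal sketch of the failing sequence, assuming Python on Linux; the path, offset, and block size are placeholders, and reproducing the error of course needs the actual full Lustre mount:

```python
import errno
import mmap
import os

PATH = "/mnt/lustre/testfile"   # placeholder: an existing file on the full Lustre mount
BLOCK = 4096                    # O_DIRECT requires block-aligned offsets and buffers

def write_direct(path, offset, data):
    """seek() then write() within an existing file, opened with O_DIRECT."""
    fd = os.open(path, os.O_WRONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, BLOCK)   # mmap memory is page-aligned, as O_DIRECT needs
        buf.write(data[:BLOCK])
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, buf)            # on the full Lustre fs this came back as err == -28
    finally:
        os.close(fd)

# write_direct(PATH, 512 * BLOCK, b"x" * BLOCK)  # fails with -28 on the full fs

# -28 is the negated errno: 28 is ENOSPC, "no space left on device", not out of memory
print(errno.ENOSPC, os.strerror(errno.ENOSPC))
```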

I expect the problem is that Lustre needs some disk space for a
temporary structure and cannot get it, because the disk is full.

The workaround is easy: leave some space on the disk, or don't use
DIRECT_IO.
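One way to sanity-check that margin before enabling DIRECT_IO, sketched with os.statvfs; the mountpoint here is illustrative, not our actual Lustre mount:

```python
import os

def free_bytes(mountpoint):
    """Disk space available to unprivileged users on the given mount."""
    st = os.statvfs(mountpoint)
    return st.f_bavail * st.f_frsize

# e.g. keep a safety margin on a nearly-full fs before doing DIRECT_IO writes
print(free_bytes("/") > 0)
```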

Andreas asked:
> Why are you exactly creating a loopback file on top of the shared file?
> That can only hurt performance.

We have some applications that read and write directly to a block device
driver; they do this for performance.
If we replace the underlying file system with Lustre shared storage, we
either need to ask them to change their code to use a file, carve out a
non-shared volume for them, or provide a block device on top of a file.
I am investigating the performance hit of this last approach.

Thanks for your help, 

-David


-----Original Message-----
From: Andreas Dilger [mailto:[EMAIL PROTECTED] 
Sent: Sunday, January 28, 2007 7:17 PM
To: David Ramsthaler (dramstha)
Cc: [email protected]
Subject: Re: [Lustre-discuss] Error PTL_RPC_MSG_ERR in
ptlrpc_check_status()

On Jan 26, 2007  23:41 -0800, David Ramsthaler (dramstha) wrote:
> I am trying to run a performance test on Lustre, running Beta 5
> software. I am getting the following error message:
> 
>       LustreError: 4504:0:(client.c:579:ptlrpc_check_status()) @@@
>       type == PTL_RPC_MSG_ERR, err == -28

-28 = -ENOSPC (per /usr/include/asm/errno.h)

> I have a simple 2-node setup running beta 5 software. One node is
> exporting a 1 Gig disk. I have created a single file which is as big
> as I could make it before running out of disk space.
> 
> On the second node, I have used losetup to create a loop0 device on
> top of that same shared file. Then I run xdd device test program to read
> and write to sectors on that loop0 device.

Why are you exactly creating a loopback file on top of the shared file?
That can only hurt performance.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss