Well, to my mind the problem is that the client won't know to divvy up
the requests to more than one server with the basic dist. But I'm not
100% certain that's accurate.
If you have the client sending the work to all of the servers, then the
basic dist will translate logical to physical offsets exactly like you
want. It's easy enough to check; it should just require a single-line
change in the sys-create.sm client state machine.
Cheers,
Brad
John Bent wrote:
On Wed, 28 Jun 2006, Bradley W Settlemyer wrote:
Hmm, it's an interesting idea. You could try modifying the basic dist to
return all the servers rather than just one -- but I'm not sure whether
that still works exactly the way you want. You would probably need to
modify contiguous_length as well. It's not obviously workable, but it
doesn't sound impossible either.
Thanks. Not too thrilled about the "not obviously workable" part though. :)
I'm definitely looking for an easy way to do this. I'm trying to get PVFS2
to just transparently use the underlying file system. I got the filenames
correct, and I changed the trove layer in the server to get the write
offsets correct, but now the reading is all messed up.
Maybe the distribution idea would be good . . . but if you have any ideas
for any easier way to make this work, I'd love to hear them.
Thanks,
John
Cheers,
Brad
John Bent wrote:
Thanks Sam,
Sadly, however, your suggestion did not work directly out of the box.
The servers are not writing at the correct logical offsets and are
overwriting each other. Additionally, your approach suffers from the same
problem that mine did, which is that each datafile (which is actually now
the same shared datafile) is read completely by the client from each
server. Therefore, when reading a file, the client actually creates a new
file that is comprised of N copies of the file, where N is the number of
datafiles.
Perhaps the cleanest solution would be to create a new distribution and
pass this to the client. This distribution would simply instruct the
client to use the same physical offsets as logical. This should then work
equally well for reads and writes. I see in the io/description directory
that there are already distributions for simple-stripe, basic, and
varstrip. Is this a workable, good approach?
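Concretely, such a pass-through distribution's core callbacks might look like the standalone sketch below. The function names and argument lists are modeled loosely on the distribution callbacks in io/description and may not match the real PINT_dist_methods table exactly; this is only a sketch of the identity mapping, not a drop-in module.

```c
#include <stdint.h>

typedef int64_t PVFS_offset;
typedef int64_t PVFS_size;

/* Physical offset == logical offset, regardless of which server asks. */
static PVFS_offset passthru_logical_to_physical(void *params,
                                                uint32_t server_nr,
                                                uint32_t server_ct,
                                                PVFS_offset logical)
{
    (void)params; (void)server_nr; (void)server_ct;
    return logical;
}

/* ...and the inverse mapping is the identity too. */
static PVFS_offset passthru_physical_to_logical(void *params,
                                                uint32_t server_nr,
                                                uint32_t server_ct,
                                                PVFS_offset physical)
{
    (void)params; (void)server_nr; (void)server_ct;
    return physical;
}

/* Every server sees the rest of the file as one contiguous extent,
 * so requests are never chopped into per-strip pieces. */
static PVFS_size passthru_contiguous_length(void *params,
                                            uint32_t server_nr,
                                            uint32_t server_ct,
                                            PVFS_offset physical)
{
    (void)params; (void)server_nr; (void)server_ct;
    return INT64_MAX - physical;
}
```

Registering something like this alongside simple-stripe, basic, and varstrip would then let a file request the identity mapping through the normal distribution selection path.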
John
On Tue, 27 Jun 2006, Sam Lang wrote:
Hi John,
I think the best way (others can correct me) to modify the pvfs2 code
to get the trove layer to operate on the logical offsets and sizes is
in the flow code (flowproto_multiqueue.c). My reasoning is that the
flow layer converts the PVFS_Request structure into the physical
offsets and sizes and passes them on to the trove layer. The trove
layer doesn't care whether the offsets and sizes passed to it (via
trove_bstream_{read|write}_list) are logical or physical offsets; it
just uses those values to operate directly on the bstream file in the
normal case. In your case, the offsets and sizes passed to trove
could be logical, and trove wouldn't know the difference. In other
words, you shouldn't have to modify any of the distribution code, or
manipulate the offsets and sizes in the trove code, just use what the
flow layer gives you.
The changes to the flow layer require that PINT_process_request
return logical offsets instead of physical offsets (for both reads
and writes). It will do this if you pass PINT_CLIENT instead of
PINT_SERVER as the mode (5th argument). You will need to do this in
each instance of PINT_process_request where PINT_SERVER is used.
PINT_process_request is a bit hard to use in some cases, but these
changes are simple enough, and the offsets and sizes are treated
opaquely everywhere outside of the function (except in AIO of
course), which turns out to be a nice design of the framework in my
view.
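To make the logical-vs-physical distinction concrete, here is a toy model of what that 5th argument changes. PINT_SERVER and PINT_CLIENT are the real mode names from Sam's note; the function below is a simplified stand-in for PINT_process_request using simple-stripe-style math, not the real code.

```c
#include <stdint.h>

enum pint_mode { PINT_SERVER, PINT_CLIENT };

/* In PINT_SERVER mode the distribution collapses a logical offset to the
 * per-server physical offset inside that server's bstream; in PINT_CLIENT
 * mode the logical offset passes through untouched, which is what lets
 * trove operate directly at logical offsets. */
static int64_t process_request_offset(int64_t logical, enum pint_mode mode,
                                      int64_t strip_size, int server_count)
{
    if (mode == PINT_CLIENT)
        return logical;
    int64_t stripe = logical / (strip_size * server_count);
    return stripe * strip_size + logical % strip_size;
}
```

For example, with a 64 KB strip size and 4 servers, logical offset 196708 (100 bytes into the fourth strip) maps to physical offset 100 in server mode, but stays 196708 in client mode.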
One caveat: You will probably want to either turn off small IO,
which doesn't use the flow layer, or make the same modifications to
PINT_process_request in the small-io.sm. You can just turn it off by
compiling with CFLAGS=-DPVFS2_SMALL_IO_OFF.
-sam
On June 27, 2006, at 11:28AM, John Bent wrote:
Ok, I've removed the footnote. Now I'm doing everything within the new
trove layer and no longer doing it in PINT_distribute, although I did
change some things slightly. The problem was that PINT_ADD_SEGMENT was
combining the segments assuming they were in their own individual stripe.
However, since they now must be interspersed with segments from other
servers, they can no longer be combined. (Obviously, this will adversely
affect the performance of the old trove layer, so it calls for some layer
violation to only turn off merging depending on the trove layer selected.
I guess later, if I care about this, I can add a trove function to the
trove function table to this effect.)
I'm still, however, unable to read the files back correctly using the
pvfs2 servers.
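The merging problem can be illustrated with a standalone toy (not the real PINT_ADD_SEGMENT macro): two segments may only be coalesced when the first ends exactly where the second begins, and once a server's segments are interleaved in a shared file with other servers' data, that adjacency no longer holds.

```c
#include <stdint.h>

struct segment { int64_t offset; int64_t size; };

/* Extend *a to cover b and return 1 only when b starts exactly where
 * a ends; otherwise leave *a alone and return 0. */
static int try_merge(struct segment *a, const struct segment *b)
{
    if (a->offset + a->size == b->offset) {
        a->size += b->size;
        return 1;
    }
    return 0;
}
```

With private per-server bstreams, consecutive 64 KB segments at offsets 0 and 65536 merge into one 128 KB extent; in a shared-file layout with 4 servers, the same server's next segment lands a full stripe away at 262144, so the merge must be refused.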
John
On Tue, 27 Jun 2006, John Bent wrote:
Hello,
I'm working on a pet research project in which I'm (somewhat abashedly)
actually _removing_ functionality from PVFS2. What I'm trying to do is
create a new trove interface in which requests to disk are no longer
logically striped across multiple PVFS2 servers, each with its own
physical storage, but are rather passed transparently from the client
through PVFS2 onto a second, underlying shared file system on which each
PVFS2 server is mounted.
In order to do this, I have extended IO requests to pass the logical
filenames along with the handles, and I have further modified
PINT_distribute (footnote 1) to use the file distribution info to
translate its physical offset into the actual logical file offset and
then pass this logical offset to the PINT_ADD_SEGMENT macro.
This works in that files written to pvfs2 servers are transparently
created in the pvfs2 storage space. These files can then be correctly
read directly from the other underlying shared file system. However,
they can no longer be read correctly through the PVFS2 servers.
Perhaps when I write to the actual logical offsets instead of to the
striped offsets, I am fooling the pvfs2 servers into thinking those
logical offsets are actually the striped ones? When I try to read the
file back, I get a file that is N times the correct size, where N is the
number of data servers. What happens is that each server gives me each
segment of the file, thinking that segment is unique to it (at least
this is what I think is happening).
Does anyone have any suggestions on where else in the code I should look
to modify this?
Thanks,
John
footnote 1: It is not very clean to do this in the PINT_distribute
function. I did try to keep my changes isolated within the new trove
layer by passing the distribution info to the
trove_bstream_[read|write]_list functions, but this had the same problem
when I did the readback through the PVFS2 servers, as well as the
additional problem that the readback directly through the other shared
file system was _almost_ correct but somehow off by a little bit
(seemingly at the end of the file).
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers