Re: [kvm-devel] [Qemu-devel] Re: Feedback and errors

2008-05-02 Thread Jamie Lokier
Daniel P. Berrange wrote:
> > 2/ two instances of kvm can be passed the same -hda. There is no locking
> > whatsoever. This messes up things seriously.
>
> That depends entirely on what you are doing with the disk in the guest OS.
>
> The disk could be hosting a cluster filesystem. The guest OS could be
> running on a read-only root FS. The disk could be application raw data
> storage which can be shared (e.g. Oracle RAC).

That reminds me, a read-only option for disk images would be handy
occasionally.  Writes would return errors, rather than going to an
expanding snapshot file.

And then, logically, any default locking for disk images (if you don't
disable it) would use shared locking for a read-only disk image.
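
For concreteness, a minimal sketch of what such default locking could
look like, using advisory flock() with a shared lock for read-only
images and an exclusive lock otherwise.  The function name is
illustrative, not QEMU's actual API:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* Returns the image fd, or -1 if another instance holds the lock. */
    int open_locked_image(const char *path, int read_only)
    {
        int fd = open(path, read_only ? O_RDONLY : O_RDWR);
        if (fd < 0)
            return -1;

        /* LOCK_NB: fail immediately rather than wait for the other
         * instance to exit.  Shared locks can coexist, so any number
         * of read-only instances may open the same image. */
        int op = (read_only ? LOCK_SH : LOCK_EX) | LOCK_NB;
        if (flock(fd, op) < 0) {
            fprintf(stderr, "%s: image is locked by another process\n",
                    path);
            close(fd);
            return -1;
        }
        return fd;
    }

An exclusive=no style option would then simply skip the flock() call.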

> > These two are upstream qemu problems. Copying qemu-devel.
> >
> > I guess using file locking by default would improve the situation, and
> > we can add a -drive ...,exclusive=no option for people playing with
> > cluster filesystems.
>
> Turning on file locking by default will break existing apps / deployments
> using shared disks. IMHO this is a policy decision that should be solved
> at a higher level in the management stack where a whole world view is
> available rather than QEMU which only knows about its own VM & host.

Imho disk locking should be on by default and easy to turn off.

Casual small-scale use of QEMU doesn't use a management stack; it
invokes qemu or kvm directly from the command line, or from a one-line
script.  In that scenario, locking against running two instances _by
mistake_ is most useful (it's easy to forget you have one running when
it's hidden on the desktop), and a wrapper is least likely to be used.

The few cluster deployments using shared disks will notice very
quickly that an additional option is needed for the new QEMU version.
It won't be the first time a new version has required a change to the
command line options to keep an existing deployment working.

-- Jamie



Re: [kvm-devel] [Qemu-devel] [PATCH 1/4] [PATCH] introduce QEMUAccel and fill it with interrupt specific driver

2008-05-02 Thread Jamie Lokier
Glauber Costa wrote:
> This patch introduces QEMUAccel, a placeholder for function pointers
> that aims at helping qemu to abstract accelerators such as kqemu and
> kvm (actually, the 'accelerator' name was proposed by avi kivity, since
> he loves referring to kvm that way).

Just a little thought...

Maybe 'VCPU' would be a clearer name?  QEMU provides its own VCPU, and
KQEMU+QEMU also provide one together.  While KVM provides essentially
one or more whole VCPUs by itself and uses only QEMU's drivers,
doesn't it?

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Anthony Liguori wrote:
> > Perhaps.  This raises another point about AIO vs. threads:
> >
> > If I submit sequential O_DIRECT reads with aio_read(), will they enter
> > the device read queue in the same order, and reach the disk in that
> > order (allowing for reordering when worthwhile by the elevator)?
>
> There's no guarantee that any sort of order will be preserved by AIO
> requests.  The same is true with writes.  This is what fdsync is for, to
> guarantee ordering.

You misunderstand.  I'm not talking about guarantees; I'm talking
about expectations for the performance effect.

Basically, to do performant streaming read with O_DIRECT you need two
things:

   1. Overlap at least 2 requests, so the device is kept busy.

   2. Requests be sent to the disk in a good order, which is usually
      (but not always) sequential offset order.

The kernel does this itself with buffered reads, doing readahead.
It works very well, unless you have other problems caused by readahead.

With O_DIRECT, an application has to do the equivalent of readahead
itself to get performant streaming.

If the app uses two threads calling pread(), it's hard to ensure the
kernel even _sees_ the first two calls in sequential offset order.
You spawn two threads, and then both threads call pread() with
non-deterministic scheduling.  The problem starts before even entering
the kernel.

Then, depending on I/O scheduling in the kernel, it might send the
worse-placed pread() to the disk immediately, then later do a backward
head seek to service the other one.  The elevator cannot fix this: it
doesn't have enough information, unless it adds artificial delays.
But artificial delays may harm too; it's not optimal.

After that, the two threads tend to call pread() in the best order
provided there's no scheduling conflicts, but are easily disrupted by
other tasks, especially on SMP (one reading thread per CPU, so when
one of them is descheduled, the other continues and issues a request
in the 'wrong' order.)

With AIO, even though you can't be sure what the kernel does, you can
be sure the kernel receives aio_read() calls in the exact order which
is most likely to perform well.  Application knowledge of its access
pattern is passed along better.
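
For illustration, a sketch of that submission pattern with POSIX AIO:
DEPTH sequential reads are kept in flight, all submitted from one
thread, so the kernel receives them in ascending offset order.  Error
handling is minimal, the caller is assumed to have opened fd with
O_DIRECT, and older glibc needs -lrt:

    #include <aio.h>
    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK (1 << 20)   /* 1 MiB per request */
    #define DEPTH 2           /* requests kept in flight */

    void stream_read(int fd)
    {
        struct aiocb cb[DEPTH];
        char *buf[DEPTH];
        off_t next = 0;

        /* Submit the first DEPTH chunks in strictly ascending order. */
        for (int i = 0; i < DEPTH; i++) {
            if (posix_memalign((void **)&buf[i], 4096, CHUNK) != 0)
                return;
            memset(&cb[i], 0, sizeof cb[i]);
            cb[i].aio_fildes = fd;
            cb[i].aio_buf = buf[i];
            cb[i].aio_nbytes = CHUNK;
            cb[i].aio_offset = next;
            next += CHUNK;
            aio_read(&cb[i]);
        }

        for (;;) {
            for (int i = 0; i < DEPTH; i++) {
                const struct aiocb *list[1] = { &cb[i] };
                /* Wait for the oldest outstanding request. */
                while (aio_error(&cb[i]) == EINPROGRESS)
                    aio_suspend(list, 1, NULL);
                if (aio_return(&cb[i]) <= 0)
                    return;   /* EOF or error */
                /* ...consume buf[i]..., then refill the queue with
                 * the next sequential chunk straight away. */
                cb[i].aio_offset = next;
                next += CHUNK;
                aio_read(&cb[i]);
            }
        }
    }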

As I've said, I saw a man page which described why this makes AIO
superior to using threads for reading tapes on that OS.  So it's not a
completely spurious point.

This has nothing to do with guarantees.

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Avi Kivity wrote:
> Anthony Liguori wrote:
> > > If I submit sequential O_DIRECT reads with aio_read(), will they enter
> > > the device read queue in the same order, and reach the disk in that
> > > order (allowing for reordering when worthwhile by the elevator)?
> >
> > There's no guarantee that any sort of order will be preserved by AIO
> > requests.  The same is true with writes.  This is what fdsync is for,
> > to guarantee ordering.
>
> I believe he'd like a hint to get good scheduling, not a guarantee.
> With a thread pool if the threads are scheduled out of order, so are
> your requests.
>
> If the elevator doesn't plug the queue, the first few requests may
> not be optimally sorted.

That's right.  Then they tend to settle to a good order.  But any
delay in scheduling one of the threads, or a signal received by one of
them, can make it lose order briefly, making the streaming stutter as
the disk performs a few local seeks until it settles to a good order
again.

You can mitigate the disruption in various ways.

  1. If all threads share an offset variable, and each reads and
     increments it atomically just prior to calling pread(), that
     helps, especially at the start.  (If threaded I/O is used for
     QEMU disk emulation, I would suggest doing that, in the more
     general form of popping a request from QEMU's internal shared
     queue at the last moment; see the sketch after this list.)

  2. Using more threads helps keep it sustained, at the cost of more
     wasted I/O when there's a cancellation (changed mind), and more
     memory.
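
A sketch of mitigation 1, where each reader thread claims the next
offset atomically at the last moment before the syscall.  GCC-style
__atomic builtins assumed, and the names are illustrative:

    #include <unistd.h>

    #define CHUNK (1 << 16)

    static off_t next_offset;         /* shared by all reader threads */

    void *reader_thread(void *arg)
    {
        int fd = *(int *)arg;
        char buf[CHUNK];

        for (;;) {
            /* Claim an offset as late as possible, so claims are
             * made in nearly the same order the preads are issued. */
            off_t off = __atomic_fetch_add(&next_offset, (off_t)CHUNK,
                                           __ATOMIC_SEQ_CST);
            ssize_t n = pread(fd, buf, CHUNK, off);
            if (n <= 0)
                break;
            /* ...consume n bytes read from offset off... */
        }
        return NULL;
    }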

However, AIO, in principle (if not in implementations...) could be
better at keeping the suggested I/O order than threads, without
special tricks.

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Avi Kivity wrote:
> > Perhaps.  This raises another point about AIO vs. threads:
> >
> > If I submit sequential O_DIRECT reads with aio_read(), will they enter
> > the device read queue in the same order, and reach the disk in that
> > order (allowing for reordering when worthwhile by the elevator)?
>
> Yes, unless the implementation in the kernel (or glibc) is threaded.
>
> > With threads this isn't guaranteed and scheduling makes it quite
> > likely to issue the parallel synchronous reads out of order, and for
> > them to reach the disk out of order because the elevator doesn't see
> > them simultaneously.
>
> If the disk is busy, it doesn't matter.  The requests will queue and the
> elevator will sort them out.  So it's just the first few requests that
> may get to disk out of order.

There are two cases where it matters to a read-streaming app:

1. Disk isn't busy with anything else, and maximum streaming
   performance is desired.

2. Disk is busy with unrelated things, but you're using I/O
   priorities to give the streaming app near-absolute priority.
   Then you need to maintain overlapped streaming requests,
   otherwise the disk is given to a lower priority I/O.  If that
   happens often, you lose: priority is ineffective.  Because one
   of the streaming requests is usually being serviced, the
   elevator has similar limitations as for a disk which is not
   busy with anything else.

> I haven't considered tape, but this is a good point indeed.  I expect it
> doesn't make much of a difference for a loaded disk.

Yes, as long as it's loaded with unrelated requests at the same I/O
priority, the elevator has time to sort requests and hide thread
scheduling artifacts.

Btw, regarding QEMU: QEMU gets requests _after_ sorting by the guest's
elevator, then submits them to the host's elevator.  If the guest and
host elevators are both configured 'anticipatory', do the anticipatory
delays add up?

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Avi Kivity wrote:
> > And video streaming on some embedded devices with no MMU!  (Due to the
> > page cache heuristics working poorly with no MMU, sustained reliable
> > streaming is managed with O_DIRECT and the app managing cache itself
> > (like a database), and that needs AIO to keep the request queue busy.
> > At least, that's the theory.)
>
> Could use threads as well, no?

Perhaps.  This raises another point about AIO vs. threads:

If I submit sequential O_DIRECT reads with aio_read(), will they enter
the device read queue in the same order, and reach the disk in that
order (allowing for reordering when worthwhile by the elevator)?

With threads this isn't guaranteed and scheduling makes it quite
likely to issue the parallel synchronous reads out of order, and for
them to reach the disk out of order because the elevator doesn't see
them simultaneously.

With AIO (non-glibc, and not a kernel-threads implementation) it might
be better at keeping the intended issue order; I'm not sure.

It is highly desirable: O_DIRECT streaming performance depends on
avoiding seeks (no reordering) and on keeping the request queue
non-empty (no gap).

I read a man page for some other unix, describing AIO as better than
threaded parallel reads for reading tape drives because of this (tape
seeks are very expensive).  But the rest of the man page didn't say
anything more.  Unfortunately I don't remember where I read it.  I
have no idea whether AIO submission order is nearly always preserved
in general, or expected to be.

> It's me at fault here.  I just assumed that because it's easy to do aio
> in a thread pool efficiently, that's what glibc does.
>
> Unfortunately the code does some ridiculous things like not service
> multiple requests on a single fd in parallel.  I see absolutely no
> reason for it (the code says "fight for resources").

Ouch.  Perhaps that relates to my thought above, about multiple
requests to the same file causing seek storms when thread scheduling
is unlucky?

> So my comments only apply to linux-aio vs a sane thread pool.  Sorry for
> spreading confusion.

Thanks.  I thought you'd measured it :-)

> It could and should.  It probably doesn't.
>
> A simple thread pool implementation could come within 10% of Linux aio
> for most workloads.  It will never be exact, but for small numbers
> of disks, close enough.

I would wait for benchmark results for I/O patterns like sequential
reading and writing, because of potential for seeks caused by request
reordering, before being confident of that.

> > Hmm.  Thanks.  I may consider switching to XFS now
>
> I'm rooting for btrfs myself.

In the unlikely event they backport btrfs to kernel 2.4.26-uc0, I'll
be happy to give it a try! :-)

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-21 Thread Jamie Lokier
Avi Kivity wrote:
> > At such a tiny difference, I'm wondering why Linux-AIO exists at all,
> > as it complicates the kernel rather a lot.  I can see the theoretical
> > appeal, but if performance is so marginal, I'm surprised it's in
> > there.
>
> Linux aio exists, but that's all that can be said for it.  It works
> mostly for raw disks, doesn't integrate with networking, and doesn't
> advance at the same pace as the rest of the kernel.  I believe only
> databases use it (and a userspace filesystem I wrote some time ago).

And video streaming on some embedded devices with no MMU!  (Due to the
page cache heuristics working poorly with no MMU, sustained reliable
streaming is managed with O_DIRECT and the app managing cache itself
(like a database), and that needs AIO to keep the request queue busy.
At least, that's the theory.)

> > I'm also surprised the Glibc implementation of AIO using ordinary
> > threads is so close to it.
>
> Why are you surprised?

Because I've read that Glibc AIO (which uses a thread pool) is a
relatively poor performer as AIO implementations go, and is only there
for API compatibility, not suggested for performance.

But I read that quite a while ago, perhaps it's changed.

> Actually the glibc implementation could be improved from what I've
> heard.  My estimates are for a thread pool implementation, but there is
> no reason why glibc couldn't achieve exactly the same performance.

Erm...  I thought you said it _does_ achieve nearly the same
performance, not that it _could_.

Do you mean it could achieve exactly the same performance by using
Linux AIO when possible?

> > And then, I'm wondering why use AIO at
> > all: it suggests QEMU would run about as fast doing synchronous I/O in
> > a few dedicated I/O threads.
>
> Posix aio is the unix API for this, why not use it?

Because far more host platforms have threads than have POSIX AIO.  (I
suspect both options will end up supported in the end, as dedicated
I/O threads were already suggested for other things.)

> > > Also, I'd presume that those that need 10K IOPS and above will not place
> > > their high throughput images on a filesystem; rather on a separate SAN
> > > LUN.
> >
> > Does the separate LUN make any difference?  I thought O_DIRECT on a
> > filesystem was meant to be pretty close to block device performance.
>
> On a good extent-based filesystem like XFS you will get good performance
> (though more cpu overhead due to needing to go through additional
> mapping layers).  Old clunkers like ext3 will require additional seeks or
> a ton of cache (1 GB per 1 TB).

Hmm.  Thanks.  I may consider switching to XFS now

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-20 Thread Jamie Lokier
Avi Kivity wrote:
> For the majority of deployments posix aio should be sufficient.  The few
> that need something else can use Linux aio.

Does that mean "for the majority of deployments, the slow version is
sufficient; the few that care about performance can use Linux AIO"?

I'm under the impression that the entire and only point of Linux AIO
is that it's faster than POSIX AIO on Linux.

> Of course, a managed environment can use Linux aio unconditionally if
> it knows the kernel has all the needed goodies.

Does that mean a managed environment can have some code which checks
the host kernel version + filesystem type holding the VM image, to
conditionally enable Linux AIO?  (Since if you care about
performance, which is the sole reason for using Linux AIO, you
wouldn't want to enable it on any host in your cluster where it
will trash performance.)
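
The kernel-version half of such a check is easy enough to sketch; the
threshold used by a caller would be an arbitrary placeholder, not a
known-good cut-off:

    #include <stdio.h>
    #include <sys/utsname.h>

    /* Returns nonzero if the running kernel is at least maj.min.patch. */
    int kernel_at_least(int maj, int min, int patch)
    {
        struct utsname u;
        int a = 0, b = 0, c = 0;

        if (uname(&u) < 0)
            return 0;
        sscanf(u.release, "%d.%d.%d", &a, &b, &c);
        if (a != maj)
            return a > maj;
        if (b != min)
            return b > min;
        return c >= patch;
    }

    /* e.g. only consider Linux AIO if kernel_at_least(2, 6, 22) ... */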

Just wondering.

Thanks,
-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-18 Thread Jamie Lokier
Daniel P. Berrange wrote:
> > Those cases aren't always discoverable.  Linux-aio just falls back to
> > using synchronous IO.  It's pretty terrible.  We need a new AIO
> > interface for Linux (and yes, we're working on this).  Once we have
> > something better, we'll change that to be the default and things will
> > Just Work for most users.
>
> If QEMU can't discover cases where it won't work, what criteria should
> the end user use to decide between the impls, or for that matter, what
> criteria should a management api/app like libvirt use?  If the only
> decision logic is 'try it & benchmark your VM' then it's not a
> particularly useful option.

Good use of Linux-AIO requires that you basically know which cases
it handles well, and which ones it doesn't.  Falling back to
synchronous I/O with no indication (except speed) is a pretty
atrocious API imho.  But that's what the Linux folks decided to do.

I suspect what you have to do is:

1. Try opening the file with O_DIRECT.
2. Use fstat to check the filesystem type and block device type.
3. If it's on a whitelist of filesystem types,
4. and a whitelist of block device types,
5. and the kernel version is later than an fs+bd-dependent value,
6. then select an alignment size (kernel version dependent)
   and use Linux-AIO with it.

Otherwise don't use Linux-AIO.  You may then decide to use Glibc's
POSIX-AIO (which uses threads), or use threads for I/O yourself.
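
Steps 1-4 might look something like the following sketch.  The
whitelist is illustrative only - the two magic numbers are ext2/ext3's
and XFS's statfs f_type values, but which filesystems actually belong
on the list is exactly the hard-won knowledge in question:

    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <fcntl.h>
    #include <sys/vfs.h>
    #include <unistd.h>

    #define EXT23_SUPER_MAGIC 0xEF53        /* ext2 and ext3 share it */
    #define XFS_SUPER_MAGIC   0x58465342    /* "XFSB" */

    int can_use_linux_aio(const char *path)
    {
        int fd = open(path, O_RDWR | O_DIRECT);
        if (fd < 0)
            return 0;            /* no O_DIRECT, no Linux-AIO */

        struct statfs st;
        int ok = fstatfs(fd, &st) == 0 &&
                 (st.f_type == EXT23_SUPER_MAGIC ||
                  st.f_type == XFS_SUPER_MAGIC);
        /* A real implementation would also check the block device
         * type and the kernel version (steps 4-5) before deciding. */
        close(fd);
        return ok;
    }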

In future, the above recipe will be more complicated, in that you have
to use the same decision tree to decide between:

- Synchronous IO.
- Your own thread based IO.
- Glibc POSIX-AIO using threads.
- Linux-AIO.
- Virtio thing or whatever is based around vringfd.
- Syslets if they gain traction and perform well.

> I've basically got a choice of making libvirt always add '-aio linux'
> or never add it at all. My inclination is to the latter since it is
> compatible with existing QEMU which has no -aio option. Presumably
> '-aio linux' is intended to provide some performance benefit so it'd
> be nice to use it. If we can't express some criteria under which it
> should be turned on, I can't enable it; whereas if you can express
> some criteria, then QEMU should apply them automatically.

I'm of the view that '-aio auto' would be a really good option - and
when it's proven itself, it should be the default.  It could work on
all QEMU hosts: it would pick synchronous IO when there is nothing else.

The criteria for selecting a good AIO strategy on Linux are quite
complex, and might be worth hard coding.  In that case, putting them
into QEMU itself would be much better than every program which
launches QEMU having its own implementation of the criteria.

> Pushing this choice of AIO impls to the app or user invoking QEMU just
> does not seem like a win here.

I think having the choice is very good, because whatever the hard
coded selection criteria, there will be times when it's wrong (ideally
in conservative ways - it should always be functional, just suboptimal).

So I do support this patch to add the switch.

But _forcing_ the user to decide is not good, since the criteria are
rather obscure and change with things like filesystem.  At least, a
set of command line options to QEMU ought to work when you copy a VM
to another machine!

So I think '-aio auto', which invokes the selection criteria of the
day and is guaranteed to work (conservatively picking a slower method
if it cannot be sure a faster one will work) would be the most useful
option of all.

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-18 Thread Jamie Lokier
Anthony Liguori wrote:
> > I'm of the view that '-aio auto' would be a really good option - and
> > when it's proven itself, it should be the default.  It could work on
> > all QEMU hosts: it would pick synchronous IO when there is nothing else.
>
> Right now, not specifying the -aio option is equivalent to your proposed
> -aio auto.
>
> I guess I should include an "info aio" to let the user know what type of
> aio they are using.  We can add selection criteria later but
> semantically, not specifying an explicit -aio option allows QEMU to
> choose whichever one it thinks is best.

Great.  I guess the next step is to add selection criteria, otherwise
a million Wikis will tell everyone to use '-aio linux' :-)

Do you know what the selection criteria should be - or is there a
document/paper somewhere which says (ideally from benchmarks)?  I'm
interested for an unrelated project using AIO - so I'm willing to help
get this right to some extent.

-- Jamie



Re: [kvm-devel] [Qemu-devel] [PATCH] QEMU: fsync AIO writes on flush request

2008-03-28 Thread Jamie Lokier
Marcelo Tosatti wrote:
> It's necessary to guarantee that pending AIO writes have reached stable
> storage when the flush request returns.
>
> Also change fsync() to fdatasync(), since the modification time is not
> critical data.

> +    if (aio_fsync(O_DSYNC, &acb->aiocb) < 0) {

>      BDRVRawState *s = bs->opaque;
> -    fsync(s->fd);
> +    raw_aio_flush(bs);
> +    fdatasync(s->fd);
> +
> +    /* We rely on the fact that no other AIO will be submitted
> +     * in parallel, but this should be fixed by per-device
> +     * AIO queues when allowing multiple CPU's to process IO
> +     * in QEMU.
> +     */
> +    qemu_aio_flush();

I'm a bit confused by this.  Why do you need aio_fsync(O_DSYNC) _and_
synchronous fdatasync() calls?  Aren't they equivalent?

-- Jamie



Re: [kvm-devel] [Qemu-devel] [PATCH] QEMU: fsync AIO writes on flush request

2008-03-28 Thread Jamie Lokier
Marcelo Tosatti wrote:
> On Fri, Mar 28, 2008 at 03:07:03PM +, Jamie Lokier wrote:
> > Marcelo Tosatti wrote:
> > > It's necessary to guarantee that pending AIO writes have reached stable
> > > storage when the flush request returns.
> > >
> > > Also change fsync() to fdatasync(), since the modification time is not
> > > critical data.
> > > +    if (aio_fsync(O_DSYNC, &acb->aiocb) < 0) {
> > >
> > >      BDRVRawState *s = bs->opaque;
> > > -    fsync(s->fd);
> > > +    raw_aio_flush(bs);
> > > +    fdatasync(s->fd);
> > > +
> > > +    /* We rely on the fact that no other AIO will be submitted
> > > +     * in parallel, but this should be fixed by per-device
> > > +     * AIO queues when allowing multiple CPU's to process IO
> > > +     * in QEMU.
> > > +     */
> > > +    qemu_aio_flush();
> >
> > I'm a bit confused by this.  Why do you need aio_fsync(O_DSYNC) _and_
> > synchronous fdatasync() calls?  Aren't they equivalent?
>
> fdatasync() will write and wait for completion of dirty file data
> present in memory.
>
> aio_write() only queues data for submission:
>
>     The "asynchronous" means that this call returns as soon as the request
>     has been enqueued; the write may or may not have completed when the
>     call returns.  One tests for completion using aio_error(3).
>
> So fdatasync() is not enough because data written via AIO may not
> have been reflected as dirty file data through write() by the time
> raw_flush() is called.

Sure.  But why isn't the aio_fsync(O_DSYNC) enough by itself?

It seems to me you should have something like this:

    /* Flush pending aio_writes until they are dirty data,
       and wait, before the aio_fsync. */
    qemu_aio_flush();

    /* Call aio_fsync(O_DSYNC). */
    raw_aio_flush(bs);

    /* Wait for the aio_fsync to complete. */
    qemu_aio_flush();

What am I missing?

-- Jamie



Re: [kvm-devel] [Qemu-devel] [PATCH] QEMU: fsync AIO writes on flush request

2008-03-28 Thread Jamie Lokier
Marcelo Tosatti wrote:
> I don't think the first qemu_aio_flush() is necessary because the fsync
> request will be enqueued after pending ones:
>
>     The aio_fsync() function does a sync on all outstanding
>     asynchronous I/O operations associated with
>     aiocbp->aio_fildes.
>
>     More precisely, if op is O_SYNC, then all currently queued
>     I/O operations shall be completed as if by a call of
>     fsync(2), and if op is O_DSYNC, this call is the asynchronous
>     analog of fdatasync(2).  Note that this is a request only —
>     this call does not wait for I/O completion.
>
> glibc sets the priority for fsync as 0, which is the same priority at
> which AIO reads and writes are submitted by QEMU.

Do AIO operations always get executed in the order they are submitted?

I was under the impression this is not guaranteed, as relaxed ordering
permits better I/O scheduling (e.g. to reduce disk seeks) - which is
one of the most useful points of AIO.  (Otherwise you might as well
just have one worker thread doing synchronous IO in order).

And because of that, I was under the impression the only way to
implement a write barrier+flush in AIO was (1) wait for pending writes
to complete, then (2) aio_fsync, then (3) wait for the aio_fsync.
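
In POSIX AIO terms, that three-step barrier looks roughly like this
sketch (the function and its arguments are illustrative):

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>      /* O_DSYNC */
    #include <string.h>

    int aio_write_barrier(int fd, struct aiocb **writes, int nwrites)
    {
        /* 1: wait until every in-flight write has completed. */
        for (int i = 0; i < nwrites; i++) {
            const struct aiocb *one[1] = { writes[i] };
            while (aio_error(writes[i]) == EINPROGRESS)
                aio_suspend(one, 1, NULL);
            if (aio_return(writes[i]) < 0)
                return -1;
        }

        /* 2: the writes are now dirty file data; ask for a flush. */
        struct aiocb sync_cb;
        memset(&sync_cb, 0, sizeof sync_cb);
        sync_cb.aio_fildes = fd;
        if (aio_fsync(O_DSYNC, &sync_cb) < 0)
            return -1;

        /* 3: wait for the fsync itself to complete. */
        const struct aiocb *pending[1] = { &sync_cb };
        while (aio_error(&sync_cb) == EINPROGRESS)
            aio_suspend(pending, 1, NULL);
        return aio_return(&sync_cb);
    }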

I could be wrong, but I haven't seen any documentation which says
otherwise, and it's what I'd expect of an implementation.  I.e. it's
just an asynchronous version of fsync().

The quoted man page doesn't convince me.  It says "all currently
queued I/O operations shall be completed", which _could_ mean that
aio_fsync is an AIO barrier too.

But then "as if by a call of fsync(2)" implies that aio_fsync+aio_suspend
could just be replaced by fsync() with no change of semantics.  So
"queued I/O operations" means what fsync() handles: dirty file data,
not in-flight AIO writes.

And you already noticed that fsync() is _not_ guaranteed to flush
in-flight AIO writes.  Being the asynchronous analog, aio_fsync()
would not either.

-- Jamie



Re: [kvm-devel] [Qemu-devel] [PATCH] QEMU: fsync AIO writes on flush request

2008-03-28 Thread Jamie Lokier
Marcelo Tosatti wrote:
>  static void raw_flush(BlockDriverState *bs)
>  {
>      BDRVRawState *s = bs->opaque;
> -    fsync(s->fd);
> +    raw_aio_flush(bs);
> +
> +    /* We rely on the fact that no other AIO will be submitted
> +     * in parallel, but this should be fixed by per-device
> +     * AIO queues when allowing multiple CPU's to process IO
> +     * in QEMU.
> +     */
> +    qemu_aio_flush();
>  }

It depends what raw_flush() is used for.

If you want to be sure this flushes AIO writes in-flight at the time
of the call, I still reckon you need an extra qemu_aio_flush() before
raw_aio_flush() - on at least some POSIX AIO implementations.  (I'd be
very interested to find out otherwise, if you know better).

But if, as Ian Jackson suggests, raw_flush() is _only_ used when the
guest driver issues a CACHE FLUSH command _and_ the guest driver
either cannot overlap operations, or cannot depend on overlapping
operations occurring in order, then you don't need it.

That'll depend on what kind of device is emulated.  Does the SCSI
emulation handle multiple in-flight commands with any guarantee on
order?  To be on the safe side, I'd include the extra qemu_aio_flush,
as I expect it's very unlikely to harm performance and might save
someone's data.

-- Jamie




Re: [kvm-devel] [Qemu-devel] [PATCH] QEMU: fsync AIO writes on flush request

2008-03-28 Thread Jamie Lokier
Paul Brook wrote:
> > That'll depend on what kind of device is emulated.  Does the SCSI
> > emulation handle multiple in-flight commands with any guarantee on
> > order?
>
> SCSI definitely allows (and we emulate) multiple in flight commands.
> I can't find any requirement that writes must complete before a
> subsequent SYNCHRONISE_CACHE. However I don't claim to know the spec
> that well,

Aren't there SCSI tagged barrier commands or something like that,
which allow a host to request ordering between commands?

> it's probably not a bad idea to have them complete anyway. Preferably
> this would be a completely asynchronous operation, i.e. the sync
> command returns immediately, but only completes when all preceding
> writes have completed and been flushed to disk.

I agree, that seems the optimal implementation.

-- Jamie



Re: [kvm-devel] [Qemu-devel] [PATCH] use a thread id variable

2008-03-09 Thread Jamie Lokier
Gilad Ben-Yossef wrote:
> Glauber Costa wrote:
> > This patch introduces a thread_id variable to CPUState.
> > Its duty will be to hold the process, or more generally, thread
> > id of the current executing cpu.
> >
> >      env->nb_watchpoints = 0;
> > +#ifdef __WIN32
> > +    env->thread_id = GetCurrentProcessId();
> > +#else
> > +    env->thread_id = getpid();
> > +#endif
> >      *penv = env;
>
> hmm... maybe I'm missing something, but in Linux at least I think you
> would prefer this to be gettid() rather than getpid() as each CPU has
> its own thread, not a different process.

On most platforms, getpid() returns the same value for all threads, so
it's not useful as a thread id.

On Linux, it depends on the threading implementation.  The old package,
LinuxThreads, has a different getpid() for each thread.  The current
one, NPTL, has them all the same.

What you're supposed to do with pthreads in general is use pthread_self().

Btw, unfortunately pthread_self() is not safe to call from signal
handlers.
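
On Linux the reliable thread id in a signal handler is the raw
syscall, since glibc (at the time) has no gettid() wrapper:

    #define _GNU_SOURCE
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Kernel thread id: unique per thread under NPTL, unlike getpid(),
     * and safe to call from a signal handler. */
    static pid_t my_gettid(void)
    {
        return (pid_t)syscall(SYS_gettid);
    }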

-- Jamie



Re: [kvm-devel] [Qemu-devel] [PATCH] use a thread id variable

2008-03-09 Thread Jamie Lokier
M. Warner Losh wrote:
> In message: [EMAIL PROTECTED]
>             Jamie Lokier [EMAIL PROTECTED] writes:
> : Btw, unfortunately pthread_self() is not safe to call from signal
> : handlers.
>
> And also often times meaningless, as signal handlers can run in
> arbitrary threads...

That's usually the case, but sometimes it is useful.  Some causes of
signals are thread specific, or can be asked to be, and it's nice to
know which thread is receiving them (e.g. thread specific timers,
SIGIOs, write-protection SEGVs, and even sending messages with good
old pthread_kill (same reason as kernel uses IPIs)).

GCC's Boehm garbage collector uses pthread_self() from a signal
handler.  I've used gettid() in a signal handler on a few occasions.


-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: expose host CPU features to guests

2007-09-10 Thread Jamie Lokier
Paul Brook wrote:
> > > What you really want to do is ask your virtualization module what
> > > features it supports.
> >
> > Yes, that needs to be an additional filter.
>
> I'd have thought that would be the *only* interesting set for autodetection.

If that means the same as the features which are efficient for the
guest, then I agree.  If there's a difference, I'd have thought you'd
normally want the guest to use only those features which work at
near-native performance, not those which involve a trap and long path
through the virtualisation/emulation, even if they're supported.  No
example comes to mind, but that seems like the principle to go for, to
me.

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: expose host CPU features to guests

2007-09-09 Thread Jamie Lokier
Avi Kivity wrote:
> Well, the guest will invoke its own workaround logic to disable buggy
> features, so I see no issue here.

The guest can only do this if it has exactly the correct id
information for the host processor (e.g. "this is an Intel Pentium Pro
model XXX"), not just the list of safe-to-use CPU features.

This isn't possible in a virtualisation cluster of many different CPUs
where the point is to advertise the common set of cpuid features, and
for the guest images to be able to migrate between different CPUs in
the cluster.

Then, the common cpuid features must be found by combining the
/proc/cpuinfo from each node in the cluster.  But I guess that's
separate from the part of Qemu we are discussing; it would be done by
another program, preparing the -cpuid argument.

But sometimes it's good to run a simple guest (e.g. someone's pet OS
project, or anything written for Intel only which was more common in
the past) which doesn't have all the detailed obscure workarounds of
something like Linux, but still be able to take advantage of the
workarounds and obscure knowledge in the host.

The alternative is Qemu itself may end up having to have some of these
obscure workarounds :/

For example, the sysenter instruction is advertised on early Pentium
Pros, but it doesn't work.  Linux removes it from the features in
/proc/cpuinfo, and doesn't use it.  But some guests simply don't get
that obscure, and use it if cpuid advertises it.  Of course, they
don't work on a _real_ early Pentium Pro.  But it would be nice if
they did work without anything special when run in Qemu on such a
host.  That's an old chip which I happen to know about, but I'm sure
there are more modern similar issues.

Perhaps there could be two options then: -cpuid host-os, and -cpuid
host-cpuid.  I would suggest making "host" an alias for "host-os",
but I wouldn't object if it were an alias for "host-cpuid" or didn't
exist at all, so you had to choose one.

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: expose host CPU features to guests

2007-09-09 Thread Jamie Lokier
Avi Kivity wrote:
> Let's start with '-cpu host' as '-cpu host-cpuid' and implement '-cpu
> host-os' on the first bug report?  I have a feeling we won't ever see it.

I have a feeling you won't ever see it either, but not because it's a
missing feature.

Instead, I think a very small number of users will spend hours
frustrated that some obscure guest doesn't work properly on their
obscure x86 hardware, then they will learn that they should not use
-cpuid host for that guest on that hardware, even though it works
fine with other guests, and then their problem will be solved (albeit
at a cost), and seen as such an obscure combination that it might
never be reported to Qemu developers.

In other words, "host-os" is what _I'd_ implement because I care too
much about the poor obscure users and think it's the safe option, but
I'm not doing the implementing here ;-)

If you are curious what the differences are, do this in a current
Linux source tree:

egrep -R '(set|clear)_bit\(X86_FEATURE' arch/{i386,x86_64}/kernel

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: expose host CPU features to guests

2007-09-09 Thread Jamie Lokier
Avi Kivity wrote:
> > I agree. If the host OS has disabled a feature, it's a fair bet it's done
> > that for a reason.
>
> The reason may not be relevant to the guest.

For most guests the relevant features are those which work correctly
and efficiently on the virtual CPU.

If the host OS has disabled a feature, most often that's because the
feature doesn't work on the specific host CPU model, but not always.
It might be emulated well, but probably not efficiently.

In some cases, you might want a guest to see features supported by the
host CPU which the host OS has disabled, but I imagine those are
unusual - it doesn't seem very likely.  Can you give an example?

They can be enabled explicitly by the -cpuid flag if needed.

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: expose host CPU features to guests

2007-09-07 Thread Jamie Lokier
Anthony Liguori wrote:
> I like this idea but I have some suggestions about the general approach.
> I think instead of defining another machine type, it would be better to
> just have a command line option like -cpuid that took a comma separated
> string of features with "all" meaning all features that the host has.

I like the idea of a flag to enable specific features, but I think
"host" would be a better name for the features of the host.

"all" seems more appropriate to enable all the features the emulator
can support, which can include features which the host does not
support itself.

If it's a comma separated list, it would be good to be able to write
something like this example, which selects all the host features but
then overrides it by disabling the psn feature:

   -cpuid host,-psn
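
A toy parser for such a flag string could work like this; the feature
table here is a two-entry stand-in, not QEMU's real one ("psn" really
is CPUID EDX bit 18, but collapsing all the cpuid words into a single
mask is a simplification):

    #include <string.h>

    struct feat { const char *name; unsigned bit; };
    static const struct feat feats[] = {
        { "psn",  1u << 18 },
        { "sse3", 1u << 0 },
    };

    unsigned parse_cpuid_flags(char *spec, unsigned host_mask)
    {
        unsigned mask = 0;

        for (char *tok = strtok(spec, ","); tok;
             tok = strtok(NULL, ",")) {
            if (strcmp(tok, "host") == 0) {
                mask = host_mask;       /* start from the host's set */
                continue;
            }
            int clear = (tok[0] == '-');
            const char *name = tok + (tok[0] == '-' || tok[0] == '+');
            for (size_t i = 0; i < sizeof feats / sizeof feats[0]; i++) {
                if (strcmp(name, feats[i].name) == 0) {
                    if (clear)
                        mask &= ~feats[i].bit;
                    else
                        mask |= feats[i].bit;
                }
            }
        }
        return mask;
    }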

Is it intended that these flags will also control the actual features
which Qemu allows or emulates, or only what cpuid reports to the guest?

> I also think it would be nicer to use cpuid() directly instead of
> attempting to parse /proc/cpuinfo.

Occasionally the features in /proc/cpuinfo differ from what the cpuid
instruction reports.  The differences are CPU bug workarounds (features
disabled intentionally even though cpuid reports them), CPU features
which aren't properly reported (enabled intentionally in cpuinfo), and
boot flag requests (features disabled due to a request on the boot
command line).

I'm inclined to think the feature list in /proc/cpuinfo is more
appropriate, for choosing the best set of host features to make
available to guests.  It's unlikely that Qemu itself will duplicate
the logic of known workarounds for specific, obscure, buggy host CPUs.

There is also /dev/cpu/%d/cpuinfo (for %d = 0, 1, etc.) on some Linux
distros, I think.
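
Reading the kernel's view of the features is just a matter of finding
the "flags" line; a minimal sketch (first CPU only, no error
reporting):

    #include <stdio.h>
    #include <string.h>

    /* Returns 1 if `feature` appears in the boot CPU's flags line. */
    int host_has_feature(const char *feature)
    {
        char line[4096];
        int found = 0;
        FILE *f = fopen("/proc/cpuinfo", "r");

        if (!f)
            return 0;
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "flags", 5) != 0)
                continue;
            /* Tokenise the space-separated feature names. */
            for (char *tok = strtok(line, " \t\n:"); tok;
                 tok = strtok(NULL, " \t\n")) {
                if (strcmp(tok, feature) == 0) {
                    found = 1;
                    break;
                }
            }
            break;      /* the first CPU's flags line is enough */
        }
        fclose(f);
        return found;
    }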

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 0/4] Rework alarm timer infrastrucure - take2

2007-08-19 Thread Jamie Lokier
Avi Kivity wrote:
> > > > In this case the dyn-tick minimum res will be 1msec. I believe it should
> > > > work ok since this is the case without any dyn-tick.
> > >
> > > Actually minimum resolution depends on host HZ setting, but - yes -
> > > essentially you have the same behaviour of the unix timer, plus the
> > > overhead of reprogramming the timer.
> >
> > Is this significant?  At a high guest HZ, this could be quite a lot
> > of additional syscalls right?
>
> At HZ=1000, this adds a small multiple of 1000 syscalls, which is a
> fairly small overhead.

Small, but maybe measurable.

That overhead could be removed if the dyn-tick code were to
predictively set the host timer into a repeating mode when guests do
actually require a regular tick.

I'm thinking when it detects that it needed a tick a small number of
times in a row, with the same interval, it could set the host timer to
trigger repeatedly at that interval.  Then it wouldn't need to
reprogram the timer if that stayed the same (except maybe to correct
for drift?).

If a tick then _wasn't_ required at that interval (i.e. it was
required sooner, later, or not at all), then it would have to
reprogram the host timer.  If it really mattered, it wouldn't have to
reprogram the host timer when the next required tick is further in the
future or not needed at all; there would simply be one redundant
SIGALRM.  In weird cases avoiding that is worthwhile, but I suspect it
generally isn't.
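
With POSIX timers the switch between the two modes is just whether
it_interval is zero.  A sketch of the predictive re-arm, assuming
host_timer was created elsewhere with timer_create() and that the
interval is under one second; the threshold is a guess:

    #include <time.h>

    #define REPEAT_THRESHOLD 3

    extern timer_t host_timer;     /* created with timer_create() */
    static long last_ns;
    static int repeat_count;

    void arm_tick(long interval_ns)
    {
        struct itimerspec its = {
            .it_value = { 0, interval_ns },   /* first expiry */
        };

        repeat_count = (interval_ns == last_ns) ? repeat_count + 1 : 0;
        last_ns = interval_ns;

        if (repeat_count >= REPEAT_THRESHOLD) {
            /* Guest looks periodic: make the timer repeating so the
             * kernel re-arms it without a syscall per tick. */
            its.it_interval.tv_nsec = interval_ns;
        }
        /* it_interval == 0 means one-shot. */
        timer_settime(host_timer, 0, &its, NULL);
    }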

-- Jamie
