Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-22 Thread Chris Webb
Chris Webb ch...@arachsys.com writes:

 Okay. What I was driving at in describing these systems as 'already broken'
 is that they will already lose data (in this sense) if they're run on bare
 metal with normal commodity SATA disks with their 32MB write caches on. That
 configuration surely describes the vast majority of PC-class desktops and
 servers!
 
 If I understand correctly, your point here is that the small cache on a real
 SATA drive gives a relatively small time window for data loss, whereas the
 worry with cache=writeback is that the host page cache can be gigabytes, so
 the time window for unsynced data to be lost is potentially enormous.
 
 Isn't the fix for that just forcing periodic sync on the host to bound-above
 the time window for unsynced data loss in the guest?

For the benefit of the archives, it turns out the simplest fix for this is
already implemented as a vm sysctl in linux. Set vm.dirty_bytes to 32<<20
(33554432), and the dirty page cache is bounded above by 32MB, so we are
simulating exactly the case of a SATA drive with a 32MB writeback cache.

Unless I'm missing something, the risk to guest OSes in this configuration
should therefore be exactly the same as the risk from running on normal
commodity hardware with such drives and no expensive battery-backed RAM.
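
A minimal sketch of setting that limit from a program rather than via
sysctl (assumes root and a mounted /proc; equivalent to running
'sysctl -w vm.dirty_bytes=33554432'):

    /* Cap the host's dirty page cache at 32MB by writing to
     * /proc/sys/vm/dirty_bytes, the file behind the vm.dirty_bytes sysctl. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/dirty_bytes", "w");
        if (!f) {
            perror("dirty_bytes");
            return 1;
        }
        fprintf(f, "%d\n", 32 << 20);   /* 33554432 bytes = 32MB */
        return fclose(f) ? 1 : 0;
    }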

Cheers,

Chris.


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-22 Thread Avi Kivity

On 03/22/2010 11:04 PM, Chris Webb wrote:

Chris Webb <ch...@arachsys.com> writes:

   

Okay. What I was driving at in describing these systems as 'already broken'
is that they will already lose data (in this sense) if they're run on bare
metal with normal commodity SATA disks with their 32MB write caches on. That
configuration surely describes the vast majority of PC-class desktops and
servers!

If I understand correctly, your point here is that the small cache on a real
SATA drive gives a relatively small time window for data loss, whereas the
worry with cache=writeback is that the host page cache can be gigabytes, so
the time window for unsynced data to be lost is potentially enormous.

Isn't the fix for that just forcing periodic sync on the host to bound-above
the time window for unsynced data loss in the guest?
 

For the benefit of the archives, it turns out the simplest fix for this is
already implemented as a vm sysctl in linux. Set vm.dirty_bytes to 32<<20
(33554432), and the dirty page cache is bounded above by 32MB, so we are
simulating exactly the case of a SATA drive with a 32MB writeback cache.

Unless I'm missing something, the risk to guest OSes in this configuration
should therefore be exactly the same as the risk from running on normal
commodity hardware with such drives and no expensive battery-backed RAM.
   


A host crash will destroy your data.  If your machine is connected to a
UPS, only a firmware crash can destroy your data.




Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-22 Thread Chris Webb
Avi Kivity a...@redhat.com writes:

 On 03/22/2010 11:04 PM, Chris Webb wrote:

 Unless I'm missing something, the risk to guest OSes in this configuration
 should therefore be exactly the same as the risk from running on normal
 commodity hardware with such drives and no expensive battery-backed RAM.
 
 A host crash will destroy your data.  If  your machine is connected
 to a UPS, only a firmware crash can destroy your data.

Yes, that's a good point: in this configuration a host crash is equivalent
to a power failure rather than an OS crash in terms of data loss.

Cheers,

Chris.


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-19 Thread Dave Hansen
On Tue, 2010-03-16 at 11:05 +0200, Avi Kivity wrote:
  Not really.  In many cloud environments, there's a set of common 
  images that are instantiated on each node.  Usually this is because 
  you're running a horizontally scalable application or because you're 
  supporting an ephemeral storage model.
 
 But will these servers actually benefit from shared cache?  So the 
 images are shared, they boot up, what then?
 
 - apache really won't like serving static files from the host pagecache
 - dynamic content (java, cgi) will be mostly in anonymous memory, not 
 pagecache
 - ditto for application servers
 - what else are people doing?

Think of an OpenVZ-style model where you're renting out a bunch of
relatively tiny VMs and they're getting used pretty sporadically.  They
either have relatively little memory, or they've been ballooned down to
a pretty small footprint.

The more you shrink them down, the more similar they become.  You'll end
up having things like init, cron, apache, bash and libc start to
dominate the memory footprint in the VM.

That's *certainly* a case where this makes a lot of sense.

-- Dave



Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Christoph Hellwig
On Tue, Mar 16, 2010 at 01:08:28PM +0200, Avi Kivity wrote:
 If the batch size is larger than the virtio queue size, or if there are 
 no flushes at all, then yes the huge write cache gives more opportunity 
 for reordering.  But we're already talking hundreds of requests here.

Yes.  And remember those don't have to come from the same host.  Also
remember that we rather limit excessive reordering of O_DIRECT requests
in the I/O scheduler because they are synchronous-type I/O, while
we don't do that for pagecache writeback.

And we don't have unlimited virtio queue size, in fact it's quite
limited.


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Avi Kivity

On 03/17/2010 10:49 AM, Christoph Hellwig wrote:

On Tue, Mar 16, 2010 at 01:08:28PM +0200, Avi Kivity wrote:
   

If the batch size is larger than the virtio queue size, or if there are
no flushes at all, then yes the huge write cache gives more opportunity
for reordering.  But we're already talking hundreds of requests here.
 

Yes.  And remember those don't have to come from the same host.  Also
remember that we rather limit excessive reordering of O_DIRECT requests
in the I/O scheduler because they are synchronous-type I/O, while
we don't do that for pagecache writeback.
   


Maybe we should relax that for kvm.  Perhaps some of the problem comes 
from the fact that we call io_submit() once per request.



And we don't have unlimited virtio queue size, in fact it's quite
limited.
   


That can be extended easily if it fixes the problem.



Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Chris Webb
Anthony Liguori anth...@codemonkey.ws writes:

 This really gets down to your definition of safe behaviour.  As it
 stands, if you suffer a power outage, it may lead to guest
 corruption.
 
 While we are correct in advertising a write-cache, write-caches are
 volatile and should a drive lose power, it could lead to data
 corruption.  Enterprise disks tend to have battery backed write
 caches to prevent this.
 
 In the set up you're emulating, the host is acting as a giant write
 cache.  Should your host fail, you can get data corruption.

Hi Anthony. I suspected my post might spark an interesting discussion!

Before considering anything like this, we did quite a bit of testing with
OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool
power off to kill the host. I didn't manage to corrupt any ext3, ext4 or
NTFS filesystems despite these efforts.

Is your claim here that:-

  (a) qemu doesn't emulate a disk write cache correctly; or

  (b) operating systems are inherently unsafe running on top of a disk with
  a write-cache; or

  (c) installations that are already broken and lose data with a physical
  drive with a write-cache can lose much more in this case because the
  write cache is much bigger?

Following Christoph Hellwig's patch series from last September, I'm pretty
convinced that (a) isn't true, apart from the inability to disable the
write-cache at run-time, which is something that neither recent linux nor
windows seems to want to do out of the box.

Given that modern SATA drives come with fairly substantial write-caches
nowadays which operating systems leave on without widespread disaster, I
don't really believe in (b) either, at least for the ide and scsi case.
Filesystems know they have to flush the disk cache to avoid corruption.
(Virtio makes the write cache invisible to the OS except in linux 2.6.32+ so
I know virtio-blk has to be avoided for current windows and obsolete linux
when writeback caching is on.)

I can certainly imagine (c) might be the case, although when I use strace to
watch the IO to the block device, I see pretty regular fdatasyncs being
issued by the guests, interleaved with the writes, so I'm not sure how
likely the problem would be in practice. Perhaps my test guests were
unrepresentatively well-behaved.

However, the potentially unlimited time-window for loss of incorrectly
unsynced data is also something one could imagine fixing at the qemu level.
Perhaps I should be implementing something like
cache=writeback,flushtimeout=N which, upon a write being issued to the block
device, starts an N second timer if it isn't already running. The timer is
destroyed on flush, and if it expires before it's destroyed, a gratuitous
flush is sent. Do you think this is worth doing? Just a simple 'while sleep
10; do sync; done' on the host even!
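
For the host-wide variant, a minimal sketch is just that loop around
sync(); the 10-second interval below is an arbitrary choice, not a
recommendation:

    /* Bound the age of unsynced guest data by forcing a host-wide sync()
     * every N seconds; sync() asks the kernel to write out all dirty
     * pagecache to the underlying block devices. */
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            sleep(10);   /* N = 10 seconds, arbitrary */
            sync();
        }
    }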

We've used cache=none and cache=writethrough, and whilst performance is fine
with a single guest accessing a disk, when we chop the disks up with LVM and
run even a small handful of guests, the constant seeking to serve tiny
synchronous IOs leads to truly abysmal throughput---we've seen less than
700kB/s streaming write rates within guests when the backing store is
capable of 100MB/s.

With cache=writeback, there's still IO contention between guests, but the
write granularity is a bit coarser, so the host's elevator seems to get a
bit more of a chance to help us out and we can at least squeeze out 5-10MB/s
from two or three concurrently running guests, getting a total of 20-30% of
the performance of the underlying block device rather than a total of around
5%.

Cheers,

Chris.


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Chris Webb
Avi Kivity a...@redhat.com writes:

 On 03/15/2010 10:23 PM, Chris Webb wrote:

 Wasteful duplication of page cache between guest and host notwithstanding,
 turning on cache=writeback is a spectacular performance win for our guests.
 
 Is this with qcow2, raw file, or direct volume access?

This is with direct access to logical volumes. No file systems or qcow2 in
the stack. Our typical host has a couple of SATA disks, combined in md
RAID1, chopped up into volumes with LVM2 (really just dm linear targets).
The performance measured outside qemu is excellent, inside qemu-kvm is fine
too until multiple guests are trying to access their drives at once, but
then everything starts to grind badly.

 I can understand it for qcow2, but for direct volume access this
 shouldn't happen.  The guest schedules as many writes as it can,
 followed by a sync.  The host (and disk) can then reschedule them
 whether they are in the writeback cache or in the block layer, and
 must sync in the same way once completed.

I don't really understand what's going on here, but I wonder if the
underlying problem might be that all the O_DIRECT/O_SYNC writes from the
guests go down into the same block device at the bottom of the device mapper
stack, and thus can't be reordered with respect to one another. For our
purposes,

  Guest A   Guest B  |  Guest A   Guest B  |  Guest A   Guest B
  write A1           |  write A1           |            write B1
            write B1 |            write A2 |  write A1
  write A2           |            write B1 |            write A2

are all equivalent, but the system isn't allowed to reorder in this way
because there isn't a separate request queue for each logical volume, just
the one at the bottom. (I don't know whether nested request queues would
behave remotely reasonably either, though!)

Also, if my guest kernel issues (say) three small writes, one at the start
of the disk, one in the middle, one at the end, and then does a flush, can
virtio really express this as one non-contiguous O_DIRECT write (the three
components of which can be reordered by the elevator with respect to one
another) rather than three distinct O_DIRECT writes which can't be permuted?
Can qemu issue a write like that? cache=writeback + flush allows this to be
optimised by the block layer in the normal way.

Cheers,

Chris.


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Anthony Liguori

On 03/17/2010 10:14 AM, Chris Webb wrote:

Anthony Liguori <anth...@codemonkey.ws> writes:

   

This really gets down to your definition of safe behaviour.  As it
stands, if you suffer a power outage, it may lead to guest
corruption.

While we are correct in advertising a write-cache, write-caches are
volatile and should a drive lose power, it could lead to data
corruption.  Enterprise disks tend to have battery backed write
caches to prevent this.

In the set up you're emulating, the host is acting as a giant write
cache.  Should your host fail, you can get data corruption.
 

Hi Anthony. I suspected my post might spark an interesting discussion!

Before considering anything like this, we did quite a bit of testing with
OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool
power off to kill the host. I didn't manage to corrupt any ext3, ext4 or
NTFS filesystems despite these efforts.

Is your claim here that:-

   (a) qemu doesn't emulate a disk write cache correctly; or

   (b) operating systems are inherently unsafe running on top of a disk with
   a write-cache; or

   (c) installations that are already broken and lose data with a physical
   drive with a write-cache can lose much more in this case because the
   write cache is much bigger?
   


This is the closest to the most accurate.

It basically boils down to this: most enterprises use disks with
battery-backed write caches.  Having the host act as a giant write cache
means that you can lose data.


I agree that a well-behaved file system will not become corrupt, but my
contention is that for many types of applications, data loss ==
corruption, and not all file systems are well behaved.  And it's
certainly valid to argue about whether common filesystems are broken,
but from a purely pragmatic perspective, this is going to be the case.


Regards,

Anthony Liguori


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Avi Kivity

On 03/17/2010 05:24 PM, Chris Webb wrote:

Avi Kivity <a...@redhat.com> writes:

   

On 03/15/2010 10:23 PM, Chris Webb wrote:

 

Wasteful duplication of page cache between guest and host notwithstanding,
turning on cache=writeback is a spectacular performance win for our guests.
   

Is this with qcow2, raw file, or direct volume access?
 

This is with direct access to logical volumes. No file systems or qcow2 in
the stack. Our typical host has a couple of SATA disks, combined in md
RAID1, chopped up into volumes with LVM2 (really just dm linear targets).
The performance measured outside qemu is excellent, inside qemu-kvm is fine
too until multiple guests are trying to access their drives at once, but
then everything starts to grind badly.

   


OK.


I can understand it for qcow2, but for direct volume access this
shouldn't happen.  The guest schedules as many writes as it can,
followed by a sync.  The host (and disk) can then reschedule them
whether they are in the writeback cache or in the block layer, and
must sync in the same way once completed.
 

I don't really understand what's going on here, but I wonder if the
underlying problem might be that all the O_DIRECT/O_SYNC writes from the
guests go down into the same block device at the bottom of the device mapper
stack, and thus can't be reordered with respect to one another.


They should be reorderable.  Otherwise host filesystems on several 
volumes would suffer the same problems.


Whether the filesystem is in the host or guest shouldn't matter.


For our
purposes,

   Guest A   Guest B  |  Guest A   Guest B  |  Guest A   Guest B
   write A1           |  write A1           |            write B1
             write B1 |            write A2 |  write A1
   write A2           |            write B1 |            write A2

are all equivalent, but the system isn't allowed to reorder in this way
because there isn't a separate request queue for each logical volume, just
the one at the bottom. (I don't know whether nested request queues would
behave remotely reasonably either, though!)

Also, if my guest kernel issues (say) three small writes, one at the start
of the disk, one in the middle, one at the end, and then does a flush, can
virtio really express this as one non-contiguous O_DIRECT write (the three
components of which can be reordered by the elevator with respect to one
another) rather than three distinct O_DIRECT writes which can't be permuted?
Can qemu issue a write like that? cache=writeback + flush allows this to be
optimised by the block layer in the normal way.
   


Guest side virtio will send this as three requests followed by a flush.  
Qemu will issue these as three distinct requests and then flush.  The 
requests are marked, as Christoph says, in a way that limits their 
reorderability, and perhaps if we fix these two problems performance 
will improve.


Something that comes to mind is merging of flush requests.  If N guests 
issue one write and one flush each, we should issue N writes and just 
one flush - a flush for the disk applies to all volumes on that disk.




Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Balbir Singh
* Anthony Liguori anth...@codemonkey.ws [2010-03-17 10:55:47]:

 On 03/17/2010 10:14 AM, Chris Webb wrote:
 Anthony Liguori <anth...@codemonkey.ws> writes:
 
 This really gets down to your definition of safe behaviour.  As it
 stands, if you suffer a power outage, it may lead to guest
 corruption.
 
 While we are correct in advertising a write-cache, write-caches are
 volatile and should a drive lose power, it could lead to data
 corruption.  Enterprise disks tend to have battery backed write
 caches to prevent this.
 
 In the set up you're emulating, the host is acting as a giant write
 cache.  Should your host fail, you can get data corruption.
 Hi Anthony. I suspected my post might spark an interesting discussion!
 
 Before considering anything like this, we did quite a bit of testing with
 OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool
 power off to kill the host. I didn't manage to corrupt any ext3, ext4 or
 NTFS filesystems despite these efforts.
 
 Is your claim here that:-
 
(a) qemu doesn't emulate a disk write cache correctly; or
 
(b) operating systems are inherently unsafe running on top of a disk with
a write-cache; or
 
(c) installations that are already broken and lose data with a physical
drive with a write-cache can lose much more in this case because the
write cache is much bigger?
 
 This is the closest to the most accurate.
 
 It basically boils down to this: most enterprises use disks with
 battery-backed write caches.  Having the host act as a giant write
 cache means that you can lose data.
 

Dirty limits can help control how much we lose, but also affect how
much we write out.

 I agree that a well-behaved file system will not become corrupt, but
 my contention is that for many types of applications, data loss ==
 corruption, and not all file systems are well behaved.  And it's
 certainly valid to argue about whether common filesystems are
 broken, but from a purely pragmatic perspective, this is going to
 be the case.


I think it is a trade-off for end users to decide on. cache=writeback
does provide performance benefits, but can cause data loss.


-- 
Three Cheers,
Balbir


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Chris Webb
Anthony Liguori anth...@codemonkey.ws writes:

 On 03/17/2010 10:14 AM, Chris Webb wrote:
(c) installations that are already broken and lose data with a physical
drive with a write-cache can lose much more in this case because the
write cache is much bigger?
 
 This is the closest to the most accurate.
 
 It basically boils down to this: most enterprises use disks with
 battery-backed write caches.  Having the host act as a giant write
 cache means that you can lose data.
 
 I agree that a well-behaved file system will not become corrupt, but
 my contention is that for many types of applications, data loss ==
 corruption, and not all file systems are well behaved.  And it's
 certainly valid to argue about whether common filesystems are
 broken, but from a purely pragmatic perspective, this is going to
 be the case.

Okay. What I was driving at in describing these systems as 'already broken'
is that they will already lose data (in this sense) if they're run on bare
metal with normal commodity SATA disks with their 32MB write caches on. That
configuration surely describes the vast majority of PC-class desktops and
servers!

If I understand correctly, your point here is that the small cache on a real
SATA drive gives a relatively small time window for data loss, whereas the
worry with cache=writeback is that the host page cache can be gigabytes, so
the time window for unsynced data to be lost is potentially enormous.

Isn't the fix for that just forcing periodic sync on the host to bound-above
the time window for unsynced data loss in the guest?

Cheers,

Chris.


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Avi Kivity

On 03/17/2010 06:22 PM, Avi Kivity wrote:
Also, if my guest kernel issues (say) three small writes, one at the start
of the disk, one in the middle, one at the end, and then does a flush, can
virtio really express this as one non-contiguous O_DIRECT write (the three
components of which can be reordered by the elevator with respect to one
another) rather than three distinct O_DIRECT writes which can't be permuted?
Can qemu issue a write like that? cache=writeback + flush allows this to be
optimised by the block layer in the normal way.



Guest side virtio will send this as three requests followed by a 
flush.  Qemu will issue these as three distinct requests and then 
flush.  The requests are marked, as Christoph says, in a way that 
limits their reorderability, and perhaps if we fix these two problems 
performance will improve.


Something that comes to mind is merging of flush requests.  If N 
guests issue one write and one flush each, we should issue N writes 
and just one flush - a flush for the disk applies to all volumes on 
that disk.




Chris, can you carry out an experiment?  Write a program that pwrite()s 
a byte to a file at the same location repeatedly, with the file opened 
using O_SYNC.  Measure the write rate, and run blktrace on the host to 
see what the disk (/dev/sda, not the volume) sees.  Should be a (write, 
flush, write, flush) per pwrite pattern or similar (for writing the data 
and a journal block, perhaps even three writes will be needed).


Then scale this across multiple guests, measure and trace again.  If 
we're lucky, the flushes will be coalesced, if not, we need to work on it.
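
A sketch of such a test program follows; the file name, iteration count
and blktrace invocation are arbitrary choices for illustration:

    /* Repeatedly pwrite() one byte at offset 0 of an O_SYNC file and report
     * the achieved rate.  Run "blktrace -d /dev/sda" alongside it on the
     * host to see the write/flush pattern the disk receives. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "testfile";
        int iters = 1000;
        int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            char b = 'x';
            if (pwrite(fd, &b, 1, 0) != 1) {
                perror("pwrite");
                return 1;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d O_SYNC pwrites in %.2fs (%.1f writes/s)\n",
               iters, secs, iters / secs);
        close(fd);
        return 0;
    }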




Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Chris Webb
Avi Kivity a...@redhat.com writes:

 Chris, can you carry out an experiment?  Write a program that
 pwrite()s a byte to a file at the same location repeatedly, with the
 file opened using O_SYNC.  Measure the write rate, and run blktrace
 on the host to see what the disk (/dev/sda, not the volume) sees.
 Should be a (write, flush, write, flush) per pwrite pattern or
 similar (for writing the data and a journal block, perhaps even
 three writes will be needed).
 
 Then scale this across multiple guests, measure and trace again.  If
 we're lucky, the flushes will be coalesced, if not, we need to work
 on it.

Sure, sounds like an excellent plan. I don't have a test machine at the
moment as the last host I was using for this has gone into production, but
I'm due to get another one to install later today or first thing tomorrow
which would be ideal for doing this. I'll follow up with the results once I
have them.

Cheers,

Chris.


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Christoph Hellwig
On Wed, Mar 17, 2010 at 06:22:29PM +0200, Avi Kivity wrote:
 They should be reorderable.  Otherwise host filesystems on several 
 volumes would suffer the same problems.

They are reorderable, just not as extremely as the page cache.
Remember that the request queue really is just a relatively small queue
of outstanding I/O, and that is absolutely intentional.  Large scale
_caching_ is done by the VM in the pagecache, with all the usual aging,
pressure, etc. algorithms applied to it.  The block devices have a
relatively small fixed-size request queue associated with them to
facilitate request merging and limited reordering, and to have fully
set up I/O requests for the device.



Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Avi Kivity

On 03/17/2010 06:47 PM, Chris Webb wrote:

Avi Kivity <a...@redhat.com> writes:

   

Chris, can you carry out an experiment?  Write a program that
pwrite()s a byte to a file at the same location repeatedly, with the
file opened using O_SYNC.  Measure the write rate, and run blktrace
on the host to see what the disk (/dev/sda, not the volume) sees.
Should be a (write, flush, write, flush) per pwrite pattern or
similar (for writing the data and a journal block, perhaps even
three writes will be needed).

Then scale this across multiple guests, measure and trace again.  If
we're lucky, the flushes will be coalesced, if not, we need to work
on it.
 

Sure, sounds like an excellent plan. I don't have a test machine at the
moment as the last host I was using for this has gone into production, but
I'm due to get another one to install later today or first thing tomorrow
which would be ideal for doing this. I'll follow up with the results once I
have them.
   


Meanwhile I looked at the code, and it looks bad.  There is an
IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before
issuing it.  In any case, qemu doesn't use it as far as I could tell,
and even if it did, device-mapper doesn't implement the needed
->aio_fsync() operation.


So, there's a lot of plumbing needed before we can get cache flushes
merged into each other.  Given cache=writeback does allow merging, I
think we explained part of the problem at least.




Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Christoph Hellwig
On Wed, Mar 17, 2010 at 06:40:30PM +0200, Avi Kivity wrote:
 Chris, can you carry out an experiment?  Write a program that pwrite()s 
 a byte to a file at the same location repeatedly, with the file opened 
 using O_SYNC.  Measure the write rate, and run blktrace on the host to 
 see what the disk (/dev/sda, not the volume) sees.  Should be a (write, 
 flush, write, flush) per pwrite pattern or similar (for writing the data 
 and a journal block, perhaps even three writes will be needed).
 
 Then scale this across multiple guests, measure and trace again.  If 
 we're lucky, the flushes will be coalesced, if not, we need to work on it.

As the person who has written quite a bit of the current O_SYNC
implementation and also reviewed the rest of it, I can tell you that
those flushes won't be coalesced.  If we always rewrite the same block
we do the cache flush from the fsync method and there is nothing to
coalesce it with there.  If you actually do modify metadata (e.g. by
using the new real O_SYNC that I introduced in 2.6.33, but which isn't
picked up by userspace yet, instead of the old one that was always
O_DSYNC) you might hit a very limited transaction merging window in some
filesystems, but it's generally very small for a good reason.  If it
were too large we'd make one process wait for I/O in another just
because we might expect transactions to coalesce later.  There's been
some long discussion about fsync transaction batching tuning
for ext3 a while ago.



Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Christoph Hellwig
On Wed, Mar 17, 2010 at 06:53:34PM +0200, Avi Kivity wrote:
 Meanwhile I looked at the code, and it looks bad.  There is an 
 IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before 
 issuing it.  In any case, qemu doesn't use it as far as I could tell, 
 and even if it did, device-mapper doesn't implement the needed
 ->aio_fsync() operation.

No one implements it, and all surrounding code is dead wood.  It would
require us to do asynchronous pagecache operations, which involve
major surgery of the VM code.  Patches to do this were rejected multiple
times.



Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Avi Kivity

On 03/17/2010 06:58 PM, Christoph Hellwig wrote:

On Wed, Mar 17, 2010 at 06:53:34PM +0200, Avi Kivity wrote:
   

Meanwhile I looked at the code, and it looks bad.  There is an
IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before
issuing it.  In any case, qemu doesn't use it as far as I could tell,
and even if it did, device-mapper doesn't implement the needed
->aio_fsync() operation.
 

No one implements it, and all surrounding code is dead wood.  It would
require us to do asynchronous pagecache operations, which involve
major surgery of the VM code.  Patches to do this were rejected multiple
times.
   


Pity.  What about the O_DIRECT aio case?  It's ridiculous that you can 
submit async write requests but have to wait synchronously for them to 
actually hit the disk if you have a write cache.




Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Avi Kivity

On 03/17/2010 06:52 PM, Christoph Hellwig wrote:

On Wed, Mar 17, 2010 at 06:22:29PM +0200, Avi Kivity wrote:
   

They should be reorderable.  Otherwise host filesystems on several
volumes would suffer the same problems.
 

They are reorderable, just not as extremely as the page cache.
Remember that the request queue really is just a relatively small queue
of outstanding I/O, and that is absolutely intentional.  Large scale
_caching_ is done by the VM in the pagecache, with all the usual aging,
pressure, etc. algorithms applied to it.


We already have the large scale caching and stuff running in the guest.  
We have a stream of optimized requests coming out of guests; running the
same algorithm again shouldn't improve things.  The host has an 
opportunity to do inter-guest optimization, but given each guest has its 
own disk area, I don't see how any reordering or merging could help here 
(beyond sorting guests according to disk order).



The block devices have a
relatively small fixed-size request queue associated with them to
facilitate request merging and limited reordering, and to have fully
set up I/O requests for the device.
   


We should enlarge the queues, increase request reorderability, and merge 
flushes (delay flushes until after unrelated writes, then adjacent 
flushes can be collapsed).


Collapsing flushes should get us better than linear scaling (since we 
collapse N writes + M flushes into N writes and 1 flush).  However the
writes themselves scale worse than linearly, since they now span a 
larger disk space and cause higher seek penalties.




Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Avi Kivity

On 03/17/2010 06:57 PM, Christoph Hellwig wrote:

On Wed, Mar 17, 2010 at 06:40:30PM +0200, Avi Kivity wrote:
   

Chris, can you carry out an experiment?  Write a program that pwrite()s
a byte to a file at the same location repeatedly, with the file opened
using O_SYNC.  Measure the write rate, and run blktrace on the host to
see what the disk (/dev/sda, not the volume) sees.  Should be a (write,
flush, write, flush) per pwrite pattern or similar (for writing the data
and a journal block, perhaps even three writes will be needed).

Then scale this across multiple guests, measure and trace again.  If
we're lucky, the flushes will be coalesced, if not, we need to work on it.
 

As the person who has written quite a bit of the current O_SYNC
implementation and also reviewed the rest of it, I can tell you that
those flushes won't be coalesced.  If we always rewrite the same block
we do the cache flush from the fsync method and there is nothing to
coalesce it with there.  If you actually do modify metadata (e.g. by
using the new real O_SYNC that I introduced in 2.6.33, but which isn't
picked up by userspace yet, instead of the old one that was always
O_DSYNC) you might hit a very limited transaction merging window in some
filesystems, but it's generally very small for a good reason.  If it
were too large we'd make one process wait for I/O in another just
because we might expect transactions to coalesce later.  There's been
some long discussion about fsync transaction batching tuning
for ext3 a while ago.
   


I definitely don't expect flush merging for a single guest, but for 
multiple guests there is certainly an opportunity for merging.  Most 
likely we don't take advantage of it and that's one of the problems.  
Copying data into pagecache so that we can merge the flushes seems like 
a very unsatisfactory implementation.







Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Vivek Goyal
On Wed, Mar 17, 2010 at 03:14:10PM +, Chris Webb wrote:
 Anthony Liguori anth...@codemonkey.ws writes:
 
  This really gets down to your definition of safe behaviour.  As it
  stands, if you suffer a power outage, it may lead to guest
  corruption.
  
  While we are correct in advertising a write-cache, write-caches are
  volatile and should a drive lose power, it could lead to data
  corruption.  Enterprise disks tend to have battery backed write
  caches to prevent this.
  
  In the set up you're emulating, the host is acting as a giant write
  cache.  Should your host fail, you can get data corruption.
 
 Hi Anthony. I suspected my post might spark an interesting discussion!
 
 Before considering anything like this, we did quite a bit of testing with
 OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool
 power off to kill the host. I didn't manage to corrupt any ext3, ext4 or
 NTFS filesystems despite these efforts.
 
 Is your claim here that:-
 
   (a) qemu doesn't emulate a disk write cache correctly; or
 
   (b) operating systems are inherently unsafe running on top of a disk with
   a write-cache; or
 
   (c) installations that are already broken and lose data with a physical
   drive with a write-cache can lose much more in this case because the
   write cache is much bigger?
 
 Following Christoph Hellwig's patch series from last September, I'm pretty
 convinced that (a) isn't true apart from the inability to disable the
 write-cache at run-time, which is something that neither recent linux nor
 windows seem to want to do out-of-the box.
 
 Given that modern SATA drives come with fairly substantial write-caches
 nowadays which operating systems leave on without widespread disaster, I
 don't really believe in (b) either, at least for the ide and scsi case.
 Filesystems know they have to flush the disk cache to avoid corruption.
 (Virtio makes the write cache invisible to the OS except in linux 2.6.32+ so
 I know virtio-blk has to be avoided for current windows and obsolete linux
 when writeback caching is on.)
 
 I can certainly imagine (c) might be the case, although when I use strace to
 watch the IO to the block device, I see pretty regular fdatasyncs being
 issued by the guests, interleaved with the writes, so I'm not sure how
 likely the problem would be in practice. Perhaps my test guests were
 unrepresentatively well-behaved.
 
 However, the potentially unlimited time-window for loss of incorrectly
 unsynced data is also something one could imagine fixing at the qemu level.
 Perhaps I should be implementing something like
 cache=writeback,flushtimeout=N which, upon a write being issued to the block
 device, starts an N second timer if it isn't already running. The timer is
 destroyed on flush, and if it expires before it's destroyed, a gratuitous
 flush is sent. Do you think this is worth doing? Just a simple 'while sleep
 10; do sync; done' on the host even!
 
 We've used cache=none and cache=writethrough, and whilst performance is fine
 with a single guest accessing a disk, when we chop the disks up with LVM and
 run even a small handful of guests, the constant seeking to serve tiny
 synchronous IOs leads to truly abysmal throughput---we've seen less than
 700kB/s streaming write rates within guests when the backing store is
 capable of 100MB/s.
 
 With cache=writeback, there's still IO contention between guests, but the
 write granularity is a bit coarser, so the host's elevator seems to get a
 bit more of a chance to help us out and we can at least squeeze out 5-10MB/s
 from two or three concurrently running guests, getting a total of 20-30% of
 the performance of the underlying block device rather than a total of around
 5%.

Hi Chris,

Are you using CFQ in the host? What is the host kernel version? I am not sure
what is the problem here but you might want to play with IO controller and put
these guests in individual cgroups and see if you get better throughput even
with cache=writethrough.

If the problem is that sync writes from different guests get intermixed,
resulting in more seeks, the IO controller might help, as these writes will
now go on different group service trees and in CFQ we try to service
requests from one service tree at a time for a period before we switch the
service tree.

The issue will be that all the logic is in CFQ and it works at the leaf nodes
of the storage stack and not at the LVM nodes. So first you might want to try
it with a single partitioned disk. If it helps, then it might help with the
LVM configuration also (IO control working at leaf nodes).

Thanks
Vivek


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-17 Thread Chris Webb
Vivek Goyal vgo...@redhat.com writes:

 Are you using CFQ in the host? What is the host kernel version? I am not sure
 what is the problem here but you might want to play with IO controller and put
 these guests in individual cgroups and see if you get better throughput even
 with cache=writethrough.

Hi. We're using the deadline IO scheduler on 2.6.32.7. We got better
performance from deadline than from cfq when we last tested, which was
admittedly around the 2.6.30 timescale so is now a rather outdated
measurement.

 If the problem is that if sync writes from different guests get intermixed
 resulting in more seeks, IO controller might help as these writes will now
 go on different group service trees and in CFQ, we try to service requests
 from one service tree at a time for a period before we switch the service
 tree.

Thanks for the suggestion: I'll have a play with this. I currently use
/sys/kernel/uids/N/cpu_share with one UID per guest to divide up the CPU
between guests, but this could just as easily be done with a cgroup per
guest if a side-effect is to provide a hint about IO independence to CFQ.

Best wishes,

Chris.


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Christoph Hellwig
On Mon, Mar 15, 2010 at 08:27:25PM -0500, Anthony Liguori wrote:
 Actually cache=writeback is as safe as any normal host is with a
 volatile disk cache, except that in this case the disk cache is
 actually a lot larger.  With a properly implemented filesystem this
 will never cause corruption.

 Metadata corruption, not necessarily corruption of data stored in a file.

Again, this will not cause metadata corruption either if the filesystem
loses barriers, although we may lose up to the cache size of new (data
or metadata) operations.  The consistency of the filesystem is still
guaranteed.

 Not all software uses fsync as much as they should.  And often times,  
 it's for good reason (like ext3).

If an application needs data on disk it must call fsync, or there
is no guarantee at all, even on ext3.  And with growing disk caches
these issues show up on normal disks often enough that people have
realized it by now.
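
A trivial illustration of that contract (hypothetical file name, nothing
qemu-specific):

    /* Data is only guaranteed to be on stable storage once fsync() (or
     * fdatasync()) has returned successfully; write() alone is not enough. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("important.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        const char *rec = "record\n";
        if (write(fd, rec, strlen(rec)) != (ssize_t)strlen(rec)) {
            perror("write");
            return 1;
        }
        if (fsync(fd) < 0) {        /* only now is the record durable */
            perror("fsync");
            return 1;
        }
        close(fd);
        return 0;
    }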


 IIUC, an O_DIRECT write using cache=writeback is not actually on the  
 spindle when the write() completes.  Rather, an explicit fsync() would  
 be required.  That will cause data corruption in many applications (like  
 databases) regardless of whether the fs gets metadata corruption.

It isn't for plain O_DIRECT without qemu involved, either.  The O_DIRECT
write goes through the disk cache and requires an explicit fsync or the
O_SYNC open flag to make sure it goes to disk.



Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Avi Kivity

On 03/15/2010 08:48 PM, Anthony Liguori wrote:

On 03/15/2010 04:27 AM, Avi Kivity wrote:


That's only beneficial if the cache is shared.  Otherwise, you could 
use the balloon to evict cache when memory is tight.


Shared cache is mostly a desktop thing where users run similar 
workloads.  For servers, it's much less likely.  So a modified-guest 
doesn't help a lot here.


Not really.  In many cloud environments, there's a set of common 
images that are instantiated on each node.  Usually this is because 
you're running a horizontally scalable application or because you're 
supporting an ephemeral storage model.


But will these servers actually benefit from shared cache?  So the 
images are shared, they boot up, what then?


- apache really won't like serving static files from the host pagecache
- dynamic content (java, cgi) will be mostly in anonymous memory, not 
pagecache

- ditto for application servers
- what else are people doing?

In fact, with ephemeral storage, you typically want to use 
cache=writeback since you aren't providing data guarantees across 
shutdown/failure.


Interesting point.

We'd need a cache=volatile for this use case to avoid the fdatasync()s 
we do now.  Also useful for -snapshot.  In fact I have a patch for this 
somewhere I can dig out.




Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Avi Kivity

On 03/15/2010 10:23 PM, Chris Webb wrote:

Avi Kivity <a...@redhat.com> writes:

   

On 03/15/2010 10:07 AM, Balbir Singh wrote:

 

Yes, it is a virtio call away, but is the cost of paying twice in
terms of memory acceptable?
   

Usually, it isn't, which is why I recommend cache=off.
 

Hi Avi. One observation about your recommendation for cache=none:

We run hosts of VMs accessing drives backed by logical volumes carved out
from md RAID1. Each host has 32GB RAM and eight cores, divided between (say)
twenty virtual machines, which pretty much fill the available memory on the
host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback
caching turned on get advertised to the guest as having a write-cache, and
FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback
isn't acting as cache=neverflush like it would have done a year ago. I know
that comparing performance for cache=none against that unsafe behaviour
would be somewhat unfair!)

Wasteful duplication of page cache between guest and host notwithstanding,
turning on cache=writeback is a spectacular performance win for our guests.
For example, even IDE with cache=writeback easily beats virtio with
cache=none in most of the guest filesystem performance tests I've tried. The
anecdotal feedback from clients is also very strongly in favour of
cache=writeback.
   


Is this with qcow2, raw file, or direct volume access?

I can understand it for qcow2, but for direct volume access this 
shouldn't happen.  The guest schedules as many writes as it can, 
followed by a sync.  The host (and disk) can then reschedule them 
whether they are in the writeback cache or in the block layer, and must 
sync in the same way once completed.


Perhaps what we need is bdrv_aio_submit() which can take a number of 
requests.  For direct volume access, this allows easier reordering 
(io_submit() should plug the queues before it starts processing and 
unplug them when done, though I don't see the code for this?).  For 
qcow2, we can coalesce metadata updates for multiple requests into one 
RMW (for example, a sequential write split into multiple 64K-256K write 
requests).


Christoph/Kevin?


With a host full of cache=none guests, IO contention between guests is
hugely problematic with non-stop seek from the disks to service tiny
O_DIRECT writes (especially without virtio), many of which needn't have been
synchronous if only there had been some way for the guest OS to tell qemu
that. Running with cache=writeback seems to reduce the frequency of disk
flush per guest to a much more manageable level, and to allow the host's
elevator to optimise writing out across the guests in between these flushes.
   


The host eventually has to turn the writes into synchronous writes, no 
way around that.




Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Kevin Wolf
On 16.03.2010 10:17, Avi Kivity wrote:
 On 03/15/2010 10:23 PM, Chris Webb wrote:
 Avi Kivity <a...@redhat.com> writes:


 On 03/15/2010 10:07 AM, Balbir Singh wrote:

  
 Yes, it is a virtio call away, but is the cost of paying twice in
 terms of memory acceptable?

 Usually, it isn't, which is why I recommend cache=off.
  
 Hi Avi. One observation about your recommendation for cache=none:

 We run hosts of VMs accessing drives backed by logical volumes carved out
 from md RAID1. Each host has 32GB RAM and eight cores, divided between (say)
 twenty virtual machines, which pretty much fill the available memory on the
 host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback
 caching turned on get advertised to the guest as having a write-cache, and
 FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback
 isn't acting as cache=neverflush like it would have done a year ago. I know
 that comparing performance for cache=none against that unsafe behaviour
 would be somewhat unfair!)

 Wasteful duplication of page cache between guest and host notwithstanding,
 turning on cache=writeback is a spectacular performance win for our guests.
 For example, even IDE with cache=writeback easily beats virtio with
 cache=none in most of the guest filesystem performance tests I've tried. The
 anecdotal feedback from clients is also very strongly in favour of
 cache=writeback.

 
 Is this with qcow2, raw file, or direct volume access?
 
 I can understand it for qcow2, but for direct volume access this 
 shouldn't happen.  The guest schedules as many writes as it can, 
 followed by a sync.  The host (and disk) can then reschedule them 
 whether they are in the writeback cache or in the block layer, and must 
 sync in the same way once completed.
 
 Perhaps what we need is bdrv_aio_submit() which can take a number of 
 requests.  For direct volume access, this allows easier reordering 
 (io_submit() should plug the queues before it starts processing and 
 unplug them when done, though I don't see the code for this?).  For 
 qcow2, we can coalesce metadata updates for multiple requests into one 
 RMW (for example, a sequential write split into multiple 64K-256K write 
 requests).

We already do merge sequential writes back into one larger request. So
this is in fact a case that wouldn't benefit from such changes. It may
help for other cases. But even if it did, coalescing metadata writes in
qcow2 sounds like a good way to mess up, so I'd stay with doing it only
for the data itself.

Apart from that, wouldn't your points apply to writeback as well?

Kevin


Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Avi Kivity

On 03/16/2010 11:54 AM, Kevin Wolf wrote:



Is this with qcow2, raw file, or direct volume access?

I can understand it for qcow2, but for direct volume access this
shouldn't happen.  The guest schedules as many writes as it can,
followed by a sync.  The host (and disk) can then reschedule them
whether they are in the writeback cache or in the block layer, and must
sync in the same way once completed.

Perhaps what we need is bdrv_aio_submit() which can take a number of
requests.  For direct volume access, this allows easier reordering
(io_submit() should plug the queues before it starts processing and
unplug them when done, though I don't see the code for this?).  For
qcow2, we can coalesce metadata updates for multiple requests into one
RMW (for example, a sequential write split into multiple 64K-256K write
requests).
 

We already do merge sequential writes back into one larger request. So
this is in fact a case that wouldn't benefit from such changes.


I'm not happy with that.  It increases overall latency.  With qcow2 it's 
fine, but I'd let requests to raw volumes flow unaltered.



It may
help for other cases. But even if it did, coalescing metadata writes in
qcow2 sounds like a good way to mess up, so I'd stay with doing it only
for the data itself.
   


I don't see why.


Apart from that, wouldn't your points apply to writeback as well?
   


They do, but for writeback the host kernel already does all the 
coalescing/merging/blah for us.




Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Christoph Hellwig
Avi,

cache=writeback can be faster than cache=none for the same reasons
a disk cache speeds up access.  As long as the I/O mix contains more
asynchronous than synchronous writes it allows the host to do much
more reordering, only limited by the cache size (which can be quite
huge when using the host pagecache) and the amount of cache flushes
coming from the host.  If you have an fsync-heavy workload or metadata
operations with a filesystem like the current XFS you will get lots
of cache flushes that limit the use of the additional cache.

If you don't have a lot of cache flushes, e.g. due to dumb
applications that do not issue fsync, or because ext3 in its default
mode never issues cache flushes, the benefit will be enormous, but so
will the data loss and possible corruption.

But even for something like btrfs, which does provide data integrity
but issues cache flushes fairly efficiently, cache=writeback may
provide a quite nice speedup, especially with multiple guests
accessing the same spindle(s).

But I wouldn't be surprised if IBM's extreme differences are indeed due
to the extremely unsafe ext3 default behaviour.


Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Avi Kivity

On 03/16/2010 12:26 PM, Christoph Hellwig wrote:

Avi,

cache=writeback can be faster than cache=none for the same reasons
a disk cache speeds up access.  As long as the I/O mix contains more
asynchronous then synchronous writes it allows the host to do much
more reordering, only limited by the cache size (which can be quite
huge when using the host pagecache) and the amount of cache flushes
coming from the host.  If you have a fsync heavy workload or metadata
operation with a filesystem like the current XFS you will get lots
of cache flushes that make the use of the additional cache limits.
   


Are you talking about direct volume access or qcow2?

For direct volume access, I still don't get it.  The number of barriers 
issued by the host must equal (or exceed, but that's pointless) the
number of barriers issued by the guest.  cache=writeback allows the host 
to reorder writes, but so does cache=none.  Where does the difference 
come from?


Put it another way.  In an unvirtualized environment, if you implement a 
write cache in a storage driver (not device), and sync it on a barrier 
request, would you expect to see a performance improvement?




If you don't have a of lot of cache flushes, e.g. due to dumb
applications that do not issue fsync, or even run ext3 in it's default
mode never issues cache flushes the benefit will be enormous, but the
data loss and possible corruption will be enormous.
   


Shouldn't the host never issue cache flushes in this case? (for direct 
volume access; qcow2 still needs flushes for metadata integrity).



But even for something like btrfs that does provide data integrity
but issues cache flushes fairly effeciently data=writeback may
provide a quite nice speedup, especially if using multiple guest
accessing the same spindle(s).

But I wouldn't be surprised if IBM's exteme differences are indeed due
to the extremly unsafe ext3 default behaviour.
   



--
error compiling committee.c: too many arguments to function



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Christoph Hellwig
On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote:
 Are you talking about direct volume access or qcow2?

Doesn't matter.

 For direct volume access, I still don't get it.  The number of barriers 
 issues by the host must equal (or exceed, but that's pointless) the 
 number of barriers issued by the guest.  cache=writeback allows the host 
 to reorder writes, but so does cache=none.  Where does the difference 
 come from?
 
 Put it another way.  In an unvirtualized environment, if you implement a 
 write cache in a storage driver (not device), and sync it on a barrier 
 request, would you expect to see a performance improvement?

cache=none only allows very limited reordering in the host.  O_DIRECT
is synchronous on the host, so there's just some very limited reordering
going on in the elevator if we have other I/O going on in parallel.
In addition to that the disk writecache can perform limited reordering
and caching, but the disk cache has a rather limited size.  The host
pagecache gives a much wider opportunity to reorder, especially if
the guest workload is not cache flush heavy.  If the guest workload
is extremely cache flush heavy the usefulness of the pagecache is rather
limited, as we'll only use very little of it, but pay by having to do
a data copy.  If the workload is not cache flush heavy, and we have
multiple guests doing I/O to the same spindles, it will allow the host
to do much more efficient data writeout by being able to do better
ordered (less seeky) and bigger I/O (especially if the host has real
storage, compared to IDE for the guest).

 If you don't have a of lot of cache flushes, e.g. due to dumb
 applications that do not issue fsync, or even run ext3 in it's default
 mode never issues cache flushes the benefit will be enormous, but the
 data loss and possible corruption will be enormous.

 
 Shouldn't the host never issue cache flushes in this case? (for direct 
 volume access; qcow2 still needs flushes for metadata integrity).

If the guest never issues a flush, neither will the host, indeed.  Data
will only go to disk by background writeout or memory pressure.



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Avi Kivity

On 03/16/2010 12:44 PM, Christoph Hellwig wrote:

On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote:
   

Are you talking about direct volume access or qcow2?
 

Doesn't matter.

   

For direct volume access, I still don't get it.  The number of barriers
issues by the host must equal (or exceed, but that's pointless) the
number of barriers issued by the guest.  cache=writeback allows the host
to reorder writes, but so does cache=none.  Where does the difference
come from?

Put it another way.  In an unvirtualized environment, if you implement a
write cache in a storage driver (not device), and sync it on a barrier
request, would you expect to see a performance improvement?
 

cache=none only allows very limited reorderning in the host.  O_DIRECT
is synchronous on the host, so there's just some very limited reordering
going on in the elevator if we have other I/O going on in parallel.
   


Presumably there is lots of I/O going on, or we wouldn't be having this 
conversation.



In addition to that the disk writecache can perform limited reodering
and caching, but the disk cache has a rather limited size.  The host
pagecache gives a much wieder opportunity to reorder, especially if
the guest workload is not cache flush heavy.  If the guest workload
is extremly cache flush heavy the usefulness of the pagecache is rather
limited, as we'll only use very little of it, but pay by having to do
a data copy.  If the workload is not cache flush heavy, and we have
multiple guests doing I/O to the same spindles it will allow the host
do do much more efficient data writeout by beeing able to do better
ordered (less seeky) and bigger I/O (especially if the host has real
storage compared to ide for the guest).
   


Let's assume the guest has virtio (I agree with IDE we need reordering 
on the host).  The guest sends batches of I/O separated by cache 
flushes.  If the batches are smaller than the virtio queue length, 
ideally things look like:


  io_submit(..., batch_size_1);
  io_getevents(..., batch_size_1);
  fdatasync();
  io_submit(..., batch_size_2);
  io_getevents(..., batch_size_2);
  fdatasync();
  io_submit(..., batch_size_3);
  io_getevents(..., batch_size_3);
  fdatasync();

(certainly that won't happen today, but it could in principle).

How does a write cache give any advantage?  The host kernel sees 
_exactly_ the same information as it would from a bunch of threaded 
pwritev()s followed by fdatasync().
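
As an aside, the batch-then-flush pattern above maps onto Linux native AIO
roughly as in the sketch below (hypothetical image file, arbitrary batch and
block sizes, trivial error handling; link with -laio). It is only meant to
show that the host sees the whole batch before the flush, exactly as it
would from threaded pwritev()s followed by fdatasync():

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BATCH 16
#define BLKSZ 4096

int main(void)
{
    io_context_t ctx = 0;
    struct iocb iocbs[BATCH], *iops[BATCH];
    struct io_event events[BATCH];
    void *buf;
    int fd, i;

    fd = open("disk.img", O_RDWR | O_DIRECT);       /* hypothetical image */
    if (fd < 0 || posix_memalign(&buf, BLKSZ, BLKSZ))
        return 1;
    memset(buf, 0xab, BLKSZ);
    if (io_setup(BATCH, &ctx) < 0)
        return 1;

    /* one batch: queue BATCH writes, wait for them all, then flush */
    for (i = 0; i < BATCH; i++) {
        io_prep_pwrite(&iocbs[i], fd, buf, BLKSZ, (long long)i * BLKSZ);
        iops[i] = &iocbs[i];
    }
    if (io_submit(ctx, BATCH, iops) != BATCH)
        return 1;
    if (io_getevents(ctx, BATCH, BATCH, events, NULL) != BATCH)
        return 1;
    if (fdatasync(fd))          /* this is where the guest's flush ends up */
        return 1;

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}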


(wish: IO_CMD_ORDERED_FDATASYNC)

If the batch size is larger than the virtio queue size, or if there are 
no flushes at all, then yes the huge write cache gives more opportunity 
for reordering.  But we're already talking hundreds of requests here.


Let's say the virtio queue size was unlimited.  What merging/reordering 
opportunity are we missing on the host?  Again we have exactly the same 
information: either the pagecache lru + radix tree that identifies all 
dirty pages in disk order, or the block queue with pending requests that 
contains exactly the same information.


Something is wrong.  Maybe it's my understanding, but on the other hand 
it may be a piece of kernel code.


--
error compiling committee.c: too many arguments to function



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-03-16 13:08:28]:

 On 03/16/2010 12:44 PM, Christoph Hellwig wrote:
 On Tue, Mar 16, 2010 at 12:36:31PM +0200, Avi Kivity wrote:
 Are you talking about direct volume access or qcow2?
 Doesn't matter.
 
 For direct volume access, I still don't get it.  The number of barriers
 issues by the host must equal (or exceed, but that's pointless) the
 number of barriers issued by the guest.  cache=writeback allows the host
 to reorder writes, but so does cache=none.  Where does the difference
 come from?
 
 Put it another way.  In an unvirtualized environment, if you implement a
 write cache in a storage driver (not device), and sync it on a barrier
 request, would you expect to see a performance improvement?
 cache=none only allows very limited reorderning in the host.  O_DIRECT
 is synchronous on the host, so there's just some very limited reordering
 going on in the elevator if we have other I/O going on in parallel.
 
 Presumably there is lots of I/O going on, or we wouldn't be having
 this conversation.


We are speaking of multiple VMs doing I/O in parallel.
 
 In addition to that the disk writecache can perform limited reodering
 and caching, but the disk cache has a rather limited size.  The host
 pagecache gives a much wieder opportunity to reorder, especially if
 the guest workload is not cache flush heavy.  If the guest workload
 is extremly cache flush heavy the usefulness of the pagecache is rather
 limited, as we'll only use very little of it, but pay by having to do
 a data copy.  If the workload is not cache flush heavy, and we have
 multiple guests doing I/O to the same spindles it will allow the host
 do do much more efficient data writeout by beeing able to do better
 ordered (less seeky) and bigger I/O (especially if the host has real
 storage compared to ide for the guest).
 
 Let's assume the guest has virtio (I agree with IDE we need
 reordering on the host).  The guest sends batches of I/O separated
 by cache flushes.  If the batches are smaller than the virtio queue
 length, ideally things look like:
 
  io_submit(..., batch_size_1);
  io_getevents(..., batch_size_1);
  fdatasync();
  io_submit(..., batch_size_2);
   io_getevents(..., batch_size_2);
   fdatasync();
   io_submit(..., batch_size_3);
   io_getevents(..., batch_size_3);
   fdatasync();
 
 (certainly that won't happen today, but it could in principle).

 How does a write cache give any advantage?  The host kernel sees
 _exactly_ the same information as it would from a bunch of threaded
 pwritev()s followed by fdatasync().


Are you suggesting that the model with cache=writeback gives us the
same I/O pattern as cache=none, so there are no opportunities for
optimization?
 
 (wish: IO_CMD_ORDERED_FDATASYNC)
 
 If the batch size is larger than the virtio queue size, or if there
 are no flushes at all, then yes the huge write cache gives more
 opportunity for reordering.  But we're already talking hundreds of
 requests here.
 
 Let's say the virtio queue size was unlimited.  What
 merging/reordering opportunity are we missing on the host?  Again we
 have exactly the same information: either the pagecache lru + radix
 tree that identifies all dirty pages in disk order, or the block
 queue with pending requests that contains exactly the same
 information.
 
 Something is wrong.  Maybe it's my understanding, but on the other
 hand it may be a piece of kernel code.
 

I assume you are talking of dedicated disk partitions and not
individual disk images residing on the same partition.

-- 
Three Cheers,
Balbir


Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-16 Thread Avi Kivity

On 03/16/2010 04:27 PM, Balbir Singh wrote:



Let's assume the guest has virtio (I agree with IDE we need
reordering on the host).  The guest sends batches of I/O separated
by cache flushes.  If the batches are smaller than the virtio queue
length, ideally things look like:

  io_submit(..., batch_size_1);
  io_getevents(..., batch_size_1);
  fdatasync();
  io_submit(..., batch_size_2);
   io_getevents(..., batch_size_2);
   fdatasync();
   io_submit(..., batch_size_3);
   io_getevents(..., batch_size_3);
   fdatasync();

(certainly that won't happen today, but it could in principle).

How does a write cache give any advantage?  The host kernel sees
_exactly_ the same information as it would from a bunch of threaded
pwritev()s followed by fdatasync().

 

Are you suggesting that the model with cache=writeback gives us the
same I/O pattern as cache=none, so there are no opportunities for
optimization?
   


Yes.  The guest also has a large cache with the same optimization algorithm.



   

(wish: IO_CMD_ORDERED_FDATASYNC)

If the batch size is larger than the virtio queue size, or if there
are no flushes at all, then yes the huge write cache gives more
opportunity for reordering.  But we're already talking hundreds of
requests here.

Let's say the virtio queue size was unlimited.  What
merging/reordering opportunity are we missing on the host?  Again we
have exactly the same information: either the pagecache lru + radix
tree that identifies all dirty pages in disk order, or the block
queue with pending requests that contains exactly the same
information.

Something is wrong.  Maybe it's my understanding, but on the other
hand it may be a piece of kernel code.

 

I assume you are talking of dedicated disk partitions and not
individual disk images residing on the same partition.
   


Correct. Images in files introduce new writes which can be optimized.

--
error compiling committee.c: too many arguments to function



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Avi Kivity

On 03/15/2010 09:22 AM, Balbir Singh wrote:

Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singhbal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario

- In a virtualized environment with cache!=none, we see
   double caching - (one in the host and one in the guest). As
   we try to scale guests, cache usage across the system grows.
   The goal of this patch is to reclaim page cache when Linux is running
   as a guest and get the host to hold the page cache and manage it.
   There might be temporary duplication, but in the long run, memory
   in the guests would be used for mapped pages.
   


Well, for a guest, host page cache is a lot slower than guest page cache.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-03-15 09:48:05]:

 On 03/15/2010 09:22 AM, Balbir Singh wrote:
 Selectively control Unmapped Page Cache (nospam version)
 
 From: Balbir Singhbal...@linux.vnet.ibm.com
 
 This patch implements unmapped page cache control via preferred
 page cache reclaim. The current patch hooks into kswapd and reclaims
 page cache if the user has requested for unmapped page control.
 This is useful in the following scenario
 
 - In a virtualized environment with cache!=none, we see
double caching - (one in the host and one in the guest). As
we try to scale guests, cache usage across the system grows.
The goal of this patch is to reclaim page cache when Linux is running
as a guest and get the host to hold the page cache and manage it.
There might be temporary duplication, but in the long run, memory
in the guests would be used for mapped pages.
 
 Well, for a guest, host page cache is a lot slower than guest page cache.


Yes, it is a virtio call away, but is the cost of paying twice in
terms of memory acceptable? One of the reasons I created a boot
parameter was to deal with selective enablement for cases where
memory is the most important resource being managed.

I do see a hit in performance with my results (please see the data
below), but the savings are quite large. The other solution mentioned
in the TODOs is to have the balloon driver invoke this path. The
sysctl also allows the guest to tune the amount of unmapped page cache
if needed.

The knobs are for

1. Selective enablement
2. Selective control of the % of unmapped pages

-- 
Three Cheers,
Balbir


Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Avi Kivity

On 03/15/2010 10:07 AM, Balbir Singh wrote:

* Avi Kivitya...@redhat.com  [2010-03-15 09:48:05]:

   

On 03/15/2010 09:22 AM, Balbir Singh wrote:
 

Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singhbal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario

- In a virtualized environment with cache!=none, we see
   double caching - (one in the host and one in the guest). As
   we try to scale guests, cache usage across the system grows.
   The goal of this patch is to reclaim page cache when Linux is running
   as a guest and get the host to hold the page cache and manage it.
   There might be temporary duplication, but in the long run, memory
   in the guests would be used for mapped pages.
   

Well, for a guest, host page cache is a lot slower than guest page cache.

 

Yes, it is a virtio call away, but is the cost of paying twice in
terms of memory acceptable?


Usually, it isn't, which is why I recommend cache=off.


One of the reasons I created a boot
parameter was to deal with selective enablement for cases where
memory is the most important resource being managed.

I do see a hit in performance with my results (please see the data
below), but the savings are quite large. The other solution mentioned
in the TODOs is to have the balloon driver invoke this path. The
sysctl also allows the guest to tune the amount of unmapped page cache
if needed.

The knobs are for

1. Selective enablement
2. Selective control of the % of unmapped pages
   


An alternative path is to enable KSM for page cache.  Then we have 
direct read-only guest access to host page cache, without any guest 
modifications required.  That will be pretty difficult to achieve though 
- will need a readonly bit in the page cache radix tree, and teach all 
paths to honour it.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-03-15 10:27:45]:

 On 03/15/2010 10:07 AM, Balbir Singh wrote:
 * Avi Kivitya...@redhat.com  [2010-03-15 09:48:05]:
 
 On 03/15/2010 09:22 AM, Balbir Singh wrote:
 Selectively control Unmapped Page Cache (nospam version)
 
 From: Balbir Singhbal...@linux.vnet.ibm.com
 
 This patch implements unmapped page cache control via preferred
 page cache reclaim. The current patch hooks into kswapd and reclaims
 page cache if the user has requested for unmapped page control.
 This is useful in the following scenario
 
 - In a virtualized environment with cache!=none, we see
double caching - (one in the host and one in the guest). As
we try to scale guests, cache usage across the system grows.
The goal of this patch is to reclaim page cache when Linux is running
as a guest and get the host to hold the page cache and manage it.
There might be temporary duplication, but in the long run, memory
in the guests would be used for mapped pages.
 Well, for a guest, host page cache is a lot slower than guest page cache.
 
 Yes, it is a virtio call away, but is the cost of paying twice in
 terms of memory acceptable?
 
 Usually, it isn't, which is why I recommend cache=off.


cache=off only works for filesystems that support *direct I/O*, and my
concern is that one of the side-effects is that idle VMs can consume a lot
of memory (assuming all the memory is available to them). As the number of
VMs grows, they could cache a whole lot of memory. In my experiments I
found that the total amount of memory cached far exceeded the mapped
ratio by a large amount when we had idle VMs. The philosophy of this
patch is to move the caching to the _host_ and let the host maintain
the cache instead of the guest.
 
 One of the reasons I created a boot
 parameter was to deal with selective enablement for cases where
 memory is the most important resource being managed.
 
 I do see a hit in performance with my results (please see the data
 below), but the savings are quite large. The other solution mentioned
 in the TODOs is to have the balloon driver invoke this path. The
 sysctl also allows the guest to tune the amount of unmapped page cache
 if needed.
 
 The knobs are for
 
 1. Selective enablement
 2. Selective control of the % of unmapped pages
 
 An alternative path is to enable KSM for page cache.  Then we have
 direct read-only guest access to host page cache, without any guest
 modifications required.  That will be pretty difficult to achieve
 though - will need a readonly bit in the page cache radix tree, and
 teach all paths to honour it.
 

Yes, it is, I've taken a quick look. I am not sure if de-duplication
would be the best approach; maybe dropping the page from the host page
cache would be a good first step. Data consistency would be much easier to
maintain that way: as long as the guest is not writing frequently to
that page, we don't need the page cache in the host.
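
For what it's worth, the closest host-side userspace analogue of "dropping
the page from the host page cache" is something like the sketch below, using
posix_fadvise(DONTNEED) on the image file. This is not what the patch does
(the patch works from the guest side via kswapd) and the file name is made
up; it only illustrates that dropping, unlike read-only sharing, needs no
new consistency machinery:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Flush dirty data first: DONTNEED quietly skips dirty pages. */
static int drop_host_cache(int fd, off_t offset, off_t len)
{
    int rc;

    if (fdatasync(fd))
        return -1;
    rc = posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    if (rc) {
        fprintf(stderr, "posix_fadvise: %d\n", rc);
        return -1;
    }
    return 0;
}

int main(void)
{
    int fd = open("guest-image.img", O_RDWR);       /* hypothetical image */

    if (fd < 0)
        return 1;
    /* len == 0 means "from offset to the end of the file" */
    return drop_host_cache(fd, 0, 0) ? 1 : 0;
}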

 -- 
 Do not meddle in the internals of kernels, for they are subtle and quick to 
 panic.
 

-- 
Three Cheers,
Balbir


Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Avi Kivity

On 03/15/2010 11:17 AM, Balbir Singh wrote:

* Avi Kivitya...@redhat.com  [2010-03-15 10:27:45]:

   

On 03/15/2010 10:07 AM, Balbir Singh wrote:
 

* Avi Kivitya...@redhat.com   [2010-03-15 09:48:05]:

   

On 03/15/2010 09:22 AM, Balbir Singh wrote:
 

Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singhbal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario

- In a virtualized environment with cache!=none, we see
   double caching - (one in the host and one in the guest). As
   we try to scale guests, cache usage across the system grows.
   The goal of this patch is to reclaim page cache when Linux is running
   as a guest and get the host to hold the page cache and manage it.
   There might be temporary duplication, but in the long run, memory
   in the guests would be used for mapped pages.
   

Well, for a guest, host page cache is a lot slower than guest page cache.

 

Yes, it is a virtio call away, but is the cost of paying twice in
terms of memory acceptable?
   

Usually, it isn't, which is why I recommend cache=off.

 

cache=off works for *direct I/O* supported filesystems and my concern is that
one of the side-effects is that idle VM's can consume a lot of memory
(assuming all the memory is available to them). As the number of VM's
grow, they could cache a whole lot of memory. In my experiments I
found that the total amount of memory cached far exceeded the mapped
ratio by a large amount when we had idle VM's. The philosophy of this
patch is to move the caching to the _host_ and let the host maintain
the cache instead of the guest.
   


That's only beneficial if the cache is shared.  Otherwise, you could use 
the balloon to evict cache when memory is tight.


Shared cache is mostly a desktop thing where users run similar 
workloads.  For servers, it's much less likely.  So a modified-guest 
doesn't help a lot here.



One of the reasons I created a boot
parameter was to deal with selective enablement for cases where
memory is the most important resource being managed.

I do see a hit in performance with my results (please see the data
below), but the savings are quite large. The other solution mentioned
in the TODOs is to have the balloon driver invoke this path. The
sysctl also allows the guest to tune the amount of unmapped page cache
if needed.

The knobs are for

1. Selective enablement
2. Selective control of the % of unmapped pages
   

An alternative path is to enable KSM for page cache.  Then we have
direct read-only guest access to host page cache, without any guest
modifications required.  That will be pretty difficult to achieve
though - will need a readonly bit in the page cache radix tree, and
teach all paths to honour it.

 

Yes, it is, I've taken a quick look. I am not sure if de-duplication
would be the best approach, may be dropping the page in the page cache
might be a good first step. Data consistency would be much easier to
maintain that way, as long as the guest is not writing frequently to
that page, we don't need the page cache in the host.
   


Trimming the host page cache should happen automatically under
pressure.  Since the page is cached by the guest, it won't be re-read,
so the host copy ends up rarely used and is eventually dropped.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-03-15 11:27:56]:

 The knobs are for
 
 1. Selective enablement
 2. Selective control of the % of unmapped pages
 An alternative path is to enable KSM for page cache.  Then we have
 direct read-only guest access to host page cache, without any guest
 modifications required.  That will be pretty difficult to achieve
 though - will need a readonly bit in the page cache radix tree, and
 teach all paths to honour it.
 
 Yes, it is, I've taken a quick look. I am not sure if de-duplication
 would be the best approach, may be dropping the page in the page cache
 might be a good first step. Data consistency would be much easier to
 maintain that way, as long as the guest is not writing frequently to
 that page, we don't need the page cache in the host.
 
 Trimming the host page cache should happen automatically under
 pressure.  Since the page is cached by the guest, it won't be
 re-read, so the host page is not frequently used and then dropped.


Yes, agreed, but dropping is easier than tagging cache as read-only
and getting everybody to understand read-only cached pages. 

-- 
Three Cheers,
Balbir


Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Randy Dunlap
On Mon, 15 Mar 2010 12:52:15 +0530 Balbir Singh wrote:

 Selectively control Unmapped Page Cache (nospam version)
 
 From: Balbir Singh bal...@linux.vnet.ibm.com
 
 This patch implements unmapped page cache control via preferred
 page cache reclaim. The current patch hooks into kswapd and reclaims
 page cache if the user has requested for unmapped page control.
 This is useful in the following scenario
 
 - In a virtualized environment with cache!=none, we see
   double caching - (one in the host and one in the guest). As
   we try to scale guests, cache usage across the system grows.
   The goal of this patch is to reclaim page cache when Linux is running
   as a guest and get the host to hold the page cache and manage it.
   There might be temporary duplication, but in the long run, memory
   in the guests would be used for mapped pages.
 - The option is controlled via a boot option and the administrator
   can selectively turn it on, on a need to use basis.
 
 A lot of the code is borrowed from zone_reclaim_mode logic for
 __zone_reclaim(). One might argue that with ballooning and
 KSM this feature is not very useful, but even with ballooning,
 we need extra logic to balloon multiple VMs and it is hard
 to figure out the correct amount of memory to balloon. With these
 patches applied, each guest has a sufficient amount of free memory
 available that can be easily seen and reclaimed by the balloon driver.
 The additional memory in the guest can be reused for additional
 applications or used to start additional guests/balance memory in
 the host.
 
 KSM currently does not de-duplicate host and guest page cache. The goal
 of this patch is to help automatically balance unmapped page cache when
 instructed to do so.
 
 There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
 and the number of pages to reclaim when unmapped_page_control argument
 is supplied. These numbers were chosen to avoid aggressiveness in
 reaping page cache ever so frequently, at the same time providing control.
 
 The sysctl for min_unmapped_ratio provides further control from
 within the guest on the amount of unmapped pages to reclaim.
 
 The patch is applied against mmotm feb-11-2010.

Hi,
If you go ahead with this, please add the boot parameter and its description
to Documentation/kernel-parameters.txt.
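
Something along these lines, perhaps (only a rough sketch of the usual
kernel-parameters.txt entry format, reusing the unmapped_page_control name
mentioned above; the final wording is of course up to the patch):

    unmapped_page_control
            [KNL] Enable reclaim of unmapped page cache via kswapd,
            so that a guest keeps its unmapped page cache small and
            leaves caching to the host. See also the
            vm.min_unmapped_ratio sysctl.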


 TODOS
 -
 1. Balance slab cache as well
 2. Invoke the balance routines from the balloon driver

---
~Randy


Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Anthony Liguori

On 03/15/2010 04:27 AM, Avi Kivity wrote:


That's only beneficial if the cache is shared.  Otherwise, you could 
use the balloon to evict cache when memory is tight.


Shared cache is mostly a desktop thing where users run similar 
workloads.  For servers, it's much less likely.  So a modified-guest 
doesn't help a lot here.


Not really.  In many cloud environments, there's a set of common images 
that are instantiated on each node.  Usually this is because you're 
running a horizontally scalable application or because you're supporting 
an ephemeral storage model.


In fact, with ephemeral storage, you typically want to use 
cache=writeback since you aren't providing data guarantees across 
shutdown/failure.


Regards,

Anthony Liguori


Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Chris Webb
Avi Kivity a...@redhat.com writes:

 On 03/15/2010 10:07 AM, Balbir Singh wrote:

 Yes, it is a virtio call away, but is the cost of paying twice in
 terms of memory acceptable?
 
 Usually, it isn't, which is why I recommend cache=off.

Hi Avi. One observation about your recommendation for cache=none:

We run hosts of VMs accessing drives backed by logical volumes carved out
from md RAID1. Each host has 32GB RAM and eight cores, divided between (say)
twenty virtual machines, which pretty much fill the available memory on the
host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback
caching turned on get advertised to the guest as having a write-cache, and
FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback
isn't acting as cache=neverflush like it would have done a year ago. I know
that comparing performance for cache=none against that unsafe behaviour
would be somewhat unfair!)

Wasteful duplication of page cache between guest and host notwithstanding,
turning on cache=writeback is a spectacular performance win for our guests.
For example, even IDE with cache=writeback easily beats virtio with
cache=none in most of the guest filesystem performance tests I've tried. The
anecdotal feedback from clients is also very strongly in favour of
cache=writeback.

With a host full of cache=none guests, IO contention between guests is
hugely problematic with non-stop seek from the disks to service tiny
O_DIRECT writes (especially without virtio), many of which needn't have been
synchronous if only there had been some way for the guest OS to tell qemu
that. Running with cache=writeback seems to reduce the frequency of disk
flush per guest to a much more manageable level, and to allow the host's
elevator to optimise writing out across the guests in between these flushes.

Cheers,

Chris.


Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Anthony Liguori

On 03/15/2010 03:23 PM, Chris Webb wrote:

Avi Kivitya...@redhat.com  writes:

   

On 03/15/2010 10:07 AM, Balbir Singh wrote:

 

Yes, it is a virtio call away, but is the cost of paying twice in
terms of memory acceptable?
   

Usually, it isn't, which is why I recommend cache=off.
 

Hi Avi. One observation about your recommendation for cache=none:

We run hosts of VMs accessing drives backed by logical volumes carved out
from md RAID1. Each host has 32GB RAM and eight cores, divided between (say)
twenty virtual machines, which pretty much fill the available memory on the
host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback
caching turned on get advertised to the guest as having a write-cache, and
FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback
isn't acting as cache=neverflush like it would have done a year ago. I know
that comparing performance for cache=none against that unsafe behaviour
would be somewhat unfair!)
   


I knew someone would do this...

This really gets down to your definition of safe behaviour.  As it 
stands, if you suffer a power outage, it may lead to guest corruption.


While we are correct in advertising a write-cache, write-caches are 
volatile and should a drive lose power, it could lead to data 
corruption.  Enterprise disks tend to have battery backed write caches 
to prevent this.


In the set up you're emulating, the host is acting as a giant write 
cache.  Should your host fail, you can get data corruption.


cache=writethrough provides a much stronger data guarantee.  Even in the 
event of a host failure, data integrity will be preserved.


Regards,

Anthony Liguori



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Christoph Hellwig
On Mon, Mar 15, 2010 at 06:43:06PM -0500, Anthony Liguori wrote:
 I knew someone would do this...

 This really gets down to your definition of safe behaviour.  As it  
 stands, if you suffer a power outage, it may lead to guest corruption.

 While we are correct in advertising a write-cache, write-caches are  
 volatile and should a drive lose power, it could lead to data  
 corruption.  Enterprise disks tend to have battery backed write caches  
 to prevent this.

 In the set up you're emulating, the host is acting as a giant write  
 cache.  Should your host fail, you can get data corruption.

 cache=writethrough provides a much stronger data guarantee.  Even in the  
 event of a host failure, data integrity will be preserved.

Actually cache=writeback is as safe as any normal host is with a
volatile disk cache, except that in this case the disk cache is
actually a lot larger.  With a properly implemented filesystem this
will never cause corruption.  You will lose recent updates after
the last sync/fsync/etc up to the size of the cache, but filesystem
metadata should never be corrupted, and data that has been forced to
disk using fsync/O_SYNC should never be lost either.  If it is, that's
a bug somewhere in the stack, but in my powerfail testing we never saw
it happen with xfs or ext3/4 after I fixed up the fsync code in the latter
two.



Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Anthony Liguori

On 03/15/2010 07:43 PM, Christoph Hellwig wrote:

On Mon, Mar 15, 2010 at 06:43:06PM -0500, Anthony Liguori wrote:
   

I knew someone would do this...

This really gets down to your definition of safe behaviour.  As it
stands, if you suffer a power outage, it may lead to guest corruption.

While we are correct in advertising a write-cache, write-caches are
volatile and should a drive lose power, it could lead to data
corruption.  Enterprise disks tend to have battery backed write caches
to prevent this.

In the set up you're emulating, the host is acting as a giant write
cache.  Should your host fail, you can get data corruption.

cache=writethrough provides a much stronger data guarantee.  Even in the
event of a host failure, data integrity will be preserved.
 

Actually cache=writeback is as safe as any normal host is with a
volatile disk cache, except that in this case the disk cache is
actually a lot larger.  With a properly implemented filesystem this
will never cause corruption.


Metadata corruption, not necessarily corruption of data stored in a file.


   You will lose recent updates after
the last sync/fsync/etc up to the size of the cache, but filesystem
metadata should never be corrupted, and data that has been forced to
disk using fsync/O_SYNC should never be lost either.


Not all software uses fsync as much as it should.  And oftentimes,
it's for good reason (like ext3).  This is mitigated by the fact that
there's usually a short window of time before metadata is flushed to
disk.  Adding another layer increases that delay.


IIUC, an O_DIRECT write using cache=writeback is not actually on the 
spindle when the write() completes.  Rather, an explicit fsync() would 
be required.  That will cause data corruption in many applications (like 
databases) regardless of whether the fs gets metadata corruption.
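
To spell that out, from the guest application's point of view the durable
sequence has to look something like the sketch below (file name and sizes
are made up). The O_DIRECT write bypasses the guest page cache but may
still be sitting in the virtual disk's write cache, i.e. the host page
cache; only the fdatasync() turns into a FLUSH that qemu forwards as
fsync() on the image:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("journal.dat", O_RDWR | O_CREAT | O_DIRECT, 0600);

    if (fd < 0 || posix_memalign(&buf, 4096, 4096))
        return 1;
    memset(buf, 0, 4096);

    if (pwrite(fd, buf, 4096, 0) != 4096)   /* reaches the virtual disk...       */
        return 1;
    if (fdatasync(fd))                      /* ...but is durable only after this */
        return 1;

    close(fd);
    free(buf);
    return 0;
}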


You could argue that the software should disable writeback caching on 
the virtual disk, but we don't currently support that so even if the 
application did, it's not going to help.


Regards,

Anthony Liguori


   If it is that's
a bug somewhere in the stack, but in my powerfail testing we never did
so using xfs or ext3/4 after I fixed up the fsync code in the latter
two.

   




Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Balbir Singh
* Chris Webb ch...@arachsys.com [2010-03-15 20:23:54]:

 Avi Kivity a...@redhat.com writes:
 
  On 03/15/2010 10:07 AM, Balbir Singh wrote:
 
  Yes, it is a virtio call away, but is the cost of paying twice in
  terms of memory acceptable?
  
  Usually, it isn't, which is why I recommend cache=off.
 
 Hi Avi. One observation about your recommendation for cache=none:
 
 We run hosts of VMs accessing drives backed by logical volumes carved out
 from md RAID1. Each host has 32GB RAM and eight cores, divided between (say)
 twenty virtual machines, which pretty much fill the available memory on the
 host. Our qemu-kvm is new enough that IDE and SCSI drives with writeback
 caching turned on get advertised to the guest as having a write-cache, and
 FLUSH gets translated to fsync() by qemu. (Consequently cache=writeback
 isn't acting as cache=neverflush like it would have done a year ago. I know
 that comparing performance for cache=none against that unsafe behaviour
 would be somewhat unfair!)
 
 Wasteful duplication of page cache between guest and host notwithstanding,
 turning on cache=writeback is a spectacular performance win for our guests.
 For example, even IDE with cache=writeback easily beats virtio with
 cache=none in most of the guest filesystem performance tests I've tried. The
 anecdotal feedback from clients is also very strongly in favour of
 cache=writeback.
 
 With a host full of cache=none guests, IO contention between guests is
 hugely problematic with non-stop seek from the disks to service tiny
 O_DIRECT writes (especially without virtio), many of which needn't have been
 synchronous if only there had been some way for the guest OS to tell qemu
 that. Running with cache=writeback seems to reduce the frequency of disk
 flush per guest to a much more manageable level, and to allow the host's
 elevator to optimise writing out across the guests in between these flushes.

Thanks for the input above; it is extremely useful. The goal of
these patches is that with cache != none, we allow double caching when
needed and then slowly take away unmapped pages, pushing the caching
to the host. There are knobs to control how much, etc., and the whole
feature is enabled via a boot parameter.

-- 
Three Cheers,
Balbir


Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter

2010-03-15 Thread Balbir Singh
* Randy Dunlap randy.dun...@oracle.com [2010-03-15 08:46:31]:

 On Mon, 15 Mar 2010 12:52:15 +0530 Balbir Singh wrote:
 
 Hi,
 If you go ahead with this, please add the boot parameter  its description
 to Documentation/kernel-parameters.txt.


I certainly will, thanks for keeping a watch. 

-- 
Three Cheers,
Balbir