Bug#927071: [Pkg-xen-devel] Bug#927071: xen: More balloon-leak observation

2020-09-18 Thread Hans van Kranenburg
Hi again,

On 5/1/19 12:55 AM, Elliott Mitchell wrote:
> On Mon, Apr 22, 2019 at 04:02:28PM +0200, Hans van Kranenburg wrote:
>> On 4/22/19 1:10 AM, Elliott Mitchell wrote:
>>> There is plenty of free memory for creating additional VMs (perhaps too
>>> much, and that confused Xen?), so this is really puzzling that memory is
>>> being ballooned away from Dom0.  At this point I plan after the next
>>> restart to double the allocation for Dom0 and see whether Dom0 is able
>>> to last more than a week.
>>
>> Weird. Can you log memory stats over time, so that you can see when it
>> happens, and correlate it to other events?
> 
> At this point there is only one real pattern I've noticed:  `smartd` was
> always the process which triggered the kernel OOM-killer.
> 
> Originally I was attributing this to `smartd` doing some large memory
> allocation during its night-time tasks (which I would attribute to
> perhaps `smartd` not being that well written).  Yet now, I never saw
> anything else trigger the OOM-killer and I'm now willing to speculate
> some I/O operation `smartd` was doing triggers a bug in Xen.

At first I replied with "I haven't heard about this symptom before your
report.", but later I realized that I am totally seeing the same kind of
behaviour.

During some debian-xen day in Feb 2020, I even had a bit of a closer
look at this together with Ian, and we ended up thinking that there's
actually some kind of obscure miscalculation bug happening. If you look
closely at the numbers in xl info and xl list, then you'll see that the
numbers just do not add up.
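
A minimal sketch of that comparison, for illustration only. It assumes the
`total_memory` and `free_memory` fields of `xl info` and the `Mem` column of
`xl list` (all in MiB). The function names and the sample figures are made up
and are not output from an affected host. Some remainder is expected, since
the hypervisor itself uses memory, so the thing to watch is whether the
remainder changes over time.

```python
def parse_xl_info(text):
    """Return the integer-valued 'key : value' fields of `xl info` output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            fields[key.strip()] = int(value.strip())
    return fields


def parse_xl_list(text):
    """Return {domain name: Mem in MiB} from `xl list` output."""
    domains = {}
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 3 and parts[2].isdigit():
            domains[parts[0]] = int(parts[2])
    return domains


def unaccounted_mib(info_text, list_text):
    """Memory that is neither free nor attributed to any listed domain."""
    info = parse_xl_info(info_text)
    domains = parse_xl_list(list_text)
    return info["total_memory"] - info["free_memory"] - sum(domains.values())


# Hypothetical sample data, NOT taken from a real affected host.
SAMPLE_INFO = """\
total_memory           : 16384
free_memory            : 9000
"""
SAMPLE_LIST = """\
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  1024     2     r-----    1000.0
fileserver                                   1  4096     2     -b----     500.0
"""

print(unaccounted_mib(SAMPLE_INFO, SAMPLE_LIST))
```

On a real host the two text arguments would be the captured output of
`xl info` and `xl list`.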

The dom0 gets some kind of fake-down-ballooning which is an accounting
error.

I can't provide more proof right now, because I have to reproduce the
thing in a simplified environment to be able to provide a kind of
walk-through scenario with all the output of the numbers.

And yes, I have seen oom killers do stuff in customer production
environments because of this. O_O

A member of my team has been busy doing storage migrations where we
attach new block devices to domUs and then sync all their data to the
new filesystem (moving from ext4 to btrfs and also to new iSCSI storage)
and later reboot after a final sync and then swap block devices, etc.
From the graphs we've been looking at, combined with when migration
stuff is happening, I have come to suspect that the
fake dom0 down-ballooning is related to grant mappings, since it seems
like the dom0 memory is not decreasing when attaching the new disk, but
it is when starting activity using it.

To be continued

Hans



Bug#927071: xen: More balloon-leak observation

2019-07-19 Thread Elliott Mitchell
What I'm seeing seems kind of related to the topic of XSA-300.  Mainly
something ballooning out pages.

The Debian Wiki advises reducing the amount of memory used by Domain-0.
Perhaps the Debian Wiki should be advising to try to keep the Domain-0
maximum substantially higher than the actual allocation in order to allow
for ballooning pages used for I/O?

For my case it looks like Domain-0 can function with less than 300MB of
allocated memory, but needs around 200MB of ballooned pages for I/O.


-- 
(\___(\___(\__  --=> 8-) EHM <=--  __/)___/)___/)
 \BS (| ehem+sig...@m5p.com  PGP 87145445 |)   /
  \_CS\   |  _  -O #include  O-   _  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445



Bug#927071: [Pkg-xen-devel] Bug#927071: xen: More balloon-leak observation

2019-04-30 Thread Elliott Mitchell
On Mon, Apr 22, 2019 at 04:02:28PM +0200, Hans van Kranenburg wrote:
> On 4/22/19 1:10 AM, Elliott Mitchell wrote:
> > There is plenty of free memory for creating additional VMs (perhaps too
> > much, and that confused Xen?), so this is really puzzling that memory is
> > being ballooned away from Dom0.  At this point I plan after the next
> > restart to double the allocation for Dom0 and see whether Dom0 is able
> > to last more than a week.
> 
> Weird. Can you log memory stats over time, so that you can see when it
> happens, and correlate it to other events?

At this point there is only one real pattern I've noticed:  `smartd` was
always the process which triggered the kernel OOM-killer.

Originally I was attributing this to `smartd` doing some large memory
allocation during its night-time tasks (which I would attribute to
perhaps `smartd` not being that well written).  Yet now, I never saw
anything else trigger the OOM-killer and I'm now willing to speculate
some I/O operation `smartd` was doing triggers a bug in Xen.


-- 
(\___(\___(\__  --=> 8-) EHM <=--  __/)___/)___/)
 \BS (| ehem+sig...@m5p.com  PGP 87145445 |)   /
  \_CS\   |  _  -O #include  O-   _  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445



Bug#927071: [Pkg-xen-devel] Bug#927071: xen: More balloon-leak observation

2019-04-22 Thread Hans van Kranenburg
Hi,

On 4/22/19 1:10 AM, Elliott Mitchell wrote:
> Referring to this as "balloon-leak" as it looks sort of like a
> memory-leak, but is instead memory disappearing into the balloon.

I haven't heard about this symptom before your report.

> There are two things which have been happening more recently which may
> have exacerbated this problem.  First, one of the DomUs is acting as a
> fileserver and that has been getting more usage recently.  Second, I've
> been testing block-device hotplug as a mechanism to transfer data between
> VMs (xl block-detach from one VM, then xl block-attach to another VM).

Did you look at the numbers and see if it happens when you do this?

> The DomUs appear absolutely unaffected by this.  Even as Dom0 has gotten
> to a situation where `shutdown -r now` fails, the DomUs appear to be
> chugging away with no problems.
> 
> There is plenty of free memory for creating additional VMs (perhaps too
> much, and that confused Xen?), so this is really puzzling that memory is
> being ballooned away from Dom0.  At this point I plan after the next
> restart to double the allocation for Dom0 and see whether Dom0 is able
> to last more than a week.

Weird. Can you log memory stats over time, so that you can see when it
happens, and correlate it to other events?
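
One possible way to do that logging, as a rough sketch rather than a
recommended tool. It assumes `xl` is on the PATH and is run as root in Dom0,
and it records only the `free_memory` field of `xl info` and the `Mem` column
of `xl list`. The function names, the log format, and the log path in the
usage note are made up for illustration.

```python
import subprocess
import time


def run_xl(*args):
    """Run an xl subcommand and return its output as text."""
    return subprocess.check_output(["xl", *args], text=True)


def sample(run=run_xl):
    """One snapshot: free_memory from `xl info`, Mem per domain from `xl list`."""
    snapshot = {}
    for line in run("info").splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "free_memory":
            snapshot["free_memory"] = int(value)
    for line in run("list").splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 3 and parts[2].isdigit():
            snapshot[parts[0]] = int(parts[2])
    return snapshot


def format_line(timestamp, snapshot):
    """Render a snapshot as one 'UTC-timestamp key=value ...' log line."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(timestamp))
    fields = " ".join(f"{k}={v}" for k, v in sorted(snapshot.items()))
    return f"{stamp} {fields}"


def log_forever(path, interval=60):
    """Append one snapshot line to `path` every `interval` seconds."""
    while True:
        with open(path, "a") as log:
            log.write(format_line(time.time(), sample()) + "\n")
        time.sleep(interval)
```

Calling `log_forever("/var/log/xen-mem.log")` (a hypothetical path) would
produce a log whose timestamps can be lined up with block-attach events or
the nightly `smartd` run.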

Hans



Bug#927071: xen: More balloon-leak observation

2019-04-21 Thread Elliott Mitchell
Referring to this as "balloon-leak" as it looks sort of like a
memory-leak, but is instead memory disappearing into the balloon.

There are two things which have been happening more recently which may
have exacerbated this problem.  First, one of the DomUs is acting as a
fileserver and that has been getting more usage recently.  Second, I've
been testing block-device hotplug as a mechanism to transfer data between
VMs (xl block-detach from one VM, then xl block-attach to another VM).

The DomUs appear absolutely unaffected by this.  Even as Dom0 has gotten
to a situation where `shutdown -r now` fails, the DomUs appear to be
chugging away with no problems.

There is plenty of free memory for creating additional VMs (perhaps too
much, and that confused Xen?), so this is really puzzling that memory is
being ballooned away from Dom0.  At this point I plan after the next
restart to double the allocation for Dom0 and see whether Dom0 is able
to last more than a week.


-- 
(\___(\___(\__  --=> 8-) EHM <=--  __/)___/)___/)
 \BS (| ehem+sig...@m5p.com  PGP 87145445 |)   /
  \_CS\   |  _  -O #include  O-   _  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445