Re: bhyve guest's memory representation & live migration using COW

2019-03-07 Thread Mark Johnston
On Mon, Mar 04, 2019 at 10:52:44AM +0200, Elena Mihailescu wrote:
> I'm writing this e-mail to keep you in the loop with the live
> migration progress and to ask for advice regarding the process of
> cleaning the dirty bit. At Rod Grimes's suggestion, I added the
> virtualization list at CC.
> 
> I had some issues with the vmm_dirty bit (a bit from oflags) and
> initially I thought that there were some issues related to the fact that in
> some cases oflags is directly updated (oflags = ) and with the
> fact that the dirty bit is set other than by using vm_page_dirty(),
> but it seems that the issues I had came from other sources, and now,
> after cleaning and refactoring the code a bit, it seems to work fine.
> Some of the code snippets that I thought that could affect the
> migration are at [1].
> 
> However, if I dump the guest memory after the live migration (the
> guest is not yet started after migration) and compare it with the
> source guest's ctx->baseaddr, there are some differences and I don't
> know where they may come from.
> 
> One of the things I must implement is related to the dirty bit
> mechanism. For now, I do not clear the dirty bit after I migrate a
> page.

Note that there is code that only calls vm_page_dirty(m) if
m->dirty != VM_PAGE_BITS_ALL.  One example is in vm_page_advise().
So if the page is dirty from the kernel's point of view and you clear
the VMM dirty bit, subsequent modifications may not be propagated to the
vm_page state.
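The pitfall can be reproduced in a small userland model. The model assumes, as the note implies, that the patch sets the VMM bit from within vm_page_dirty(). struct model_page, MODEL_BITS_ALL and both functions are hypothetical stand-ins, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for VM_PAGE_BITS_ALL; the value is illustrative. */
#define MODEL_BITS_ALL 0xffu

/* Hypothetical page: only the two fields relevant to the pitfall. */
struct model_page {
	uint8_t dirty;   /* kernel's view, like m->dirty */
	bool vmm_dirty;  /* migration's view, like the VMM oflags bit */
};

/* Assumed behaviour of the patched vm_page_dirty(): set both views. */
static void
model_page_dirty(struct model_page *m)
{
	m->dirty = MODEL_BITS_ALL;
	m->vmm_dirty = true;
}

/* Pattern noted in vm_page_advise(): skip the call if already fully dirty. */
static void
model_note_write(struct model_page *m)
{
	if (m->dirty != MODEL_BITS_ALL)
		model_page_dirty(m);
}
```

After the first write the page is fully dirty. If migration then clears only vmm_dirty, a second model_note_write() leaves vmm_dirty false, so the page would be skipped in the next round.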

> This is not a wrong implementation because in every round, I
> migrate all the pages that have ever been modified (since the guest
> was started), but it is not an optimal implementation. To clear the
> dirty bit, I've tried different mechanisms such as:
> - pmap_remove_write - kernel panic: pde entry not found for a wired mapped 
> page
> - vm_object_page_clean
> - vm_map_sync -> kernel panic
> - pmap_protect - remove write permission, copy page and add write
> permission -> kernel panic; pmap_demote_pde: page table page for a
> wired mapping is missing
>
> Any idea about a way of forcing a page to be cleared (so the physical
> dirty bit to be cleared)?

I'm not sure exactly what you mean, but pmap_clear_modify(m) will clear
the modification bit for mappings of m.  It seems like you want
something roughly like:

vm_page_xbusy(m);
if (pmap_is_modified(m))	/* modified through any mapping? */
	m->dirty = m->vmm_dirty = VM_PAGE_BITS_ALL;
pmap_clear_modify(m);		/* clear the modified bit in all mappings */
vm_page_xunbusy(m);

Or do we need to restrict the operation to a specific mapping of m?
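The snippet sidesteps the vm_page_advise() issue noted above because it relies on the modified bit rather than on vm_page_dirty() calls. A userland model of one harvest round shows this. hw_modified is a hypothetical stand-in for what pmap_is_modified() reports and pmap_clear_modify() clears. Busying and concurrent writers are ignored:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page model; not struct vm_page. */
struct model_page {
	bool hw_modified; /* stands in for the modified bit in the page's mappings */
	bool dirty;       /* like m->dirty == VM_PAGE_BITS_ALL */
	bool vmm_dirty;   /* the migration's own bit */
};

/* One pass over the pages, mirroring the snippet: fold, then clear. */
static void
harvest_modified(struct model_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (pages[i].hw_modified)		/* pmap_is_modified(m) */
			pages[i].dirty = pages[i].vmm_dirty = true;
		pages[i].hw_modified = false;		/* pmap_clear_modify(m) */
	}
}

/* "Send" every page marked for migration; clear only the migration bit. */
static size_t
send_marked(struct model_page *pages, size_t n)
{
	size_t sent = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].vmm_dirty) {
			pages[i].vmm_dirty = false;
			sent++;
		}
	}
	return (sent);
}
```

Suppose a page is rewritten after its dirty field is already set. The next harvest still catches it, because the decision comes from the cleared-and-reset modified bit.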
___
freebsd-virtualization@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-virtualization
To unsubscribe, send any mail to 
"freebsd-virtualization-unsubscr...@freebsd.org"


Re: Why can't I dtrace processes running in a jail from the host?

2018-08-10 Thread Mark Johnston
On Fri, Aug 10, 2018 at 09:15:22AM +0200, Patrick M. Hausen wrote:
> Hi!
> 
> > On 09.08.2018 at 16:52, Mark Johnston  wrote:
> > For userland static probes to be globally visible, the process needs to
> > register them with the kernel when it starts.  This is done
> > automatically using a constructor which issues ioctls to
> > /dev/dtrace/helper, hence the requirement for /dev/dtrace/* in the jail.
> 
> I figured as much. Enabling /dev/dtrace/* in the jail and restarting
> the jail made the probes visible in the host system.
> 
> I'm still somewhat stuck. What I'm trying to do is track down some
> performance problems in a large complex PHP web application.
> I have done this in the past on "regular" setups without jails
> and with PHP 5.6 compiled with dtrace support using
> /usr/local/share/dtrace-toolkit/Php/* ...
> 
> This setup is jailed, with PHP 7.2; dtrace support seems to be the
> default for the port.
> 
> I'm specifically after
> 
> php-fpm dtrace_execute_ex function-entry
> php-fpm dtrace_execute_ex function-return
> 
> of course, to see where the application spends its CPU cycles.
> 
> But regardless of whether I'm doing this on the host or in the jail, I only get
> these results:
> 
> dtrace -m php\*
>   ZEND_CATCH_SPEC_CONST_CV_HANDLER:exception-caught
>   php_request_shutdown:request-shutdown
>   php_request_startup:request-startup
>   zend_error:error
>   zend_throw_exception_internal:exception-thrown
> 
> Nothing else. Still, `dtrace -m php\* -l` does show all the probes.

Indeed, I can reproduce this.  Peering at the zend source code, it seems
that there's a difference in how the dtrace_execute_ex hook gets
installed: with 5.6, it's unconditional so long as zend is compiled with
HAVE_DTRACE.  In 7.2, the php process needs to have USE_ZEND_DTRACE=1
set in its environment for the dtrace function execution hook to be
installed.
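The difference can be pictured with a simplified model of the startup choice. This is not PHP's source, and the function names are hypothetical. The model assumes that a value such as 1, parsed as a non-zero integer, is what enables the hook:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>

/* Hypothetical hook type, loosely modelled on an execute hook. */
typedef void (*execute_hook_t)(const char *function_name);

static unsigned probe_fires;

static void
plain_execute(const char *function_name)
{
	(void)function_name;	/* execute without firing any probe */
}

static void
dtrace_execute(const char *function_name)
{
	(void)function_name;
	probe_fires++;		/* function-entry/return probes would fire here */
}

/* Chosen once at startup: the DTrace hook only if USE_ZEND_DTRACE is non-zero. */
static execute_hook_t
choose_execute_hook(void)
{
	const char *v = getenv("USE_ZEND_DTRACE");

	if (v != NULL && atoi(v) != 0)
		return (dtrace_execute);
	return (plain_execute);
}
```

In practice this suggests exporting USE_ZEND_DTRACE=1 in the environment php-fpm is started with, then restarting it.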


Re: Why can't I dtrace processes running in a jail from the host?

2018-08-09 Thread Mark Johnston
On Thu, Aug 09, 2018 at 01:09:00PM +0200, Patrick M. Hausen wrote:
> Hi all,
> 
> I'm wondering why on a busy hosting server with hundreds of PHP-FPM
> workers running in jails "dtrace -l" on the host does not show any
> PHP specific probes. PHP *is* compiled with dtrace support for all the
> jails.
> 
> Enabling /dev/dtrace/* via devfs.rules for a specific jail and then repeating
> the process *inside* the jail works as expected.
> 
> Shouldn't jailed processes be transparently visible from the host system
> but not vice versa?

For userland static probes to be globally visible, the process needs to
register them with the kernel when it starts.  This is done
automatically using a constructor which issues ioctls to
/dev/dtrace/helper, hence the requirement for /dev/dtrace/* in the jail.
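A userland model of such a constructor shows why the jail's devfs matters. The constructor runs before main() without the program doing anything, and in this model, registration is simply skipped when the helper node is missing. The function and variables are hypothetical and rely on the GCC/Clang constructor attribute. The model only checks that the node exists with access(2); it does not open the device and issue the real ioctls:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Device the registration constructor talks to, per the explanation above. */
#define HELPER_PATH "/dev/dtrace/helper"

static bool constructor_ran;   /* set before main() starts */
static bool helper_visible;    /* whether HELPER_PATH exists here */

/*
 * Hypothetical model of the registration constructor.  The real one opens
 * the helper and issues ioctls; this only checks for the node, so it has
 * no side effects.
 */
static void __attribute__((constructor))
model_register_probes(void)
{
	constructor_ran = true;
	helper_visible = (access(HELPER_PATH, F_OK) == 0);
	/* If !helper_visible, the probes are left unregistered. */
}
```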

In general it is still possible to use unregistered userland probes in
this scenario: dtrace(1) can discover them when it attaches to a
specified process.  I'm not sure how well this will work if the process
is jailed and dtrace(1) is invoked on the host, but it's worth trying.

I would be rather wary of enabling access to /dev/dtrace/* in a jail.
The kernel code which parses probe metadata has a large attack surface
and has had security holes in the past.


Re: Inspect pages created after a vm_object is marked as copy-on-write

2018-07-01 Thread Mark Johnston
On Sun, Jul 01, 2018 at 01:34:01AM +0300, Konstantin Belousov wrote:
> On Sat, Jun 30, 2018 at 05:59:56PM -0400, Mark Johnston wrote:
> > On Sat, Jun 30, 2018 at 10:38:21AM +0300, Mihai Carabas wrote:
> > > On Sat, Jun 30, 2018 at 1:52 AM, Mark Johnston  wrote:
> > > > On Fri, Jun 29, 2018 at 11:58:31AM +0300, Elena Mihailescu wrote:
> > > >> Is there anything I am doing wrong? Maybe I misunderstood something about
> > > >> the way the virtual memory works in FreeBSD.
> > > >
> > > > I'll note that inspecting and manipulating vm_map_entry and vm_object
> > > > structures in the bhyve code constitutes something of an abstraction
> > > > violation, though it's reasonable to proceed this way while working on a
> > > > prototype of the feature.  That is, I think you should keep trying your
> > > > current approach, but just be aware that you are using the copy-on-write
> > > > mechanism in a way that the VM system isn't really expecting.
> > > >
> > > 
> > > Can you point out the right approach in our case?
> > 
> > I am merely suggesting that once the required VM interactions are fully
> > understood, the mechanism implemented for bhyve should be generalized
> > and lifted into the VM code.  It's hard to say what the "right" approach
> > is, since I don't fully understand the proposed algorithm.  It sounds
> > like you might be attempting something like:
> > 
> > 1. mark the mappings of to-be-migrated objects as NEEDS_COW, so that a
> >    subsequent write fault triggers creation of a shadow object
> It is actually MAP_ENTRY_COW | MAP_ENTRY_NEEDS_COPY.

Indeed.

> Note that setting an entry to COW changes the behaviour of mprotect(2),
> at least.
> 
> > 2. invalidate all physical mappings of pages in the object to be copied,
> >    so that subsequent writes trigger a fault
> I do not think this is needed to detect writes after the COW is set.
> It is enough to remove the write permissions.  Same as fork() does,
> see the vm_map_copy_entry() code for the handling of MAP_ENTRY_NEEDS_COPY
> case.

Ah, right.
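The write-protection approach can be modelled in userland. The model drops write permission on every page and lets the first store to each page fault. A "fault handler" then records the page before restoring write access. Everything here (struct wp_region, protect_all(), guest_store()) is hypothetical and only mirrors the idea. In the kernel, the first write fault on a NEEDS_COPY entry would also create the shadow object:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Hypothetical model: per-page write permission plus a record of faults. */
struct wp_region {
	bool writable[NPAGES];
	bool written[NPAGES];  /* pages written since the last protect_all() */
	unsigned faults;
};

/* Like removing write permission from all mappings, as fork() does. */
static void
protect_all(struct wp_region *r)
{
	for (size_t i = 0; i < NPAGES; i++) {
		r->writable[i] = false;
		r->written[i] = false;
	}
}

/*
 * A store to page pg: a read-only page takes a "fault" that records it and
 * restores write access, so later stores to it do not fault again.
 */
static void
guest_store(struct wp_region *r, size_t pg)
{
	assert(pg < NPAGES);
	if (!r->writable[pg]) {
		r->faults++;
		r->written[pg] = true;
		r->writable[pg] = true;
	}
}
```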

> > 3. copy pages from the backing object to the destination
> As I understand, this is done right after the entry is marked
> as COW.
> 
> > 4. copy any pages from the shadow object to the destination
> And this is done after all backing data is copied and the process is
> suspended.

Right, I see.  Some reading suggests that in general we might perform
multiple iterations of this procedure before suspending the process and
performing any remaining copies.

> > 5. collapse the backing object into the shadow
> > 6. if the shadow object exists and was non-empty before the collapse,
> >goto 1
> Are you trying to describe how to undo the COW marking?  Marking an
> entry as COW really changes its semantics, and so far we have not needed
> an undo operation in the base system.  Collapsing the objects would reduce
> the pollution of the system with objects, but it does not change the
> meaning of the mappings back, e.g. their inheritance behaviour on
> fork.

I was just thinking about how to avoid creating a long chain of objects
if multiple iterations are in fact needed.
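Collapsing keeps the chain short across iterations, and a userland model can picture this. In the model, each collapse folds the backing object into the object above it. The types and functions are hypothetical. The model tracks only object count and page contents, not the entry's COW flags, which stay set as noted above:

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 4
#define ABSENT (-1)	/* page not resident in this object */

/* Hypothetical object: resident page contents plus the object it shadows. */
struct model_object {
	int page[NPAGES];
	struct model_object *backing;
};

static size_t
chain_length(const struct model_object *obj)
{
	size_t n = 0;

	for (; obj != NULL; obj = obj->backing)
		n++;
	return (n);
}

/*
 * Fold obj's backing object into obj.  Pages already in obj are newer and
 * win; the rest are taken over from the backing object.  The kernel's
 * preconditions (e.g. the backing object having no other users) are ignored.
 */
static void
collapse_into(struct model_object *obj)
{
	struct model_object *b = obj->backing;

	if (b == NULL)
		return;
	for (size_t i = 0; i < NPAGES; i++)
		if (obj->page[i] == ABSENT)
			obj->page[i] = b->page[i];
	obj->backing = b->backing;
}
```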


Re: Inspect pages created after a vm_object is marked as copy-on-write

2018-06-30 Thread Mark Johnston
On Sat, Jun 30, 2018 at 10:38:21AM +0300, Mihai Carabas wrote:
> On Sat, Jun 30, 2018 at 1:52 AM, Mark Johnston  wrote:
> > On Fri, Jun 29, 2018 at 11:58:31AM +0300, Elena Mihailescu wrote:
> >> Is there anything I am doing wrong? Maybe I misunderstood something about
> >> the way the virtual memory works in FreeBSD.
> >
> > I'll note that inspecting and manipulating vm_map_entry and vm_object
> > structures in the bhyve code constitutes something of an abstraction
> > violation, though it's reasonable to proceed this way while working on a
> > prototype of the feature.  That is, I think you should keep trying your
> > current approach, but just be aware that you are using the copy-on-write
> > mechanism in a way that the VM system isn't really expecting.
> >
> 
> Can you point out the right approach in our case?

I am merely suggesting that once the required VM interactions are fully
understood, the mechanism implemented for bhyve should be generalized
and lifted into the VM code.  It's hard to say what the "right" approach
is, since I don't fully understand the proposed algorithm.  It sounds
like you might be attempting something like:

1. mark the mappings of to-be-migrated objects as NEEDS_COW, so that a
   subsequent write fault triggers creation of a shadow object
2. invalidate all physical mappings of pages in the object to be copied,
   so that subsequent writes trigger a fault
3. copy pages from the backing object to the destination
4. copy any pages from the shadow object to the destination
5. collapse the backing object into the shadow
6. if the shadow object exists and was non-empty before the collapse,
   goto 1

Is that at all accurate?  I'm not familiar with the mechanisms used to
implement live migration in other hypervisors.
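The iterative scheme (copy while the guest runs, repeat with whatever was modified, then suspend and copy the rest) can be sketched as a userland model. The dirty[] array stands in for whatever identifies modified pages, such as the shadow object's contents in step 4. The workload, threshold and all names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 16

/* Hypothetical guest: page contents plus "modified since last copy" marks. */
struct guest {
	int mem[NPAGES];
	bool dirty[NPAGES];
};

/* Hypothetical workload: round r rewrites 8, 4, 2, 1, then 0 pages. */
static void
guest_run(struct guest *g, unsigned round)
{
	size_t n = NPAGES;

	for (unsigned r = 0; r <= round && n > 0; r++)
		n /= 2;
	for (size_t i = 0; i < n; i++) {
		g->mem[i]++;
		g->dirty[i] = true;
	}
}

static size_t
count_dirty(const struct guest *g)
{
	size_t n = 0;

	for (size_t i = 0; i < NPAGES; i++)
		n += g->dirty[i];
	return (n);
}

/* Copy marked pages to the destination and clear the marks. */
static void
send_dirty(struct guest *g, int *dst)
{
	for (size_t i = 0; i < NPAGES; i++) {
		if (g->dirty[i]) {
			dst[i] = g->mem[i];
			g->dirty[i] = false;
		}
	}
}

/* Iterative pre-copy; returns the number of live rounds before suspension. */
static unsigned
precopy_migrate(struct guest *g, int *dst, size_t threshold,
    unsigned max_rounds)
{
	unsigned rounds = 0;

	for (size_t i = 0; i < NPAGES; i++)
		g->dirty[i] = true;	/* the first round copies everything */
	while (rounds < max_rounds) {
		send_dirty(g, dst);	/* copy while the guest keeps running... */
		guest_run(g, rounds);	/* ...and dirties some pages again */
		rounds++;
		if (count_dirty(g) <= threshold)
			break;
	}
	send_dirty(g, dst);		/* guest suspended: copy the remainder */
	return (rounds);
}
```

With a threshold of 2, the model runs three live rounds that send 16, 8 and 4 pages, then copies the last 2 pages after suspension.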


Re: Inspect pages created after a vm_object is marked as copy-on-write

2018-06-29 Thread Mark Johnston
On Fri, Jun 29, 2018 at 11:58:31AM +0300, Elena Mihailescu wrote:
> Hello,
> 
> I am interested in whether there is a method to inspect what pages/objects were
> created after a vm_object (the vm_map_entry associated with the object) is
> marked as copy-on-write. More specifically, I'm interested only in the
> pages that were copied when a write operation was performed on a page that
> belongs to the object marked copy-on-write.
> 
> I need this for a live migration feature for bhyve in order to send the
> pages that were modified between the iterations in which I migrate the
> guest's memory (the guest's memory will be migrated in rounds: first, all
> memory will be sent to the remote host, then only the pages that were
> modified, and so on).
> 
> What I want to implement is the following:
> Step 1: Given a vm_object *obj, mark its associated vm_map_entry *entry as
> copy-on-write.
> Step 2: After a while (a non-deterministic amount of time),
> inspect/retrieve the pages that were created based on information existing
> in the object.
> 
> What I tried until now:
> 
> I implemented a function in kernel that:
> - gets the vmspace structure pointer for the current process
> - gets the vm_map structure pointer for the vmspace
> - iterates through each vm_map_entry and, based on the vm_offset_start and
> vm_offset_end, determines the vm_map_entry that contains the object I am
> interested in.
> - for this object, it prints some debug information such as shadow_count,
> ref_count, and whether it has a backing_object or not.
> The code written is similar to the code from here (the way in which I get
> vmspace for the current process and the way I am iterating through
> vm_map_entry and objects):
> [0]
> https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration/sys/amd64/vmm/vmm_dev.c#L979
> 
> I have read the following documentation about FreeBSD's implementation for
> virtual memory:
> [1] https://www.freebsd.org/doc/en/books/arch-handbook/vm.html
> [2]
> https://www.freebsd.org/doc/en_US.ISO8859-1/articles/vm-design/article.html
> [3] https://people.freebsd.org/~neel/bhyve/bhyve_nested_paging.pdf
> [4] http://www.cse.chalmers.se/edu/year/2011/course/EDA203/unix4.pdf
> 
> As far as I could tell after reading the documentation presented above, I
> should look for the object that the object I am interested in is a shadow
> of or an object that my object is shadow for.

Right. When a copy-on-write fault results in the creation of a new
object, the new object is said to shadow the original object, which
becomes the backing object for the shadow object.  When faults in the
corresponding map entry occur, the fault handler first searches for a
page in the map entry's object, and then falls back to the backing
object if necessary.
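A userland sketch of that lookup order follows, with hypothetical types. The shadow holds only the pages copied after a copy-on-write fault, so those are the pages found at depth 0. Pagers and zero-fill are not modelled:

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 4
#define ABSENT (-1)	/* page not resident in this object */

/* Hypothetical object: resident pages plus the object it shadows. */
struct model_object {
	int page[NPAGES];
	const struct model_object *backing;	/* NULL if none */
};

/*
 * Search the map entry's object first, then fall back through the backing
 * objects.  *depth reports where the page was found (0 = top object), or -1.
 */
static int
lookup_page(const struct model_object *obj, size_t pindex, int *depth)
{
	for (int d = 0; obj != NULL; obj = obj->backing, d++) {
		if (obj->page[pindex] != ABSENT) {
			*depth = d;
			return (obj->page[pindex]);
		}
	}
	*depth = -1;
	return (ABSENT);
}
```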

> To do that, I should inspect the following fields from the vm_object
> structure, among others
> (https://github.com/freebsd/freebsd/blob/master/sys/vm/vm_object.h#L98):
> 
> - int shadow_count; /* how many objects that this is a shadow for */
> - struct vm_object *backing_object; /* object that I'm a shadow of */
> 
> But in all my tests, for the object I am interested in, the shadow_count is
> 0 and the backing_object is NULL.
> 
> The code I use to mark the vm_map_entry for the object I am interested in
> as copy-on-write is here:
> [5]
> https://github.com/FreeBSD-UPB/freebsd/blob/projects/bhyve_migration/sys/amd64/vmm/vmm_dev.c#L949

MAP_ENTRY_NEEDS_COPY needs to be set in order for the copy-on-write
machinery to work the way you expect.  Take a look at vm_map_lookup():
when it sees that flag and the caller is attempting a write fault, it
creates a shadow object, updates the entry and clears
MAP_ENTRY_NEEDS_COPY, leaving MAP_ENTRY_COW set.
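That branch can be modelled in userland with hypothetical types and illustrative flag values. In this model, an entry with only COW set never gets a shadow. That is consistent with shadow_count staying 0 and backing_object staying NULL in the tests above. In the kernel the shadow is created with vm_object_shadow(), and read faults on such an entry are handled differently. Neither detail is modelled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative flag values; the real MAP_ENTRY_* constants are in vm_map.h. */
#define M_ENTRY_COW		0x1u
#define M_ENTRY_NEEDS_COPY	0x2u

/* Hypothetical, stripped-down object and map entry. */
struct model_object {
	struct model_object *backing;
	int shadow_count;
};

struct model_entry {
	unsigned eflags;
	struct model_object *object;
};

/*
 * Models only the write-fault branch: with NEEDS_COPY set, interpose a new
 * shadow object and clear NEEDS_COPY, leaving COW set.  Returns -1 if the
 * allocation fails.
 */
static int
model_lookup(struct model_entry *e, bool write_fault)
{
	struct model_object *shadow;

	if (!write_fault || (e->eflags & M_ENTRY_NEEDS_COPY) == 0)
		return (0);
	shadow = calloc(1, sizeof(*shadow));
	if (shadow == NULL)
		return (-1);
	shadow->backing = e->object;
	e->object->shadow_count++;
	e->object = shadow;
	e->eflags &= ~M_ENTRY_NEEDS_COPY;
	return (0);
}
```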

> Is there anything I am doing wrong? Maybe I misunderstood something about
> the way the virtual memory works in FreeBSD.

I'll note that inspecting and manipulating vm_map_entry and vm_object
structures in the bhyve code constitutes something of an abstraction
violation, though it's reasonable to proceed this way while working on a
prototype of the feature.  That is, I think you should keep trying your
current approach, but just be aware that you are using the copy-on-write
mechanism in a way that the VM system isn't really expecting.

> Is there another way I could inspect what pages were created between the
> moment I mark an object (its vm_map_entry) as copy-on-write and a later
> moment?