[GitHub] [cloudstack] joseflauzino opened a new issue #5935: Persistence of VM stats

GitBox Fri, 04 Feb 2022 04:12:36 -0800


joseflauzino opened a new issue #5935:
URL: https://github.com/apache/cloudstack/issues/5935



   ##### ISSUE TYPE
   
    * Enhancement Request
   
   ##### COMPONENT NAME
   
   ~~~
   StatsCollector
   ~~~
   
   ##### CLOUDSTACK VERSION
   
   ~~~
   4.17
   ~~~
   
   ##### SUMMARY
   
   This spec changes the way Apache CloudStack collects and stores the VM stats 
to make the data more consistent and provide historical data.
   
   ------
   
   # Table of Contents
   
   1.  [Problem description](#problem-description)
       1.1. [Current collecting/storing data workflows and possible 
configurations](#current-collecting/storing-data-workflows-and-possible-configurations)
       1.2. [Current data cleaning workflow](#current-data-cleaning-workflow)
       1.3. [Current API](#current-api)
   2. [Proposed changes](#proposed-changes)
       2.1. [Proposed collecting/storing data 
workflow](#proposed-collecting/storing-data-workflow)
       2.2. [Configuration proposal](#configuration-proposal)
       2.3. [Data cleaning proposal](#data-cleaning-proposal)
       2.4. [New API proposal](#new-api-proposal)
       2.5. [UI adjustment proposal](#ui-adjustment-proposal)
   3. [Work items](#work-items)
       3.1. [Database tables](#database-tables)
       3.2. [Global configurations](#global-configurations)
       3.3. [API](#api)
       3.4. [UI](#ui)
   4. [Future works](#future-works)
   
   ------
   
   # 1. Problem description
   
   In Apache CloudStack (ACS), VM stats are collected by Management Servers. 
Currently, each Management Server collects the data independently and stores it 
only in (primary) memory. This model of collecting and storing VM stats results 
in some limitations, numbered as follows:
   
   1.  When a Management Server is restarted (or crashes), the VM stats data 
is lost (since there is no data persistence);
   
   2.  When the cloud is composed of multiple Management Servers, each one of 
them can show different data about the VMs, as there is no centralization or 
synchronization of the data collected by different Management Servers;
   
   3.  It is not possible to obtain historical data. The reasons for this are: 
i) ACS stores either the accumulated (aggregated) value of the collected data or 
only the most recently collected data point (see Section 
[1.1](#current-collecting/storing-data-workflows-and-possible-configurations) 
for details); ii) even if each Management Server stored multiple collected data 
points and presented its own history (a per-server view, because of 
limitation 2), there would be no guarantee that data from a given period 
exists (see limitation 1).
   
   The next subsections describe in more detail how the collection of VM stats 
is currently designed and implemented by ACS. Only the most relevant points for 
this spec are presented.
   
   ## 1.1. Current collecting/storing data workflows and possible configurations
   
   Currently, each Management Server performs its own VM stats collection. This 
data is collected only from VMs that are running. The collected data is only 
stored in a concurrent hash map in memory, where keys are VM IDs and values are 
stats. Since there is no data being shared or synced between Management 
Servers, the stats about a VM can be different in each one of them.
   
   It is possible to configure the interval between data gathering with the 
global configuration `vm.stats.interval`, which is defined in milliseconds.
   
   The global configuration `vm.stats.increment.metrics.in.memory` (which is 
set by a boolean value) allows operators to define whether i) data should be 
stored incrementally (*i.e.*, accumulating the data); or ii) in such a way as 
to keep only the data from the most recent collection (*i.e.*, a data 
replacement).
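
   The two storage modes can be sketched as follows. This is a minimal 
illustration, not the actual `StatsCollector` code: the class names, the two 
stat fields, and the choice of which fields get accumulated are all 
assumptions made for the example.

```python
from dataclasses import dataclass
from threading import Lock
from typing import Dict, Optional


@dataclass
class VmStats:
    """Hypothetical subset of the stats kept per VM (illustration only)."""
    disk_read_kbs: float = 0.0
    network_read_kbs: float = 0.0


class InMemoryStatsStore:
    """Per-Management-Server map from VM ID to stats; lost on restart."""

    def __init__(self, increment_metrics: bool):
        # Mirrors the boolean `vm.stats.increment.metrics.in.memory`.
        self.increment_metrics = increment_metrics
        self._stats: Dict[int, VmStats] = {}
        self._lock = Lock()  # stands in for a concurrent hash map

    def record(self, vm_id: int, collected: VmStats) -> None:
        with self._lock:
            previous = self._stats.get(vm_id)
            if self.increment_metrics and previous is not None:
                # Incremental mode: add the new values to the stored ones.
                self._stats[vm_id] = VmStats(
                    previous.disk_read_kbs + collected.disk_read_kbs,
                    previous.network_read_kbs + collected.network_read_kbs,
                )
            else:
                # Replacement mode: keep only the most recent collection.
                self._stats[vm_id] = collected

    def get(self, vm_id: int) -> Optional[VmStats]:
        with self._lock:
            return self._stats.get(vm_id)
```

   In both modes a single value per VM survives each round, which is why no 
history can be reconstructed from this map.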
   
   Figure 1 illustrates the current collecting and storing data workflows.
   
   
![current-vm-stats-collection](https://user-images.githubusercontent.com/17031007/152525967-3f41b4f2-84e8-4219-ae5a-2a9f25f100da.png)
   
   **Figure 1:** The current workflow to collect and store VM stats performed 
periodically for each Management Server.
   
   ## 1.2. Current data cleaning workflow
   
   In the latest ACS release (4.16.0), no cleanup of VM stats data is 
performed, which leads Management Servers to continue to show them even for VMs 
that are no longer running (*e.g.*, VMs that have changed to states such as 
'stopping', 'stopped', 'destroyed', 'expunging', and so on). PR 
[\#5633](https://github.com/apache/cloudstack/pull/5633), already approved and 
merged, addresses the issue of data cleaning considering the current collecting 
and storing VM stats workflow (*i.e.*, the cleanup is done with no concern for 
providing historical data).
   
   ## 1.3. Current API
   
   The currently implemented API, 
[*listVirtualMachinesMetrics*](https://cloudstack.apache.org/api/apidocs-4.16/apis/listVirtualMachinesMetrics.html),
 simply extends the 
[*listVirtualMachines*](https://cloudstack.apache.org/api/apidocs-4.16/apis/listVirtualMachines.html)
 API, so it inherits all of its parameters, even those that are not 
suitable or useful for its purpose. Also, although the official documentation 
states that only tags related to metrics are returned, the current API returns 
all the same information as the *listVirtualMachines* API. Finally, if the 
*listVirtualMachinesMetrics* API is called with a `details` parameter whose 
comma-separated list does not include the `stats` attribute, it does not 
return the VM stats because, again, it behaves the same as the 
*listVirtualMachines* API.
   
   # 2. Proposed changes
   
   This spec proposes to change the way ACS collects and stores the VM stats. 
The intent is to make the data presented by Management Servers more consistent 
and also provide historical data. The proposed changes are described in the 
next subsections.
   
   ## 2.1. Proposed collecting/storing data workflow
   
   For storage, our proposal is to persist the VM stats in the database 
(MySQL); in the future, operators could be given the option to choose 
between different metrics storage backends such as InfluxDB, Mongo, and 
so on. In addition, the data will be stored in a *collected data point* format. 
In this context, a *collected data point* represents a single collection of all 
stats for a specific VM, performed by a given Management Server. Each 
*collected data point* will have a timestamp that indicates when the collection 
was performed. The data collection will continue to work the same way: at each 
collection round, each Management Server collects the stats from all running 
VMs.
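
   A *collected data point* could be modeled roughly as below. This is only a 
sketch: the field names follow the `vm_stats` columns proposed in Table 3, 
while the class name, the `collect_round` helper, and the shape of its input 
are hypothetical.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class CollectedDataPoint:
    """One collection of all stats for one VM by one Management Server."""
    vm_id: int
    mgmt_server_id: int
    timestamp: datetime   # when the collection was performed
    vm_stats_data: str    # the collected stats, serialized as JSON


def collect_round(mgmt_server_id: int,
                  running_vm_stats: Dict[int, dict],
                  now: datetime) -> List[CollectedDataPoint]:
    """Build one data point per running VM for a single collection round.

    `running_vm_stats` maps VM ID to a dict of stats; how those stats are
    obtained from the hypervisor is out of scope for this sketch.
    """
    return [
        CollectedDataPoint(vm_id, mgmt_server_id, now,
                           json.dumps(stats, sort_keys=True))
        for vm_id, stats in running_vm_stats.items()
    ]


points = collect_round(
    mgmt_server_id=7,
    running_vm_stats={1: {"cpuused": 12.5}, 2: {"cpuused": 3.0}},
    now=datetime(2022, 2, 4, 12, 0, tzinfo=timezone.utc),
)
```

   Because every round appends new immutable points instead of overwriting a 
single value, the history of each VM is preserved.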
   
   This new approach will allow ACS users to obtain historical data. Also, it 
will logically centralize the data. Thus, all Management Servers will show the 
same data about each VM.
   
   Figure 2 illustrates the proposal for the new data collection and storage 
workflow.
   
   
![proposed-vm-stats-collection](https://user-images.githubusercontent.com/17031007/152526181-94c3de87-f0ab-4bdf-990b-43038953b75f.png)
   
   **Figure 2:** The proposed workflow to collect and store VM stats performed 
periodically for each Management Server.
   
   ## 2.2. Configuration proposal
   
   Since the data will now be persisted in the database instead of being kept 
only in primary memory, we propose to rename the global configuration 
`vm.stats.increment.metrics.in.memory` to just `vm.stats.increment.metrics`. We 
also propose that this configuration no longer control how data is stored, 
since data will always be stored in the *collected data point* format (never 
incrementally). Instead, this configuration will indicate how data is 
returned by the API by default (see subsections [2.4](#new-api-proposal) and 
[3.3](#api) for details).
   
   We also propose to create a new global configuration called 
`vm.stats.max.retention.time`. It defines how long the *collected data points* 
should be stored, so that the oldest records can be automatically deleted once 
their time to live (TTL) is reached.
   
   Finally, we propose that the VM stats collection process be disabled by 
setting the global configuration `vm.stats.interval` to 0 or less than 0.
   
   ## 2.3. Data cleaning proposal
   
   We propose two types of data cleanup processes. The first one automatically 
removes old records, which are *collected data points* that have a timestamp 
indicating that the time limit set in the global configuration 
`vm.stats.max.retention.time` has been exceeded. If 
`vm.stats.max.retention.time` is set to 0 or less than 0, then this automatic 
removal process will be disabled. The second cleanup process removes all 
*collected data points* related to VMs that were destroyed. Therefore, the 
cleaning mechanisms added by PR 
[\#5633](https://github.com/apache/cloudstack/pull/5633) in order to remove 
stats for VMs that are no longer running will be removed.
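
   The two cleanup processes can be sketched as plain filters over a list of 
data points. The function names and the dict-based data points are 
hypothetical, and since the spec does not state the unit of 
`vm.stats.max.retention.time`, the sketch takes a `timedelta` instead of 
assuming one.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Set


def purge_expired(points: Iterable[dict],
                  retention: timedelta,
                  now: datetime) -> List[dict]:
    """Drop data points older than the retention time.

    A retention of zero or less disables the automatic removal, as proposed
    for `vm.stats.max.retention.time`.
    """
    if retention <= timedelta(0):
        return list(points)
    cutoff = now - retention
    return [p for p in points if p["timestamp"] >= cutoff]


def purge_destroyed(points: Iterable[dict],
                    destroyed_vm_ids: Set[int]) -> List[dict]:
    """Drop every data point that belongs to a destroyed VM."""
    return [p for p in points if p["vm_id"] not in destroyed_vm_ids]
```

   In a database-backed implementation both filters would presumably become 
`DELETE` statements; the in-memory form is used here only to keep the 
example self-contained.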
   
   ## 2.4. New API proposal
   
   For compatibility reasons, we propose to keep the current API and create a 
new one to handle historical reporting of VM stats. The current API, 
*listVirtualMachinesMetrics*, will have only minimal changes to work with the 
new data storage mode (see subsection [3.3](#api) for details). The new API, 
called *listVirtualMachinesUsageHistory*, allows ACS users to get historical 
data filtered by specific time periods. For this, the API has the parameters 
`startdate` and `enddate`, which allow ACS users to do 4 different types of 
filtering:
   
   -   Get all VM stats **starting at** a given time (by passing only the 
`startdate` parameter);
   
   -   Get all VM stats **up to** a given time (by passing only the `enddate` 
parameter);
   
   -   Get all VM stats **from a specific time range** (by passing both the 
`startdate` and `enddate` parameters, so that `startdate` is before `enddate`);
   
   -   Get all VM stats **with a specific timestamp** (by passing both the 
`startdate` and `enddate` parameters, so that `startdate` equals
       `enddate`).
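
   The four filtering cases reduce to one predicate with two optional bounds. 
The sketch below is illustrative only: `filter_by_period` and the dict-based 
data points are hypothetical, and both bounds are treated as inclusive, which 
is what the "`startdate` equals `enddate`" case requires.

```python
from datetime import datetime
from typing import Iterable, List, Optional


def filter_by_period(points: Iterable[dict],
                     startdate: Optional[datetime] = None,
                     enddate: Optional[datetime] = None) -> List[dict]:
    """Keep data points whose timestamp lies in [startdate, enddate].

    A bound that is not passed (None) is simply not applied.
    """
    return [
        p for p in points
        if (startdate is None or p["timestamp"] >= startdate)
        and (enddate is None or p["timestamp"] <= enddate)
    ]


t1, t2, t3 = (datetime(2022, 2, day) for day in (1, 2, 3))
history = [{"timestamp": t1}, {"timestamp": t2}, {"timestamp": t3}]

starting_at = filter_by_period(history, startdate=t2)               # t2, t3
up_to = filter_by_period(history, enddate=t2)                       # t1, t2
in_range = filter_by_period(history, startdate=t1, enddate=t2)      # t1, t2
exact = filter_by_period(history, startdate=t2, enddate=t2)         # t2
```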
   
   In addition, it is possible to combine these parameters with other 
parameters offered by the API (see all parameters in Table 1). This API returns 
just the stats data and essential information to identify the VMs. All response 
tags are described in Table 2.
   
   | **Parameter Name** | **Description** |
   | ------------------ | --------------- |
   | id                 | The ID of the virtual machine. |
   | ids                | The IDs of the virtual machines, mutually exclusive with `id`. |
   | keyword            | List by keyword. |
   | page               | The page number. |
   | pagesize           | The page size. |
   | name               | Name of the virtual machine (a substring match is made against the parameter value; data for all matching VMs will be returned). |
   | startdate          | Start date to filter VM stats. |
   | enddate            | End date to filter VM stats. |
   
   **Table 1:** The *listVirtualMachinesUsageHistory* request parameters.
   
   | **Response Name**     | **Description** |
   | --------------------- | --------------- |
   | id                    | The ID of the virtual machine. |
   | name                  | The name of the virtual machine. |
   | stats (*)             | The virtual machine stats. |
   |&nbsp;&nbsp; timestamp         | The time when the stats were collected. |
   |&nbsp;&nbsp; cpuused           | The amount (percentage) of the VM's CPU currently used. |
   |&nbsp;&nbsp; diskioread        | The read (I/O) of disk on the VM. |
   |&nbsp;&nbsp; diskiowrite       | The write (I/O) of disk on the VM. |
   |&nbsp;&nbsp; diskread          | The disk read in MiB. |
   |&nbsp;&nbsp; diskwrite         | The disk write in MiB. |
   |&nbsp;&nbsp; diskkbsread       | The read (bytes) of disk on the VM. |
   |&nbsp;&nbsp; diskkbswrite      | The write (bytes) of disk on the VM. |
   |&nbsp;&nbsp; memoryintfreekbs  | The internal memory that is free in the VM, or zero if it cannot be calculated. |
   |&nbsp;&nbsp; memorykbs         | The memory used by the VM in KiB. |
   |&nbsp;&nbsp; memorytargetkbs   | The target memory of the VM in KiB. |
   |&nbsp;&nbsp; networkread       | The network read in MiB. |
   |&nbsp;&nbsp; networkwrite      | The network write in MiB. |
   |&nbsp;&nbsp; networkkbsread    | The incoming network traffic on the VM. |
   |&nbsp;&nbsp; networkkbswrite   | The outgoing network traffic on the host. |
   
   **Table 2:** The *listVirtualMachinesUsageHistory* response tags.
   
   ## 2.5. UI adjustment proposal
   
   The UI continues to consume the same API (*listVirtualMachinesMetrics*) to 
show VM stats. The only change is that it now only shows stats data for VMs 
with the *running* state.
   
   # 3. Work items
   
   This section describes all work items to implement the proposal.
   
   ## 3.1. Database tables
   
   No existing tables are modified. Only one new table is created: 
`vm_stats`, where each record represents a *collected data point*.
   
   | **Column**        | **Nullable** | **Updatable** | **Description** |
   | ----------------- | ------------ | ------------- | --------------- |
   | id                | No           | No            | To identify the *collected data point*. |
   | vm_id             | No           | No            | To identify the related VM. |
   | mgmt_server_id    | No           | No            | Indicates which Management Server collected the data. |
   | timestamp         | No           | No            | Indicates the instant the *collected data point* was created (*i.e.*, when the data was collected). |
   | vm_stats_data     | No           | No            | The collected data in JSON format. This is the same data that is currently stored only in memory. |
   
   **Table 3:** Database table `vm_stats`.
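
   For illustration, the table could be created as in the sketch below. The 
proposal targets MySQL; SQLite is used here only so that the example runs 
without a database server. The column types, the auto-increment key, and the 
`create_vm_stats_table` helper are assumptions, since Table 3 specifies only 
names and nullability.

```python
import sqlite3


def create_vm_stats_table(conn: sqlite3.Connection) -> None:
    """Create a `vm_stats` table mirroring Table 3 (column types are assumed)."""
    conn.execute(
        """
        CREATE TABLE vm_stats (
            id             INTEGER PRIMARY KEY AUTOINCREMENT,
            vm_id          INTEGER NOT NULL,
            mgmt_server_id INTEGER NOT NULL,
            timestamp      TEXT    NOT NULL,  -- ISO 8601 string in this sketch
            vm_stats_data  TEXT    NOT NULL   -- collected stats as JSON
        )
        """
    )


def insert_data_point(conn: sqlite3.Connection, vm_id: int,
                      mgmt_server_id: int, timestamp: str,
                      vm_stats_data: str) -> None:
    """Append one collected data point; rows are never updated afterwards."""
    conn.execute(
        "INSERT INTO vm_stats (vm_id, mgmt_server_id, timestamp, vm_stats_data) "
        "VALUES (?, ?, ?, ?)",
        (vm_id, mgmt_server_id, timestamp, vm_stats_data),
    )


conn = sqlite3.connect(":memory:")
create_vm_stats_table(conn)
insert_data_point(conn, 1, 7, "2022-02-04T12:00:00Z", '{"cpuused": 12.5}')
```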
   
   ## 3.2. Global configurations
   
   -   Rename the global configuration `vm.stats.increment.metrics.in.memory` 
to `vm.stats.increment.metrics`;
   
   -   Create the global configuration `vm.stats.max.retention.time`;
   
   -   Change `StatsCollector` to disable the automatic removal process of VM 
stats records when the global configuration `vm.stats.max.retention.time` is 
set to 0 or less than 0;
   
   -   Change `StatsCollector` to disable the VM stats collection when the 
global configuration `vm.stats.interval` is set to 0 or less than 0.
   
   ## 3.3. API
   
   -   Adjust the *listVirtualMachinesMetrics* API to get data from the 
database instead of the in-memory map;
   
   -   Add a new parameter called `accumulate` (set by a boolean value) to the 
*listVirtualMachinesMetrics* API that allows ACS users to force the API to return 
data in either accumulative or non-accumulative mode. This overrides the global 
configuration `vm.stats.increment.metrics`. When the `accumulate` parameter is 
not passed, stats are returned according to the global configuration 
`vm.stats.increment.metrics`;
   
   -   Create the new API *listVirtualMachinesUsageHistory* with all request 
parameters described in Table 1 and all response tags described in Table 2;
   
   -   Annotate the *listVirtualMachinesMetrics* API as deprecated so that in 
the future it can be replaced by the new API.
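
   The precedence between the `accumulate` parameter and the global 
configuration can be sketched as follows. The helper names are hypothetical, 
and summing the stored data points is only one possible reading of 
"accumulative mode"; the spec does not state how accumulated values are 
derived from the persisted data.

```python
from typing import Dict, List, Optional


def resolve_accumulate(accumulate_param: Optional[bool],
                       global_increment_metrics: bool) -> bool:
    """The request parameter, when passed, overrides the global setting."""
    if accumulate_param is None:
        return global_increment_metrics
    return accumulate_param


def metrics_for_vm(points: List[Dict[str, float]],
                   accumulate: bool) -> Optional[Dict[str, float]]:
    """Return stats for one VM from its data points, ordered oldest first.

    Non-accumulative mode returns the most recent data point. Accumulative
    mode sums every field across all points (an assumption of this sketch).
    """
    if not points:
        return None
    if not accumulate:
        return dict(points[-1])
    totals: Dict[str, float] = {}
    for point in points:
        for key, value in point.items():
            totals[key] = totals.get(key, 0) + value
    return totals
```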
   
   ## 3.4. UI
   
   -   Adjust the UI to show only the most recent stats for each VM;
   
   -   Adjust the UI to not show stats for VMs that are no longer running, even 
though the API returns the historical stats data for those VMs.
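
   Both UI adjustments amount to picking, for each running VM, the data point 
with the latest timestamp. A minimal sketch, assuming dict-based data points 
and a hypothetical `vm_states` map whose running state is spelled 
`"Running"`:

```python
from datetime import datetime
from typing import Dict, Iterable


def latest_stats_for_running_vms(points: Iterable[dict],
                                 vm_states: Dict[int, str]) -> Dict[int, dict]:
    """Map each running VM's ID to its most recent data point."""
    latest: Dict[int, dict] = {}
    for point in points:
        vm_id = point["vm_id"]
        if vm_states.get(vm_id) != "Running":
            continue  # historical data may exist, but it is not displayed
        current = latest.get(vm_id)
        if current is None or point["timestamp"] > current["timestamp"]:
            latest[vm_id] = point
    return latest
```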
   
   # 4. Future works
   
   -   Implement new views, in UI, to show history of VM stats;
   
   -   Evaluate if there are other useful parameters to add to the 
*listVirtualMachinesUsageHistory* API.

