Jenkins.impala.io back up

2018-02-14 Thread Zachary Amsden
Everything went as planned and Jenkins is now upgraded.

 - Zach


Heads Up - Jenkins shut down.

2018-02-14 Thread Zachary Amsden
I am going to upgrade jenkins.impala.io to address a security advisory; no
GVOs will run for a bit.

I hope to be done in 30 minutes or less.

 - Zach


Re: Review Request 64317: SENTRY-2085: Keep sentry exceptions contained within Sentry

2017-12-29 Thread Zachary Amsden via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64317/
---

(Updated Dec. 29, 2017, 11:30 p.m.)


Review request for sentry and Na Li.


Changes
---

username -> userName.  Tests now pass.


Repository: sentry


Description
---

Instead of leaking new exceptions outside the API, use the
existing authorization exceptions to indicate authorization
failure when a user has no group configured.
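
As a rough illustration of the approach (a hedged sketch, not the actual
patch; groupMappingService is a placeholder name), the pattern is to catch
the Sentry-internal exception at the API boundary and convert it into an
existing authorization exception:

```java
Set<String> groups;
try {
  groups = groupMappingService.getGroups(userName);
} catch (SentryGroupNotFoundException e) {
  // A user with no configured group is reported as an authorization
  // failure instead of leaking SentryGroupNotFoundException to callers.
  throw new SentryAccessDeniedException(
      "Access denied: no groups found for user " + userName);
}
```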


Diffs (updated)
-

  
sentry-binding/sentry-binding-hive-common/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBinding.java
 8ce7a02 
  
sentry-binding/sentry-binding-hive/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBindingHookBase.java
 9c60c22 
  
sentry-binding/sentry-binding-hive/src/test/java/org/apache/sentry/binding/hive/TestHiveAuthzBindings.java
 a41d1bd 
  
sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SentrySolrPluginImpl.java
 91d08f0 
  
sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SolrAuthzBinding.java
 803e5ea 
  
sentry-binding/sentry-binding-solr/src/test/java/org/apache/sentry/binding/solr/TestSolrAuthzBinding.java
 f060b82 
  
sentry-core/sentry-core-common/src/main/java/org/apache/sentry/core/common/exception/SentryGroupNotFoundException.java
 b978df6 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/AuthorizationProvider.java
 2d82bcf 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/GroupMappingService.java
 7e85261 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/HadoopGroupMappingService.java
 bde53d5 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/ResourceAuthorizationProvider.java
 005724f 
  
sentry-provider/sentry-provider-common/src/test/java/org/apache/sentry/provider/common/TestNoAuthorizationProvider.java
 fe01b06 
  
sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/provider/db/service/thrift/SentryPolicyStoreProcessor.java
 650880b 
  
sentry-provider/sentry-provider-db/src/test/java/org/apache/sentry/provider/db/generic/service/thrift/TestSentryGenericPolicyProcessor.java
 6597a7c 
  
sentry-provider/sentry-provider-file/src/main/java/org/apache/sentry/provider/file/LocalGroupMappingService.java
 5447420 
  
sentry-provider/sentry-provider-file/src/test/java/org/apache/sentry/provider/file/TestLocalGroupMapping.java
 9864b82 
  
sentry-solr/solr-sentry-handlers/src/main/java/org/apache/solr/handler/component/QueryDocAuthorizationComponent.java
 2338ab8 
  
sentry-tests/sentry-tests-hive-v2/src/test/java/org/apache/sentry/tests/e2e/hive/TestUserManagement.java
 02ac514 
  
sentry-tests/sentry-tests-hive/src/test/java/org/apache/sentry/tests/e2e/hive/TestUserManagement.java
 02ac514 


Diff: https://reviews.apache.org/r/64317/diff/4/

Changes: https://reviews.apache.org/r/64317/diff/3-4/


Testing
---

Running JUnit tests with mvn install.


Thanks,

Zachary Amsden



Re: Review Request 64317: SENTRY-2085: Keep sentry exceptions contained within Sentry

2017-12-29 Thread Zachary Amsden via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64317/
---

(Updated Dec. 29, 2017, 9:14 p.m.)


Review request for sentry and Na Li.


Repository: sentry


Description
---

Instead of leaking new exceptions outside the API, use the
existing authorization exceptions to indicate authorization
failure when a user has no group configured.


Diffs (updated)
-

  
sentry-binding/sentry-binding-hive-common/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBinding.java
 8ce7a02 
  
sentry-binding/sentry-binding-hive/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBindingHookBase.java
 9c60c22 
  
sentry-binding/sentry-binding-hive/src/test/java/org/apache/sentry/binding/hive/TestHiveAuthzBindings.java
 a41d1bd 
  
sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SentrySolrPluginImpl.java
 91d08f0 
  
sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SolrAuthzBinding.java
 803e5ea 
  
sentry-binding/sentry-binding-solr/src/test/java/org/apache/sentry/binding/solr/TestSolrAuthzBinding.java
 f060b82 
  
sentry-core/sentry-core-common/src/main/java/org/apache/sentry/core/common/exception/SentryGroupNotFoundException.java
 b978df6 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/AuthorizationProvider.java
 2d82bcf 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/GroupMappingService.java
 7e85261 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/HadoopGroupMappingService.java
 bde53d5 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/ResourceAuthorizationProvider.java
 005724f 
  
sentry-provider/sentry-provider-common/src/test/java/org/apache/sentry/provider/common/TestNoAuthorizationProvider.java
 fe01b06 
  
sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/provider/db/service/thrift/SentryPolicyStoreProcessor.java
 650880b 
  
sentry-provider/sentry-provider-db/src/test/java/org/apache/sentry/provider/db/generic/service/thrift/TestSentryGenericPolicyProcessor.java
 6597a7c 
  
sentry-provider/sentry-provider-file/src/main/java/org/apache/sentry/provider/file/LocalGroupMappingService.java
 5447420 
  
sentry-provider/sentry-provider-file/src/test/java/org/apache/sentry/provider/file/TestLocalGroupMapping.java
 9864b82 
  
sentry-solr/solr-sentry-handlers/src/main/java/org/apache/solr/handler/component/QueryDocAuthorizationComponent.java
 2338ab8 
  
sentry-tests/sentry-tests-hive-v2/src/test/java/org/apache/sentry/tests/e2e/hive/TestUserManagement.java
 02ac514 
  
sentry-tests/sentry-tests-hive/src/test/java/org/apache/sentry/tests/e2e/hive/TestUserManagement.java
 02ac514 


Diff: https://reviews.apache.org/r/64317/diff/3/

Changes: https://reviews.apache.org/r/64317/diff/2-3/


Testing
---

Running JUnit tests with mvn install.


Thanks,

Zachary Amsden



Re: Review Request 64317: SENTRY-2085: Keep sentry exceptions contained within Sentry

2017-12-29 Thread Zachary Amsden via Review Board


> On Dec. 19, 2017, 9:26 p.m., kalyan kumar kalvagadda wrote:
> > sentry-binding/sentry-binding-hive/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBindingHookBase.java
> > Line 837 (original), 837 (patched)
> > <https://reviews.apache.org/r/64317/diff/2/?file=1908499#file1908499line838>
> >
> > This is not the appropriate error to log.
> > 
> > SemanticException is not the right exception; please throw an
> > appropriate exception and handle it.
> 
> Zachary Amsden wrote:
> I'm confused by review board - I don't see the line you are mentioning 
> being changed, nor do I remember writing any code related to 
> SemanticException - what line in particular is this about?
> 
> kalyan kumar kalvagadda wrote:
> Moving the try and bringing the logic below into the try block:
> 
> Set<String> userPrivileges = authProvider.getPolicyEngine().getPrivileges(
>     authProvider.getGroupMapping().getGroups(userName), Sets.newHashSet(userName),
>     hiveAuthzBinding.getActiveRoleSet(), hiveAuthzBinding.getAuthServer());
> 
> By doing that, this method would throw SemanticException when getGroups
> in the code above throws SentryGroupNotFoundException.

I think you misread: instead of re-throwing the exception, I swallow it and do
the user privilege lookup with an empty group set (which I believe is the right
behavior if you want user-based authorization).

```java
try {
  Set<String> groups;
  try {
    groups = authProvider.getGroupMapping().getGroups(userName);
  } catch (SentryGroupNotFoundException e) {
    // Swallow the exception: no groups simply means no group-derived privileges.
    groups = Collections.emptySet();
    LOG.debug("Could not find groups for user: " + userName);
  }
  Set<String> userPrivileges =
      authProvider.getPolicyEngine().getPrivileges(groups, Sets.newHashSet(userName),
          hiveAuthzBinding.getActiveRoleSet(), hiveAuthzBinding.getAuthServer());
  // ... (remainder of the original try block unchanged)
```


> On Dec. 19, 2017, 9:26 p.m., kalyan kumar kalvagadda wrote:
> > sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/ResourceAuthorizationProvider.java
> > Lines 99 (patched)
> > <https://reviews.apache.org/r/64317/diff/2/?file=1908508#file1908508line99>
> >
> > You cannot silently consume the exception here. It should be thrown to
> > the caller, where it can create an appropriate error message.
> > 
> > If you consume this exception, an AuthorizationException will be thrown
> > with a message saying the user does not have privileges to perform the
> > operation, which is not the case.
> 
> Zachary Amsden wrote:
> This should not throw at all - hasAccess is a public API.  If the
> user has no known group affiliation then, based on the group permissions,
> there should be no access.  Exactly why can be probed with other calls
> (getGroups).
> 
> Re-throwing the exception from here virally introduces it into all of
> the upper layers and rapidly balloons into exactly the nightmare this
> patch is trying to avoid.
> 
> kalyan kumar kalvagadda wrote:
> I just looked into the Impala code. I see what you are saying.
> 
> My assumption when I gave the above comment was that hasAccess is called
> from the Sentry bindings, and I was suggesting that the bindings should
> handle the SentryGroupNotFoundException.
> 
> I now see that the Impala code calls hasAccess directly.
> 
> But my concern is still valid: we would be throwing an Authorization
> exception without a proper reason.
> 
> kalyan kumar kalvagadda wrote:
> At least add an error stating that access is denied because no group was
> found for the user. That will help in troubleshooting.

Done.  I think I forgot to update the diff; for some reason my local
version has this logging whereas Review Board does not.


- Zachary


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64317/#review194177
---


On Dec. 5, 2017, 12:55 a.m., Zachary Amsden wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64317/
> ---
> 
> (Updated Dec. 5, 2017, 12:55 a.m.)
> 
> 
> Review request for sentry and Na Li.
> 
> 
> Repository: sentry
> 
> 
> Description
> ---
> 
> Instead of leaking new exceptions outside the API, use the
> existing authorization exceptions to indicate authorization
> failure when

Re: Review Request 64317: SENTRY-2085: Keep sentry exceptions contained within Sentry

2017-12-20 Thread Zachary Amsden via Review Board


> On Dec. 19, 2017, 9:26 p.m., kalyan kumar kalvagadda wrote:
> > sentry-binding/sentry-binding-hive/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBindingHookBase.java
> > Line 837 (original), 837 (patched)
> > <https://reviews.apache.org/r/64317/diff/2/?file=1908499#file1908499line838>
> >
> > This is not the appropriate error to log.
> > 
> > SemanticException is not the right exception; please throw an
> > appropriate exception and handle it.

I'm confused by review board - I don't see the line you are mentioning being 
changed, nor do I remember writing any code related to SemanticException - what 
line in particular is this about?


> On Dec. 19, 2017, 9:26 p.m., kalyan kumar kalvagadda wrote:
> > sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/ResourceAuthorizationProvider.java
> > Lines 99 (patched)
> > <https://reviews.apache.org/r/64317/diff/2/?file=1908508#file1908508line99>
> >
> > You cannot silently consume the exception here. It should be thrown to
> > the caller, where it can create an appropriate error message.
> > 
> > If you consume this exception, an AuthorizationException will be thrown
> > with a message saying the user does not have privileges to perform the
> > operation, which is not the case.

This should not throw at all - hasAccess is a public API.  If the user has
no known group affiliation then, based on the group permissions, there
should be no access.  Exactly why can be probed with other calls
(getGroups).

Re-throwing the exception from here virally introduces it into all of the
upper layers and rapidly balloons into exactly the nightmare this patch is
trying to avoid.
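
To make the containment concrete, here is a minimal sketch of the pattern
(hedged: the method body is simplified, and doHasAccess stands in for the
real privilege evaluation):

```java
public boolean hasAccess(Subject subject, List<? extends Authorizable> hierarchy,
    Set<? extends Action> actions, ActiveRoleSet roleSet) {
  Set<String> groups;
  try {
    groups = getGroupMapping().getGroups(subject.getName());
  } catch (SentryGroupNotFoundException e) {
    // An unknown group affiliation yields no group-derived privileges;
    // the public API itself never throws.
    LOGGER.debug("No groups found for user: " + subject.getName());
    groups = Collections.emptySet();
  }
  return doHasAccess(subject, groups, hierarchy, actions, roleSet);
}
```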


> On Dec. 19, 2017, 9:26 p.m., kalyan kumar kalvagadda wrote:
> > sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/provider/db/service/thrift/SentryPolicyStoreProcessor.java
> > Lines 752 (patched)
> > <https://reviews.apache.org/r/64317/diff/2/?file=1908510#file1908510line752>
> >
> > Why are you using NoSuchObject here when you are using AccessDenied in
> > the rest of the places?
> > 
> > Any specific reason?
> > 
> > I think it should be the same as elsewhere.

This API requests the groups of another user.  AccessDenied makes sense
when the *requesting* user has no privilege that can be established by
group membership, but AccessDenied when the *requested* user has no group
membership doesn't.  The group membership set should simply be empty IMO,
but I took the more conservative approach of keeping the existing behavior
of throwing and tried to return a more appropriate exception.  I'm happy to
go in any direction on this, though; I just wanted to explain my reasoning.
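
A sketch of that distinction (hedged; the variable names are illustrative):

```java
// The *requested* user having no group mapping is "no such object",
// not an authorization failure for the *requesting* user.
Set<String> groups;
try {
  groups = groupMappingService.getGroups(requestedUserName);
} catch (SentryGroupNotFoundException e) {
  throw new SentryNoSuchObjectException(
      "Unable to obtain groups for user " + requestedUserName);
}
```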


- Zachary


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64317/#review194177
---


On Dec. 5, 2017, 12:55 a.m., Zachary Amsden wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64317/
> ---
> 
> (Updated Dec. 5, 2017, 12:55 a.m.)
> 
> 
> Review request for sentry and Na Li.
> 
> 
> Repository: sentry
> 
> 
> Description
> ---
> 
> Instead of leaking new exceptions outside the API, use the
> existing authorization exceptions to indicate authorization
> failure when a user has no group configured.
> 
> 
> Diffs
> -
> 
>   
> sentry-binding/sentry-binding-hive-common/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBinding.java
>  8ce7a02ed4c565e34229a5c80c1b4fd1a84bad19 
>   
> sentry-binding/sentry-binding-hive/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBindingHookBase.java
>  9c60c22aac826affd05cdf28b3816c68c139326d 
>   
> sentry-binding/sentry-binding-hive/src/test/java/org/apache/sentry/binding/hive/TestHiveAuthzBindings.java
>  a41d1bd533157c96430c3bf3569e1612db77c7b2 
>   
> sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SentrySolrPluginImpl.java
>  91d08f0bc7f344c87e5bfb1e11b4b68728e676be 
>   
> sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SolrAuthzBinding.java
>  803e5eabf322cd120456a78c57f127ed4c94f5fc 
>   
> sentry-binding/sentry-binding-solr/src/test/java/org/apache/sentry/binding/solr/TestSolrAuthzBinding.java
>  f060b82da44f642e9a1dbff86e6e834fbc09cb2b 
>   
> sentry-core/sentry-core-common/src/main/j

Re: Review Request 64317: SENTRY-2085: Keep sentry exceptions contained within Sentry

2017-12-14 Thread Zachary Amsden via Review Board


> On Dec. 14, 2017, 8:41 p.m., Na Li wrote:
> > sentry-provider/sentry-provider-common/src/test/java/org/apache/sentry/provider/common/TestNoAuthorizationProvider.java
> > Lines 42 (patched)
> > <https://reviews.apache.org/r/64317/diff/2/?file=1908509#file1908509line42>
> >
> > This is not needed

Since this is now a checked exception, we need to catch it even if we
expect it not to be thrown.  I can drop the assert, but it seems better to
keep it, since we really don't want that exception.
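
For example, the test-side pattern looks roughly like this (a hedged
sketch, not the exact test; the provider variable and the expected empty
result are illustrative):

```java
try {
  Set<String> groups = provider.getGroupMapping().getGroups("user1");
  Assert.assertTrue(groups.isEmpty());
} catch (SentryGroupNotFoundException e) {
  // The checked exception must be caught even though we expect no throw;
  // failing here documents that expectation.
  Assert.fail("Unexpected SentryGroupNotFoundException: " + e.getMessage());
}
```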


> On Dec. 14, 2017, 8:41 p.m., Na Li wrote:
> > sentry-provider/sentry-provider-file/src/test/java/org/apache/sentry/provider/file/TestLocalGroupMapping.java
> > Lines 66 (patched)
> > <https://reviews.apache.org/r/64317/diff/2/?file=1908513#file1908513line66>
> >
> > We don't need this

Ditto.


> On Dec. 14, 2017, 8:41 p.m., Na Li wrote:
> > sentry-provider/sentry-provider-file/src/test/java/org/apache/sentry/provider/file/TestLocalGroupMapping.java
> > Lines 74 (patched)
> > <https://reviews.apache.org/r/64317/diff/2/?file=1908513#file1908513line74>
> >
> > We don't need this

Ditto.


- Zachary


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64317/#review193839
---


On Dec. 5, 2017, 12:55 a.m., Zachary Amsden wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64317/
> ---
> 
> (Updated Dec. 5, 2017, 12:55 a.m.)
> 
> 
> Review request for sentry and Na Li.
> 
> 
> Repository: sentry
> 
> 
> Description
> ---
> 
> Instead of leaking new exceptions outside the API, use the
> existing authorization exceptions to indicate authorization
> failure when a user has no group configured.
> 
> 
> Diffs
> -
> 
>   
> sentry-binding/sentry-binding-hive-common/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBinding.java
>  8ce7a02ed4c565e34229a5c80c1b4fd1a84bad19 
>   
> sentry-binding/sentry-binding-hive/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBindingHookBase.java
>  9c60c22aac826affd05cdf28b3816c68c139326d 
>   
> sentry-binding/sentry-binding-hive/src/test/java/org/apache/sentry/binding/hive/TestHiveAuthzBindings.java
>  a41d1bd533157c96430c3bf3569e1612db77c7b2 
>   
> sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SentrySolrPluginImpl.java
>  91d08f0bc7f344c87e5bfb1e11b4b68728e676be 
>   
> sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SolrAuthzBinding.java
>  803e5eabf322cd120456a78c57f127ed4c94f5fc 
>   
> sentry-binding/sentry-binding-solr/src/test/java/org/apache/sentry/binding/solr/TestSolrAuthzBinding.java
>  f060b82da44f642e9a1dbff86e6e834fbc09cb2b 
>   
> sentry-core/sentry-core-common/src/main/java/org/apache/sentry/core/common/exception/SentryGroupNotFoundException.java
>  b978df69df1d777311146406278444ae4e7f83ee 
>   
> sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/AuthorizationProvider.java
>  2d82bcfcd5343d1b130df2f723d33a106d36ea81 
>   
> sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/GroupMappingService.java
>  7e85261070f133a6886434732d23d5a72894f8ef 
>   
> sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/HadoopGroupMappingService.java
>  bde53d5f640c98f41dea54d54dfe708ffee5dcd3 
>   
> sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/ResourceAuthorizationProvider.java
>  005724f3e3f8c623c2a266f60825cf77ac1ea777 
>   
> sentry-provider/sentry-provider-common/src/test/java/org/apache/sentry/provider/common/TestNoAuthorizationProvider.java
>  fe01b062c592e17ffa336552986e83f3f5f294e3 
>   
> sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/provider/db/service/thrift/SentryPolicyStoreProcessor.java
>  650880bb682d76c000fa51b497fae484c257b342 
>   
> sentry-provider/sentry-provider-db/src/test/java/org/apache/sentry/provider/db/generic/service/thrift/TestSentryGenericPolicyProcessor.java
>  6597a7ca724d1377ad07d8bc18530eb89b659693 
>   
> sentry-provider/sentry-provider-file/src/main/java/org/apache/sentry/provider/file/LocalGroupMappingService.java
>  54474203aed4868c3bde8450d4d27427fa1de7f6 
>   
> sentry-provider/sentry-provider-file/src/test/java/org/apache/sentry/provider/file/T

Re: Review Request 64317: SENTRY-2085: Keep sentry exceptions contained within Sentry

2017-12-05 Thread Zachary Amsden via Review Board


> On Dec. 5, 2017, 8:28 p.m., Na Li wrote:
> > sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/ResourceAuthorizationProvider.java
> > Line 101 (original), 108 (patched)
> > <https://reviews.apache.org/r/64317/diff/2/?file=1908508#file1908508line108>
> >
> > We are using both the group and the user to get privileges, so we should
> > catch the group-not-found exception so that user-based privileges can
> > still be found. We need to add test cases where a user-based privilege
> > exists but the user does not belong to any group, and verify that the
> > privilege can still be found.

I don't think this is possible to test.  The only way to generate
SentryGroupNotFoundException with the test suite is using
LocalGroupMappingService.  But that requires initialization and setup of a
policy engine, and there is no policy engine which supports adding
user-specific privileges.  I'd have to add support for that to either the
policy engine or the local group mapping service, and both seem like
rather large changes.

A simple test of adding privileges to a group named "user1" did not succeed
in adding privileges to a user named "user1".  Note that this did not throw
an exception, so at least SentryGroupNotFoundException is not getting
raised; it is just that the local policy engine has no idea how to handle
or add user-specific privileges:

```java
  @Test
  public void testUserPrivilegeWithoutGroups() throws Exception {
    Subject user1 = new Subject("user1");
    Server server1 = new Server("server1");
    AccessURI uri = new AccessURI("file:///path/to/");
    Set<DBModelAction> actions =
        EnumSet.of(DBModelAction.ALL, DBModelAction.SELECT, DBModelAction.INSERT);
    policyFile.addRolesToGroup("user1", true, "role1", "role1")
        .addPermissionsToRole("role1", true,
            "server=" + server1.getName() + "->uri=" + uri.getName(),
            "server=" + server1.getName() + "->uri=" + uri.getName());
    policyFile.write(iniFile);
    PolicyEngine policy =
        DBPolicyTestUtil.createPolicyEngineForTest(server1.getName(), initResource);
    authzProvider = new LocalGroupResourceAuthorizationProvider(initResource, policy,
        HivePrivilegeModel.getInstance());
    List<? extends Authorizable> authorizableHierarchy = ImmutableList.of(server1, uri);
    Assert.assertTrue(authorizableHierarchy.toString(),
        authzProvider.hasAccess(user1, authorizableHierarchy, actions, ActiveRoleSet.ALL));
  }
```


- Zachary


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64317/#review192925
---


On Dec. 5, 2017, 12:55 a.m., Zachary Amsden wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64317/
> ---
> 
> (Updated Dec. 5, 2017, 12:55 a.m.)
> 
> 
> Review request for sentry and Na Li.
> 
> 
> Repository: sentry
> 
> 
> Description
> ---
> 
> Instead of leaking new exceptions outside the API, use the
> existing authorization exceptions to indicate authorization
> failure when a user has no group configured.
> 
> 
> Diffs
> -
> 
>   
> sentry-binding/sentry-binding-hive-common/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBinding.java
>  8ce7a02ed4c565e34229a5c80c1b4fd1a84bad19 
>   
> sentry-binding/sentry-binding-hive/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBindingHookBase.java
>  9c60c22aac826affd05cdf28b3816c68c139326d 
>   
> sentry-binding/sentry-binding-hive/src/test/java/org/apache/sentry/binding/hive/TestHiveAuthzBindings.java
>  a41d1bd533157c96430c3bf3569e1612db77c7b2 
>   
> sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SentrySolrPluginImpl.java
>  91d08f0bc7f344c87e5bfb1e11b4b68728e676be 
>   
> sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SolrAuthzBinding.java
>  803e5eabf322cd120456a78c57f127ed4c94f5fc 
>   
> sentry-binding/sentry-binding-solr/src/test/java/org/apache/sentry/binding/solr/TestSolrAuthzBinding.java
>  f060b82da44f642e9a1dbff86e6e834fbc09cb2b 
>   
> sentry-core/sentry-core-common/src/main/java/org/apache/sentry/core/common/exception/SentryGroupNotFoundException.java
>  b978df69df1d777311146406278444ae4e7f83ee 
>   
> sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/AuthorizationProvider.java
>  2d82bcfcd5343d1b130df2f723d33a106d36ea81 
>   
> sentry-provider/sentry-provider-common/src/main/java/org/apac

Re: Review Request 64317: SENTRY-2085: Keep sentry exceptions contained within Sentry

2017-12-04 Thread Zachary Amsden via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64317/
---

(Updated Dec. 5, 2017, 12:55 a.m.)


Review request for sentry and Na Li.


Repository: sentry


Description
---

Instead of leaking new exceptions outside the API, use the
existing authorization exceptions to indicate authorization
failure when a user has no group configured.


Diffs (updated)
-

  
sentry-binding/sentry-binding-hive-common/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBinding.java
 8ce7a02ed4c565e34229a5c80c1b4fd1a84bad19 
  
sentry-binding/sentry-binding-hive/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBindingHookBase.java
 9c60c22aac826affd05cdf28b3816c68c139326d 
  
sentry-binding/sentry-binding-hive/src/test/java/org/apache/sentry/binding/hive/TestHiveAuthzBindings.java
 a41d1bd533157c96430c3bf3569e1612db77c7b2 
  
sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SentrySolrPluginImpl.java
 91d08f0bc7f344c87e5bfb1e11b4b68728e676be 
  
sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SolrAuthzBinding.java
 803e5eabf322cd120456a78c57f127ed4c94f5fc 
  
sentry-binding/sentry-binding-solr/src/test/java/org/apache/sentry/binding/solr/TestSolrAuthzBinding.java
 f060b82da44f642e9a1dbff86e6e834fbc09cb2b 
  
sentry-core/sentry-core-common/src/main/java/org/apache/sentry/core/common/exception/SentryGroupNotFoundException.java
 b978df69df1d777311146406278444ae4e7f83ee 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/AuthorizationProvider.java
 2d82bcfcd5343d1b130df2f723d33a106d36ea81 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/GroupMappingService.java
 7e85261070f133a6886434732d23d5a72894f8ef 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/HadoopGroupMappingService.java
 bde53d5f640c98f41dea54d54dfe708ffee5dcd3 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/ResourceAuthorizationProvider.java
 005724f3e3f8c623c2a266f60825cf77ac1ea777 
  
sentry-provider/sentry-provider-common/src/test/java/org/apache/sentry/provider/common/TestNoAuthorizationProvider.java
 fe01b062c592e17ffa336552986e83f3f5f294e3 
  
sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/provider/db/service/thrift/SentryPolicyStoreProcessor.java
 650880bb682d76c000fa51b497fae484c257b342 
  
sentry-provider/sentry-provider-db/src/test/java/org/apache/sentry/provider/db/generic/service/thrift/TestSentryGenericPolicyProcessor.java
 6597a7ca724d1377ad07d8bc18530eb89b659693 
  
sentry-provider/sentry-provider-file/src/main/java/org/apache/sentry/provider/file/LocalGroupMappingService.java
 54474203aed4868c3bde8450d4d27427fa1de7f6 
  
sentry-provider/sentry-provider-file/src/test/java/org/apache/sentry/provider/file/TestLocalGroupMapping.java
 9864b82bfd9c499ab2b1f8ba9d4664fe19899d4e 
  
sentry-solr/solr-sentry-handlers/src/main/java/org/apache/solr/handler/component/QueryDocAuthorizationComponent.java
 2338ab8375a6381e8d5fc8b38f766789187f69af 
  
sentry-tests/sentry-tests-hive-v2/src/test/java/org/apache/sentry/tests/e2e/hive/TestUserManagement.java
 02ac51454a13c0c1c61bb8684872e4815bd88b97 
  
sentry-tests/sentry-tests-hive/src/test/java/org/apache/sentry/tests/e2e/hive/TestUserManagement.java
 02ac51454a13c0c1c61bb8684872e4815bd88b97 


Diff: https://reviews.apache.org/r/64317/diff/2/

Changes: https://reviews.apache.org/r/64317/diff/1-2/


Testing
---

Running JUnit tests with mvn install.


Thanks,

Zachary Amsden



Re: Review Request 64317: SENTRY-2085: Keep sentry exceptions contained within Sentry

2017-12-04 Thread Zachary Amsden via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64317/#review192796
---




sentry-provider/sentry-provider-db/src/test/java/org/apache/sentry/provider/db/generic/service/thrift/TestSentryGenericPolicyProcessor.java
Lines 257-258 (patched)
<https://reviews.apache.org/r/64317/#comment271087>

Leftover garbage from failed test attempt.



sentry-tests/sentry-tests-hive-v2/src/test/java/org/apache/sentry/tests/e2e/hive/TestUserManagement.java
Line 383 (original)
<https://reviews.apache.org/r/64317/#comment271086>

Missing the same change for sentry-tests-hive


- Zachary Amsden


On Dec. 4, 2017, 11:30 p.m., Zachary Amsden wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64317/
> ---
> 
> (Updated Dec. 4, 2017, 11:30 p.m.)
> 
> 
> Review request for sentry and Na Li.
> 
> 
> Repository: sentry
> 
> 
> Description
> ---
> 
> Instead of leaking new exceptions outside the API, use the
> existing authorization exceptions to indicate authorization
> failure when a user has no group configured.
> 
> 
> Diffs
> -
> 
>   
> sentry-binding/sentry-binding-hive-common/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBinding.java
>  8ce7a02ed4c565e34229a5c80c1b4fd1a84bad19 
>   
> sentry-binding/sentry-binding-hive/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBindingHookBase.java
>  9c60c22aac826affd05cdf28b3816c68c139326d 
>   
> sentry-binding/sentry-binding-hive/src/test/java/org/apache/sentry/binding/hive/TestHiveAuthzBindings.java
>  a41d1bd533157c96430c3bf3569e1612db77c7b2 
>   
> sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SentrySolrPluginImpl.java
>  91d08f0bc7f344c87e5bfb1e11b4b68728e676be 
>   
> sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SolrAuthzBinding.java
>  803e5eabf322cd120456a78c57f127ed4c94f5fc 
>   
> sentry-binding/sentry-binding-solr/src/test/java/org/apache/sentry/binding/solr/TestSolrAuthzBinding.java
>  f060b82da44f642e9a1dbff86e6e834fbc09cb2b 
>   
> sentry-core/sentry-core-common/src/main/java/org/apache/sentry/core/common/exception/SentryGroupNotFoundException.java
>  b978df69df1d777311146406278444ae4e7f83ee 
>   
> sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/AuthorizationProvider.java
>  2d82bcfcd5343d1b130df2f723d33a106d36ea81 
>   
> sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/GroupMappingService.java
>  7e85261070f133a6886434732d23d5a72894f8ef 
>   
> sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/HadoopGroupMappingService.java
>  bde53d5f640c98f41dea54d54dfe708ffee5dcd3 
>   
> sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/ResourceAuthorizationProvider.java
>  005724f3e3f8c623c2a266f60825cf77ac1ea777 
>   
> sentry-provider/sentry-provider-common/src/test/java/org/apache/sentry/provider/common/TestNoAuthorizationProvider.java
>  fe01b062c592e17ffa336552986e83f3f5f294e3 
>   
> sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/provider/db/service/thrift/SentryPolicyStoreProcessor.java
>  650880bb682d76c000fa51b497fae484c257b342 
>   
> sentry-provider/sentry-provider-db/src/test/java/org/apache/sentry/provider/db/generic/service/thrift/TestSentryGenericPolicyProcessor.java
>  6597a7ca724d1377ad07d8bc18530eb89b659693 
>   
> sentry-provider/sentry-provider-file/src/main/java/org/apache/sentry/provider/file/LocalGroupMappingService.java
>  54474203aed4868c3bde8450d4d27427fa1de7f6 
>   
> sentry-provider/sentry-provider-file/src/test/java/org/apache/sentry/provider/file/TestLocalGroupMapping.java
>  9864b82bfd9c499ab2b1f8ba9d4664fe19899d4e 
>   
> sentry-solr/solr-sentry-handlers/src/main/java/org/apache/solr/handler/component/QueryDocAuthorizationComponent.java
>  2338ab8375a6381e8d5fc8b38f766789187f69af 
>   
> sentry-tests/sentry-tests-hive-v2/src/test/java/org/apache/sentry/tests/e2e/hive/TestUserManagement.java
>  02ac51454a13c0c1c61bb8684872e4815bd88b97 
> 
> 
> Diff: https://reviews.apache.org/r/64317/diff/1/
> 
> 
> Testing
> ---
> 
> Running JUnit tests with mvn install.
> 
> 
> Thanks,
> 
> Zachary Amsden
> 
>



Review Request 64317: SENTRY-2085: Keep sentry exceptions contained within Sentry

2017-12-04 Thread Zachary Amsden via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64317/
---

Review request for sentry and Na Li.


Repository: sentry


Description
---

Instead of leaking new exceptions outside the API, use the
existing authorization exceptions to indicate authorization
failure when a user has no group configured.


Diffs
-

  
sentry-binding/sentry-binding-hive-common/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBinding.java
 8ce7a02ed4c565e34229a5c80c1b4fd1a84bad19 
  
sentry-binding/sentry-binding-hive/src/main/java/org/apache/sentry/binding/hive/authz/HiveAuthzBindingHookBase.java
 9c60c22aac826affd05cdf28b3816c68c139326d 
  
sentry-binding/sentry-binding-hive/src/test/java/org/apache/sentry/binding/hive/TestHiveAuthzBindings.java
 a41d1bd533157c96430c3bf3569e1612db77c7b2 
  
sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SentrySolrPluginImpl.java
 91d08f0bc7f344c87e5bfb1e11b4b68728e676be 
  
sentry-binding/sentry-binding-solr/src/main/java/org/apache/sentry/binding/solr/authz/SolrAuthzBinding.java
 803e5eabf322cd120456a78c57f127ed4c94f5fc 
  
sentry-binding/sentry-binding-solr/src/test/java/org/apache/sentry/binding/solr/TestSolrAuthzBinding.java
 f060b82da44f642e9a1dbff86e6e834fbc09cb2b 
  
sentry-core/sentry-core-common/src/main/java/org/apache/sentry/core/common/exception/SentryGroupNotFoundException.java
 b978df69df1d777311146406278444ae4e7f83ee 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/AuthorizationProvider.java
 2d82bcfcd5343d1b130df2f723d33a106d36ea81 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/GroupMappingService.java
 7e85261070f133a6886434732d23d5a72894f8ef 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/HadoopGroupMappingService.java
 bde53d5f640c98f41dea54d54dfe708ffee5dcd3 
  
sentry-provider/sentry-provider-common/src/main/java/org/apache/sentry/provider/common/ResourceAuthorizationProvider.java
 005724f3e3f8c623c2a266f60825cf77ac1ea777 
  
sentry-provider/sentry-provider-common/src/test/java/org/apache/sentry/provider/common/TestNoAuthorizationProvider.java
 fe01b062c592e17ffa336552986e83f3f5f294e3 
  
sentry-provider/sentry-provider-db/src/main/java/org/apache/sentry/provider/db/service/thrift/SentryPolicyStoreProcessor.java
 650880bb682d76c000fa51b497fae484c257b342 
  
sentry-provider/sentry-provider-db/src/test/java/org/apache/sentry/provider/db/generic/service/thrift/TestSentryGenericPolicyProcessor.java
 6597a7ca724d1377ad07d8bc18530eb89b659693 
  
sentry-provider/sentry-provider-file/src/main/java/org/apache/sentry/provider/file/LocalGroupMappingService.java
 54474203aed4868c3bde8450d4d27427fa1de7f6 
  
sentry-provider/sentry-provider-file/src/test/java/org/apache/sentry/provider/file/TestLocalGroupMapping.java
 9864b82bfd9c499ab2b1f8ba9d4664fe19899d4e 
  
sentry-solr/solr-sentry-handlers/src/main/java/org/apache/solr/handler/component/QueryDocAuthorizationComponent.java
 2338ab8375a6381e8d5fc8b38f766789187f69af 
  
sentry-tests/sentry-tests-hive-v2/src/test/java/org/apache/sentry/tests/e2e/hive/TestUserManagement.java
 02ac51454a13c0c1c61bb8684872e4815bd88b97 


Diff: https://reviews.apache.org/r/64317/diff/1/


Testing
---

Running JUnit tests with mvn install.


Thanks,

Zachary Amsden



Anyone care to do a quick review?

2017-11-29 Thread Zachary Amsden
It's a simple test incompatibility fix between Hadoop 2.6 and 3.0.

Also, I'd like to point out that when writing tests, the default behavior
of the Python test suite is to verify that the /sorted/ test output matches.
That doesn't work with regex or row_regex matching (although I may fix this
soon to work around another incompatible change), and it isn't appropriate
for all tests, especially for things like planner tests, where the exact
order actually does matter.  To get in-order verification of test output,
with regex capability, you need to use VERIFY_IS_EQUAL on the results.

Anyway, here's the diff, I'm looking for another +1 before I commit:

https://gerrit.cloudera.org/#/c/8656/


IMPALA-5607

2017-10-31 Thread Zachary Amsden
In discussion about IMPALA-5607, it came up that implementing this as
described is a compatibility breaking change.  There are a couple of
required changes:

1) The return data type for date_part() and EXTRACT must be promoted to
BIGINT.
2) MILLISECONDS will now include the seconds part in the calculation of
milliseconds.

I think the first is a non-issue; we're promoting the type to be wide
enough to hold nanosecond-precision values (with the seconds component
included).  An alternative would be to return this as a decimal type, but
that seems rather unwieldy for other date expressions, so I'd prefer these
values all to be returned as integral types.

The bigger issue is that including seconds in the calculation of
milliseconds, microseconds, and nanoseconds breaks the existing value
returned for milliseconds, which today is bare milliseconds with no seconds
component (for example, extracting milliseconds from a time of 12:34:56.789
would return 56789 instead of the current 789).  For compatibility with
other SQL implementations, I think we'd like to include seconds with all of
these date parts, but that is certainly debatable.

The question then, is anyone relying on this functionality that can't
easily workaround such a change?  The Impala documentation doesn't specify
this behavior either way, and there isn't a formal specification for how
sub-second granularity time is handled.  Whatever we decide, we should
document this going forward.

 - Zach


Re: [VOTE] Graduate to a TLP

2017-10-18 Thread Zachary Amsden
+1

On Wed, Oct 18, 2017 at 11:35 AM, David Knupp  wrote:

> +1
>
>> Daniel Hecht 
>> October 18, 2017 at 11:32 AM
>> +1
>> Anuj Phadke 
>> October 18, 2017 at 9:23 AM
>> +1
>>
>> On Wed, Oct 18, 2017 at 9:17 AM Joe McDonnell 
>>
>> Joe McDonnell 
>> October 18, 2017 at 9:16 AM
>> +1
>>
>>
>> John Sherman 
>> October 18, 2017 at 9:11 AM
>>
>> +1
>>
>> I’ve been very happy with my interactions with the Impala community.
>>
>> - John
>>
>> Philip Zeyliger 
>> October 18, 2017 at 8:06 AM
>> +1!
>>
>> -- Philip
>>
>>
>>
> --
> David Knupp | Software Engineer
> Cloudera Inc., 433 California Street #1100. San Francisco, CA 94104
>


Re: Build broken by b66af0357e - IMPALA-5854: Update external hadoop versions

2017-09-01 Thread Zachary Amsden
If this fails for you, wiping out toolchain/cdh_components and restarting
the build should fix the problem.

If not, ping me!

 - Zach

On Fri, Sep 1, 2017 at 10:19 AM, Tim Armstrong 
wrote:

> It works ok for me. Maybe you need to re-source impala-config.sh or
> re-bootstrap to pull down the new components?
>
> On Fri, Sep 1, 2017 at 10:17 AM, Lars Volker  wrote:
>
> > It looks like the recent Hadoop version update broke the build for me. I
> > get this error; with the previous commit it still works. Is anyone else
> > seeing this?
> >
> > be/src/common/hdfs.h:30:18: fatal error: hdfs.h: No such file or
> > directory
> >  #include <hdfs.h>
> >
>


Re: Making a new development environment from scratch

2017-08-11 Thread Zachary Amsden
Nice!  I had to hack around this as well to get a different Java version
set up.  It may be worth making the Java version a configurable parameter.

 - Zach

On Fri, Aug 11, 2017 at 10:16 AM, Jim Apple  wrote:

> bin/bootstrap_development.sh now no longer references the chef repo <
> https://github.com/awleblang/impala-setup>. My hope is that this makes it
> easier to maintain and extend to new Linux distributions.
>
> It now also supports Ubuntu 16.04 and 14.04. I have tested it in EC2,
> Google Compute Engine, and Docker.
>
> It currently uses OpenJDK -- version 7 on 14.04 and version 8 on 16.04. It
> could be changed to Oracle JDK on 16.04:
> https://issues.apache.org/jira/browse/IMPALA-5793
>
> The pre-merge job has been switched away from using <
> https://jenkins.impala.io/view/Utility/job/ubuntu-14.04-from-scratch/> to
> using . You will
> need to rebase before starting the pre-merge job.
>


Re: IMPALA-3738 advice requested

2017-06-27 Thread Zachary Amsden
I would recommend care when messing with the DCHECK_EQ macros and
user-defined types; I've been burned by this many times.  This may have
been fixed by the new glog-0.3.4-p2 toolchain update: the scary old comment
"TODO(hamaji): Figure out a way to fix" and the "#if 1 using ::operator<<;
#endif" hack seem to have been removed, so YMMV.  Maybe the friend
suggestion works now.

Also, suggestion number 4:

4.  Wait for reflection to arrive in C++20, whereby we get this
functionality for free.

I wouldn't invest heavily in stringify functionality that requires
maintenance of any kind at all.  DCHECKs are mainly useful for generating
cores, where the enum values are already available via debug symbols anyway.

 - Zach

On Tue, Jun 27, 2017 at 2:06 PM, Jim Apple  wrote:

> I'm not sure it's worth the effort to write all of the stringify methods.
> My order of preference is:
>
> 1. Just use friend with the current conversion-to-int behavior.
>
> 2. Do something from this list so there won't be manual work for each new
> enum decl:
> https://stackoverflow.com/questions/28828957/enum-to-string-
> in-modern-c-and-future-c17-c20
>
> 3. Write a stringify function for each enum type.
>
> On Tue, Jun 27, 2017 at 1:42 PM, Steve Carlin 
> wrote:
>
> > Hi all,
> >
> > Short time user, first time poster.  This post is in reference to the
> > following Jira:  https://issues.apache.org/jira/browse/IMPALA-3738
> >
> > I'm currently looking at IMPALA-3738, which involves changing enums to
> > C++11 style (e.g. "enum A" to "enum class A").
> >
> > Unfortunately, as with everything in life, there are complications.
> > Nothing major, but wanted some advice before I changed a lot of code only
> > to find out it gets rejected.
> >
> > Chrome did a similar fix and had a similar complication.  For details,
> you
> > can reference this code review:
> > https://codereview.chromium.org/835783003/patch/1/10003
> >
> > But the problem is this:  Simply changing it to C++11 style gives us a
> > compiler error with the macro DCHECK_EQ.  The solution which they used in
> > chrome was to add a "friend" class.  I'm ok with doing this unless
> someone
> > has another suggestion.
> >
> > On top of this though, I have another question.  The DCHECK_EQ macro will
> > print a message such as this:  Check failed: x == y (1 vs. 3).
> > Specifically, my question has to do with the "1 vs 3" portion.  Do we
> want
> > to print a more user friendly message?  If so, we would have to do
> > something similar to what the chrome checkin has:  They call a method
> that
> > converts the enum to a string and prints out the string value (they
> already
> > had these methods in place before their change).  Is it worth doing this
> > kind of rewrite, or are we ok with just having the enum value?
> >
> > Thanks!
> >
> > Steve
> >
>


Can a committer carry a +2

2017-06-07 Thread Zachary Amsden
Last GVO run passed with flying colors:
https://gerrit.cloudera.org/#/c/6935/

Thanks in advance,

 - Zach


Re: connection refuseLeap status not available

2017-05-24 Thread Zachary Amsden
This is similar to what I hit with NTP the other day after a restart.  I
tried a number of things, and I think the only thing that worked was
waiting for NTP to sync.  Pitfalls: ntpdate requires a host on the command
line, and doesn't read the configuration file.

There was some circumstantial evidence that a connection to the Ubuntu NTP
pool failed during boot - overload or random connection failure, but I
never got an exact cause.

On Wed, May 24, 2017 at 12:09 PM, Tim Armstrong 
wrote:

> I see that with some frequency when restarting my system.
>
> I usually manage to fix it with a non-scientific approach of running some
> combination of these commands until ntp-wait works:
>
> sudo service ntp restart
> sudo ntpdate -s ntp.ubuntu.com
> ntp-wait -v
>
> On Wed, May 24, 2017 at 11:33 AM, Jim Apple  wrote:
>
> > testdata/cluster/admin
> >
> > calls ntp-wait which returns an error:
> >
> > "ntpq: read: Connection refuseLeap status not avalaible"
> >
> > Has anyone seen this? There are 0 Google hits for refuseLeap and the
> > string is not present in my repository or my /etc/.
> >
>


Re: Bootstrapping a development environment on Ubuntu 16.04

2017-05-22 Thread Zachary Amsden
FWIW, I've tried 16.04 XFCE / Ubuntu Studio variants on two different HW
machines now.  I was able to build Impala on one and load data, but both
machines flaked out with weird graphics-related failures after kernel
upgrades, resulting in a black screen on boot.  I was able to debug one of
them enough to get it running again (and enough to know I didn't want to
repeat the procedure), so I upgraded to 17.04 and have a somewhat working
system.  I'd recommend avoiding any customization of packages and installs
unless you enjoy headless debugging, and I'd recommend avoiding 17.04
entirely if you want a stable, supported base.

 - Zach

On Sun, May 21, 2017 at 11:13 AM, Jim Apple  wrote:

> I was able to run all the data loading and all of the core tests. One
> Kudu test flaked a couple of times on a timeout error.
>
> On Sun, May 21, 2017 at 10:10 AM, Tim Armstrong 
> wrote:
> > Did you manage to run all of the tests and data loading? I remember a
> while
> > back I was having issues with Kudu crashing - never retried to see if it
> > got better.
> >
> > On Sat, May 20, 2017 at 9:23 PM, Jim Apple  wrote:
> >
> >> I am hoping to upgrade from Ubuntu 14.04 to 16.04. In order to check
> >> that I could still do Impala development work, I bootstrapped a
> >> development environment on a 16.04 cloud instance. The script that
> >> worked for me is here:
> >>
> >> https://cwiki.apache.org/confluence/display/IMPALA/Bootstrap
> >> ping+an+Impala+Development+Environment+From+Scratch
> >>
> >> While it might be useful to have this as a Jenkins job, I have no
> >> plans to create such a job.
> >>
> >> It might be a bit cleaner to get this done in the chef repo in
> >> https://github.com/awleblang/impala-setup, but I don't plan to do that
> >> at this time.
> >>
>


Could someone kick off a GVO for this?

2017-05-15 Thread Zachary Amsden
https://gerrit.cloudera.org/#/c/6335/


Nested namespaces in C++ code

2017-05-02 Thread Zachary Amsden
So I've recently come upon a situation where it would be useful to have
nested namespaces.  The reason is that there are a lot of static functions
in a class which I would like to benchmark.  Writing the same code to run
the benchmark test over and over is very tedious and is much more
efficiently accomplished with a macro.  However, since the functions are
embedded as static functions in a class, it isn't possible to pass the
names along as macro parameters, which would make this task much easier.
Instead I have to resort to a lot of duplicated parameters or multiple
levels of macros (since I can't easily convert :: into a legal part of a
function name).

If the functions were embedded instead in a nested namespace, a simple
'using namespace outer::nested' directive would bring all of these to the
top level in the benchmark, and all would be well.

I've seen conflicting guidance about nested namespaces in Apache C++ style
guides, so I'm wondering if we could have a discussion on that.

Thanks,

 - Zach


Verification run for 6389

2017-04-07 Thread Zachary Amsden
Assuming things are good with the build, could someone kick off a
verification run for https://gerrit.cloudera.org/#/c/6389/

Thanks,

 - Zach


Re: Impala JIRA migration to https://issues.apache.org/jira is complete

2017-03-14 Thread Zachary Amsden
As one of those affected, what should I do?  I already created a new
account as zamsden, but all of my issues are assigned to
zamsden_impala_ad21.  Should I reassign and transfer things over, or am I
forever doomed to be zamsden_impala_ad21?

Thanks,

 - Zach

On Mon, Mar 13, 2017 at 2:24 PM, Jim Apple  wrote:

> *If you read or write Impala JIRAs, this message is for you.*
>
> Impala's JIRA is now at https://issues.apache.org/jira/browse/IMPALA/.
> Links like https://issues.cloudera.org/browse/IMPALA-3224 should now
> redirect to https://issues.apache.org/jira/browse/IMPALA-3224.
>
> Permissions on the Apache JIRA are much more strict. We have tried to make
> them almost as permissive for the IMPALA project as they were on the IMPALA
> project on issues.cloudera.org. If you need permissions you don't have,
> please email d...@impala.apache.org.
>
> Other than that, we don't know of any major issues at this point, however
> we have found and fixed some unexpected problems so far, and others may
> exist. If you find something that is impeding your work, please file a
> subtask of https://issues.apache.org/jira/browse/IMPALA-5064.
>
> We tried to translate usernames as accurately as possible. In many cases
> this worked just fine. In some cases this was not possible, and you will
> see the issues that you participated on as $OLDUSERNAME now list you as
> $OLDUSERNAME_impala_1de3, with some random-looking hex digits after your
> name. You can claim that username by resetting your password:
>
> 1. Go to https://issues.apache.org/jira/secure/ForgotLoginDetails.jspa
> 2. Forgot Password, enter the username that looks like
> $OLDUSERNAME_impala_1ead
> 3. Wait for email
> 4. Reset password
>
> If you have problems with that process, please follow
> https://www.apache.org/dev/infra-contact and contact Apache
> infrastructure. The Impala community does not administer the JIRA server.
>


GVO request

2017-03-13 Thread Zachary Amsden
Looking to start a verification run for
https://gerrit.cloudera.org/#/c/6275/

Thanks in advance,

 - Zach


Re: status-benchmark.cc compilation time

2017-02-23 Thread Zachary Amsden
Yes.  If you take a look at the benchmark, you'll notice the JNI call to
initialize the frontend doesn't even have the right signature anymore.
That's one easy way to bitrot while still compiling.

Even fixing that isn't enough to get it off the ground.

 - Zach

On Tue, Feb 21, 2017 at 11:44 AM, Henry Robinson  wrote:

> Did you run . bin/set-classpath.sh before running expr-benchmark?
>
> On 21 February 2017 at 11:30, Zachary Amsden  wrote:
>
> > Unfortunately some of the benchmarks have actually bit-rotted.  For
> > example, expr-benchmark compiles but immediately throws JNI exceptions.
> >
> > On Tue, Feb 21, 2017 at 10:55 AM, Marcel Kornacker 
> > wrote:
> >
> > > I'm also in favor of not compiling it on the standard commandline.
> > >
> > > However, I'm very much against allowing the benchmarks to bitrot. As
> > > was pointed out, those benchmarks can be valuable tools during
> > > development, and keeping them in working order shouldn't really impact
> > > the development process.
> > >
> > > In other words, let's compile them as part of gvo.
> > >
> > > On Tue, Feb 21, 2017 at 10:50 AM, Alex Behm 
> > > wrote:
> > > > +1 for not compiling the benchmarks in -notests
> > > >
> > > > On Mon, Feb 20, 2017 at 7:55 PM, Jim Apple 
> > wrote:
> > > >
> > > >> > On which note, would anyone object if we disabled benchmark
> > > compilation
> > > >> by
> > > >> > default when building the BE tests? I mean separating out -notests
> > > into
> > > >> > -notests and -build_benchmarks (the latter false by default).
> > > >>
> > > >> I think this is a great idea.
> > > >>
> > > >> > I don't mind if the benchmarks bitrot as a result, because we
> don't
> > > run
> > > >> > them regularly or pay attention to their output except when
> > > developing a
> > > >> > feature. Of course, maybe an 'exhaustive' run should build the
> > > benchmarks
> > > >> > as well just to keep us honest, but I'd be happy if 95% of Jenkins
> > > builds
> > > >> > didn't bother.
> > > >>
> > > >> The pre-merge (aka GVM aka GVO) testing builds
> > > >> http://jenkins.impala.io:8080/job/all-build-options, which builds
> > > >> without the "-notests" flag.
> > > >>
> > >
> >
>
>
>
> --
> Henry Robinson
> Software Engineer
> Cloudera
> 415-994-6679
>


Re: status-benchmark.cc compilation time

2017-02-21 Thread Zachary Amsden
Unfortunately some of the benchmarks have actually bit-rotted.  For
example, expr-benchmark compiles but immediately throws JNI exceptions.

On Tue, Feb 21, 2017 at 10:55 AM, Marcel Kornacker 
wrote:

> I'm also in favor of not compiling it on the standard commandline.
>
> However, I'm very much against allowing the benchmarks to bitrot. As
> was pointed out, those benchmarks can be valuable tools during
> development, and keeping them in working order shouldn't really impact
> the development process.
>
> In other words, let's compile them as part of gvo.
>
> On Tue, Feb 21, 2017 at 10:50 AM, Alex Behm 
> wrote:
> > +1 for not compiling the benchmarks in -notests
> >
> > On Mon, Feb 20, 2017 at 7:55 PM, Jim Apple  wrote:
> >
> >> > On which note, would anyone object if we disabled benchmark
> compilation
> >> by
> >> > default when building the BE tests? I mean separating out -notests
> into
> >> > -notests and -build_benchmarks (the latter false by default).
> >>
> >> I think this is a great idea.
> >>
> >> > I don't mind if the benchmarks bitrot as a result, because we don't
> run
> >> > them regularly or pay attention to their output except when
> developing a
> >> > feature. Of course, maybe an 'exhaustive' run should build the
> benchmarks
> >> > as well just to keep us honest, but I'd be happy if 95% of Jenkins
> builds
> >> > didn't bother.
> >>
> >> The pre-merge (aka GVM aka GVO) testing builds
> >> http://jenkins.impala.io:8080/job/all-build-options, which builds
> >> without the "-notests" flag.
> >>
>


Re: If FE tests hang on your machine, try restarting the minicluster

2017-02-17 Thread Zachary Amsden
There was a brief connectivity interruption today with an external NTP pool
that broke Kudu for me, could be the same issue.

On Fri, Feb 17, 2017 at 11:56 AM, Lars Volker  wrote:

> Hi All,
>
> The frontend tests seemed to hang on my local dev machine. Running jstack
> on the child of a hanging test process gave the attached stacks.
>
> From the stacks it looked like Kudu was implicated in the hang, and a
> restart of the local minicluster fixed the issue. Hope it helps someone. :)
>
> Cheers, Lars
>


Can someone +2 REPLACE() functionality and kick off a verification run?

2017-02-14 Thread Zachary Amsden
Got hit by a front-end parser error, then a clang-tidy failure, then a
merge conflict.  It should be passing now.

https://gerrit.cloudera.org/#/c/5776/

Thanks in advance,

 - Zach


Re: Preferred syntax for warning about ignored Status returns

2017-01-09 Thread Zachary Amsden
Yes Jim, exactly.  I also sent the wrong link at least once, but this was
my point.  In the absence of compiler support, I think it is actually
possible to implement this anyway using syntax 1, but that is left as an
exercise for the reader.

On Mon, Jan 9, 2017 at 10:20 AM, Jim Apple  wrote:

> Though the attribute is "on" the definition, it can't appear in that
> location (after parameters, before curly braces). I don't know why.
>
> I think I now understand what you were saying above: if we use
> WARN_UNUSED_RESULT, then it can go after the function for functions
> with prototypes but must go earlier in the line for functions without
> prototypes.
>
> Did I get that right?
>
> On Mon, Jan 9, 2017 at 10:11 AM, Zachary Amsden 
> wrote:
> > Maybe I'm just being dense but I couldn't get it to work with syntax 2
> on a
> > function definition without having a separate forward declaration:
> >
> > https://godbolt.org/g/ODtoQC  vs. https://godbolt.org/g/WCxDZv
> > (non-functional)
> >
> >  - Zach
> >
> > On Mon, Jan 9, 2017 at 10:00 AM, Jim Apple  wrote:
> >
> >> That's applying it to the type definition. At the type use:
> >>
> >> https://godbolt.org/g/RMYVW7
> >>
> >> On Mon, Jan 9, 2017 at 9:56 AM, Zachary Amsden 
> >> wrote:
> >> > GCC doesn't catch this when optimization is enabled and the result is
> >> > discarded:
> >> >
> >> > https://godbolt.org/g/4b0BQC
> >> >
> >> > I think that means a type wrapper approach is needed, which probably
> >> > necessitates option 1.
> >> >
> >> >  - Zach
> >> >
> >> > On Mon, Jan 9, 2017 at 9:17 AM, Jim Apple 
> wrote:
> >> >
> >> >> My vote, as I mentioned on the patch, is option 1. I see MUST_USE(T)
> >> >> as a property of T, like const T or volatile T. I think it is dual to
> >> >> move semantics or to Rust's ability to temporarily or permanently
> >> >> consume values so that /only/ one copy is in use rather than
> >> >> MUST_USE's /at least one/.
> >> >>
> >> >> https://en.wikipedia.org/wiki/Substructural_type_system
> >> >>
> >> >> On Fri, Jan 6, 2017 at 4:05 PM, Taras Bobrovytsky
> >> >>  wrote:
> >> >> > I'd vote for #2. I think it's better to have less important
> >> information
> >> >> > (such as qualifiers) towards the end of lines. (I think it would be
> >> nice
> >> >> if
> >> >> > modifiers such as public and private were at the end of method
> >> >> declarations
> >> >> > in Java, for example: void myMethod() private static {...})
> >> >> >
> >> >> > On Fri, Jan 6, 2017 at 3:38 PM, Daniel Hecht 
> >> >> wrote:
> >> >> >
> >> >> >> As I indicated in the original review, my preference is #2 but I
> >> don't
> >> >> feel
> >> >> >> too strongly.
> >> >> >>
> >> >> >> On Fri, Jan 6, 2017 at 2:59 PM, Tim Armstrong <
> >> tarmstr...@cloudera.com>
> >> >> >> wrote:
> >> >> >>
> >> >> >> > Hi All,
> >> >> >> >   I wanted to poll the Impala community for opinions about style
> >> for
> >> >> >> > declaring functions where the caller is expected to do something
> >> with
> >> >> the
> >> >> >> > return value.
> >> >> >> >
> >> >> >> > Ideally we'd be able to declare Status with an attribute that
> made
> >> >> this
> >> >> >> > take effect globally, but unfortunately that's not available
> until
> >> >> C++17.
> >> >> >> >
> >> >> >> > So we need to annotate each Status-returning function. The two
> >> >> >> alternatives
> >> >> >> > we discussed on this CR (https://gerrit.cloudera.org/#/c/4878/)
> >> were:
> >> >> >> >
> >> >> >> > #1 - a special macro wrapping Status
> >> >> >> >
> >> >> >> > MUST_USE(Status) DoSomethingThatCanFail(int64_t foo, Bar* bar);
> >> >> >> >
> >> >> >> > Pros:
> >> >> >> > * Closely connected to the return type that it affects
> >> >> >> > * It's easier to search/replace Status with MUST_USE(Status)
> >> >> >> >
> >> >> >> > Cons:
> >> >> >> > * Could get visually noisy if we use it everywhere
> >> >> >> >
> >> >> >> > #2 - a macro that gets appended to the declaration:
> >> >> >> >
> >> >> >> > Status DoSomethingThatCanFail(int64_t foo, Bar* bar)
> >> >> WARN_UNUSED_RESULT;
> >> >> >> >
> >> >> >> > Pros:
> >> >> >> > * Macro is slightly
> >> >> >> > * Less visually noisy since it's at the end of the declaration
> >> >> >> >
> >> >> >> > What do people think?
> >> >> >> >
> >> >> >>
> >> >>
> >>
>


Re: Preferred syntax for warning about ignored Status returns

2017-01-09 Thread Zachary Amsden
Maybe I'm just being dense but I couldn't get it to work with syntax 2 on a
function definition without having a separate forward declaration:

https://godbolt.org/g/ODtoQC  vs. https://godbolt.org/g/WCxDZv
(non-functional)

 - Zach

On Mon, Jan 9, 2017 at 10:00 AM, Jim Apple  wrote:

> That's applying it to the type definition. At the type use:
>
> https://godbolt.org/g/RMYVW7
>
> On Mon, Jan 9, 2017 at 9:56 AM, Zachary Amsden 
> wrote:
> > GCC doesn't catch this when optimization is enabled and the result is
> > discarded:
> >
> > https://godbolt.org/g/4b0BQC
> >
> > I think that means a type wrapper approach is needed, which probably
> > necessitates option 1.
> >
> >  - Zach
> >
> > On Mon, Jan 9, 2017 at 9:17 AM, Jim Apple  wrote:
> >
> >> My vote, as I mentioned on the patch, is option 1. I see MUST_USE(T)
> >> as a property of T, like const T or volatile T. I think it is dual to
> >> move semantics or to Rust's ability to temporarily or permanently
> >> consume values so that /only/ one copy is in use rather than
> >> MUST_USE's /at least one/.
> >>
> >> https://en.wikipedia.org/wiki/Substructural_type_system
> >>
> >> On Fri, Jan 6, 2017 at 4:05 PM, Taras Bobrovytsky
> >>  wrote:
> >> > I'd vote for #2. I think it's better to have less important
> information
> >> > (such as qualifiers) towards the end of lines. (I think it would be
> nice
> >> if
> >> > modifiers such as public and private were at the end of method
> >> declarations
> >> > in Java, for example: void myMethod() private static {...})
> >> >
> >> > On Fri, Jan 6, 2017 at 3:38 PM, Daniel Hecht 
> >> wrote:
> >> >
> >> >> As I indicated in the original review, my preference is #2 but I
> don't
> >> feel
> >> >> too strongly.
> >> >>
> >> >> On Fri, Jan 6, 2017 at 2:59 PM, Tim Armstrong <
> tarmstr...@cloudera.com>
> >> >> wrote:
> >> >>
> >> >> > Hi All,
> >> >> >   I wanted to poll the Impala community for opinions about style
> for
> >> >> > declaring functions where the caller is expected to do something
> with
> >> the
> >> >> > return value.
> >> >> >
> >> >> > Ideally we'd be able to declare Status with an attribute that made
> >> this
> >> >> > take effect globally, but unfortunately that's not available until
> >> C++17.
> >> >> >
> >> >> > So we need to annotate each Status-returning function. The two
> >> >> alternatives
> >> >> > we discussed on this CR (https://gerrit.cloudera.org/#/c/4878/)
> were:
> >> >> >
> >> >> > #1 - a special macro wrapping Status
> >> >> >
> >> >> > MUST_USE(Status) DoSomethingThatCanFail(int64_t foo, Bar* bar);
> >> >> >
> >> >> > Pros:
> >> >> > * Closely connected to the return type that it affects
> >> >> > * It's easier to search/replace Status with MUST_USE(Status)
> >> >> >
> >> >> > Cons:
> >> >> > * Could get visually noisy if we use it everywhere
> >> >> >
> >> >> > #2 - a macro that gets appended to the declaration:
> >> >> >
> >> >> > Status DoSomethingThatCanFail(int64_t foo, Bar* bar)
> >> WARN_UNUSED_RESULT;
> >> >> >
> >> >> > Pros:
> >> >> > * Macro is slightly
> >> >> > * Less visually noisy since it's at the end of the declaration
> >> >> >
> >> >> > What do people think?
> >> >> >
> >> >>
> >>
>


Re: Preferred syntax for warning about ignored Status returns

2017-01-09 Thread Zachary Amsden
GCC doesn't catch this when optimization is enabled and the result is
discarded:

https://godbolt.org/g/4b0BQC

I think that means a type wrapper approach is needed, which probably
necessitates option 1.

 - Zach

On Mon, Jan 9, 2017 at 9:17 AM, Jim Apple  wrote:

> My vote, as I mentioned on the patch, is option 1. I see MUST_USE(T)
> as a property of T, like const T or volatile T. I think it is dual to
> move semantics or to Rust's ability to temporarily or permanently
> consume values so that /only/ one copy is in use rather than
> MUST_USE's /at least one/.
>
> https://en.wikipedia.org/wiki/Substructural_type_system
>
> On Fri, Jan 6, 2017 at 4:05 PM, Taras Bobrovytsky
>  wrote:
> > I'd vote for #2. I think it's better to have less important information
> > (such as qualifiers) towards the end of lines. (I think it would be nice
> if
> > modifiers such as public and private were at the end of method
> declarations
> > in Java, for example: void myMethod() private static {...})
> >
> > On Fri, Jan 6, 2017 at 3:38 PM, Daniel Hecht 
> wrote:
> >
> >> As I indicated in the original review, my preference is #2 but I don't
> feel
> >> too strongly.
> >>
> >> On Fri, Jan 6, 2017 at 2:59 PM, Tim Armstrong 
> >> wrote:
> >>
> >> > Hi All,
> >> >   I wanted to poll the Impala community for opinions about style for
> >> > declaring functions where the caller is expected to do something with
> the
> >> > return value.
> >> >
> >> > Ideally we'd be able to declare Status with an attribute that made
> this
> >> > take effect globally, but unfortunately that's not available until
> C++17.
> >> >
> >> > So we need to annotate each Status-returning function. The two
> >> alternatives
> >> > we discussed on this CR (https://gerrit.cloudera.org/#/c/4878/) were:
> >> >
> >> > #1 - a special macro wrapping Status
> >> >
> >> > MUST_USE(Status) DoSomethingThatCanFail(int64_t foo, Bar* bar);
> >> >
> >> > Pros:
> >> > * Closely connected to the return type that it affects
> >> > * It's easier to search/replace Status with MUST_USE(Status)
> >> >
> >> > Cons:
> >> > * Could get visually noisy if we use it everywhere
> >> >
> >> > #2 - a macro that gets appended to the declaration:
> >> >
> >> > Status DoSomethingThatCanFail(int64_t foo, Bar* bar)
> WARN_UNUSED_RESULT;
> >> >
> >> > Pros:
> >> > * Macro is slightly
> >> > * Less visually noisy since it's at the end of the declaration
> >> >
> >> > What do people think?
> >> >
> >>
>


Re: [PATCH 3/3] Fix TSC MSR read in nested SVM

2011-08-03 Thread Zachary Amsden
Caution: this requires more care.

Pretty sure this breaks userspace suspend at the cost of supporting a
not-so-reasonable hardware feature.

On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El  wrote:
> When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
> read without exit), we need to return L2's notion of the TSC, not L1's.
>
> The current code incorrectly returned L1 TSC, because svm_get_msr() was also
> used in x86.c where this was assumed, but now that these places call the new
> svm_read_l1_tsc(), the MSR read can be fixed.
>
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/svm.c |    4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> --- .before/arch/x86/kvm/svm.c  2011-08-02 15:51:02.0 +0300
> +++ .after/arch/x86/kvm/svm.c   2011-08-02 15:51:02.0 +0300
> @@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *
>
>        switch (ecx) {
>        case MSR_IA32_TSC: {
> -               struct vmcb *vmcb = get_host_vmcb(svm);
> -
> -               *data = vmcb->control.tsc_offset +
> +               *data = svm->vmcb->control.tsc_offset +
>                        svm_scale_tsc(vcpu, native_read_tsc());
>
>                break;
>


This is correct.  Now you properly return the correct MSR value for
the TSC to the guest in all cases.

However, there is a BIG PROBLEM (yes, it is that bad).  Sorry I did
not think of it before.

When the guest gets suspended and frozen by userspace, to be restarted
later, what qemu is going to do is come along and read all of the MSRs
as part of the saved state.  One of those happens to be the TSC MSR.

Since you can't guarantee suspend will take place when only L1 is
running, you may have mixed L1/L2 TSC MSRs now being returned to
userspace.  Now, when you resume this guest, you will have mixed L1/L2
TSC values written into the MSRs.

Those will almost certainly fail to be matched by the TSC offset
matching code, and thus, with multiple VCPUs, you will end up with
slightly unsynchronized TSCs, and with that, all the problems
associated with unstable TSC come back to haunt you again.  Basically,
all bets for stability are off.
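
To make that concrete, here is a small model - not kernel code, with
invented names - of what userspace reads per VCPU at save time if the TSC
MSR read reflects whichever level happens to be running:

#include <cstdint>

// Hypothetical model: the MSR read returns the host counter plus L1's
// offset, plus L2's offset as well if that VCPU is currently in L2.
uint64_t tsc_msr_read(uint64_t host_tsc, int64_t l1_offset,
                      bool l2_running, int64_t l2_offset) {
  return host_tsc + l1_offset + (l2_running ? l2_offset : 0);
}

// With vcpu0 in L1 and vcpu1 in L2 at suspend time, the two saved values
// differ by l2_offset; at resume, offset matching sees TSCs that look
// unsynchronized even though the guest never desynchronized them.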

Apart from recording the L1 and L2 TSC through separate MSRs, there
isn't a good way to solve this.  Walking through all the steps again,
I'm pretty sure this is why I thought it would be better for L0
kvm_get_msr() to return the L1 values even if L2 was running.

In the end, it may not be worth the hassle to support this mode of
operation, which, to our current knowledge, is in fact unused.  I would
much prefer to see a bad virtualization of a known insecure and unused
guest mode than to have a well-used feature such as L1 guest suspend /
resume continue to cause clock de-synchronization.

Zach


Re: [PATCH 2/3] Fix nested VMX TSC emulation

2011-08-03 Thread Zachary Amsden
Comments inline.  Sorry for top-posting.  Gmail is not my normal mode
of LKML processing, but hey.

On Tue, Aug 2, 2011 at 5:54 AM, Nadav Har'El  wrote:
> This patch fixes two corner cases in nested (L2) handling of TSC-related
> issues:
>
> 1. Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR to
> the TSC MSR without an exit, then this should set L1's TSC value itself - not
> offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code).
>
> 2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore
> the vmcs12.TSC_OFFSET.
>
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |   31 +--
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> --- .before/arch/x86/kvm/vmx.c  2011-08-02 15:51:02.0 +0300
> +++ .after/arch/x86/kvm/vmx.c   2011-08-02 15:51:02.0 +0300
> @@ -1777,15 +1777,23 @@ static void vmx_set_tsc_khz(struct kvm_v
>  */
>  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
> -       vmcs_write64(TSC_OFFSET, offset);
> -       if (is_guest_mode(vcpu))
> +       if (is_guest_mode(vcpu)) {
>                /*
> -                * We're here if L1 chose not to trap the TSC MSR. Since
> -                * prepare_vmcs12() does not copy tsc_offset, we need to also
> -                * set the vmcs12 field here.
> +                * We're here if L1 chose not to trap WRMSR to TSC. According
> +                * to the spec, this should set L1's TSC; The offset that L1
> +                * set for L2 remains unchanged, and still needs to be added
> +                * to the newly set TSC to get L2's TSC.
>                 */
> -               get_vmcs12(vcpu)->tsc_offset = offset -
> -                       to_vmx(vcpu)->nested.vmcs01_tsc_offset;
> +               struct vmcs12 *vmcs12;
> +               to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
> +               /* recalculate vmcs02.TSC_OFFSET: */
> +               vmcs12 = get_vmcs12(vcpu);
> +               vmcs_write64(TSC_OFFSET, offset +
> +                       (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> +                        vmcs12->tsc_offset : 0));
> +       } else {
> +               vmcs_write64(TSC_OFFSET, offset);
> +       }
>  }

This part looks good.

>  static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
> @@ -6529,8 +6537,11 @@ static void prepare_vmcs02(struct kvm_vc
>
>        set_cr4_guest_host_mask(vmx);
>
> -       vmcs_write64(TSC_OFFSET,
> -               vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +       if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
> +               vmcs_write64(TSC_OFFSET,
> +                       vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +       else
> +               vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);

I need more context here... where do you apply the adjustment?

The offset should be added to the vmcs01_tsc_offset only (but also
written into the hardware VMCS, which should not be preserved when the
guest exits).

>
>        if (enable_vpid) {
>                /*
> @@ -6937,7 +6948,7 @@ static void nested_vmx_vmexit(struct kvm
>
>        load_vmcs12_host_state(vcpu, vmcs12);
>
> -       /* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran 
> */
> +       /* Update TSC_OFFSET if TSC was changed while L2 ran */
>        vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>        /* This is needed for same reason as it was needed in prepare_vmcs02 */
>

This is correct.  You should always restore the L1 offset when exiting
if it might have changed.  This implies also that you must update
vmx->nested.vmcs01_tsc_offset if you receive a call to
vmx_adjust_tsc_offset while L2 is running, which is why I wanted to
see more context above.
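
A simplified sketch of that invariant, with stand-in types rather than the
real KVM structures:

#include <cstdint>

struct Vcpu {
  bool l2_running = false;
  int64_t vmcs01_tsc_offset = 0;  // L1's saved offset
  int64_t hw_tsc_offset = 0;      // stand-in for the hardware VMCS field
};

// An adjustment made while L2 runs must land in both the live hardware
// field and L1's saved copy, so the restore on L2 exit stays correct.
void adjust_tsc_offset(Vcpu& v, int64_t adj) {
  if (v.l2_running) v.vmcs01_tsc_offset += adj;
  v.hw_tsc_offset += adj;
}

// On L2 exit, restore exactly L1's (possibly adjusted) offset.
void on_l2_exit(Vcpu& v) { v.hw_tsc_offset = v.vmcs01_tsc_offset; }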

Zach


Re: [PATCH 1/3] L1 TSC handling

2011-08-03 Thread Zachary Amsden
ACK from me, but please check with Joerg on the SVM change.  I believe
the scaling is done properly, however it won't hurt to have a second
opinion.

Zach

On Tue, Aug 2, 2011 at 5:54 AM, Nadav Har'El  wrote:
> KVM assumed in several places that reading the TSC MSR returns the value for
> L1. This is incorrect, because when L2 is running, the correct TSC read exit
> emulation is to return L2's value.
>
> We therefore add a new x86_ops function, read_l1_tsc, to use in places that
> specifically need to read the L1 TSC, NOT the TSC of the current level of
> guest.
>
> Note that one change, of one line in kvm_arch_vcpu_load, is made redundant
> by a different patch sent by Zachary Amsden (and not yet applied):
> kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of
> course we didn't have to change the call of kvm_get_msr() to read_l1_tsc().
>
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/include/asm/kvm_host.h |    2 ++
>  arch/x86/kvm/svm.c              |    9 +
>  arch/x86/kvm/vmx.c              |   17 +
>  arch/x86/kvm/x86.c              |    8 
>  4 files changed, 32 insertions(+), 4 deletions(-)
>
> --- .before/arch/x86/include/asm/kvm_host.h     2011-08-02 15:51:02.0 
> +0300
> +++ .after/arch/x86/include/asm/kvm_host.h      2011-08-02 15:51:02.0 
> +0300
> @@ -636,6 +636,8 @@ struct kvm_x86_ops {
>                               struct x86_instruction_info *info,
>                               enum x86_intercept_stage stage);
>
> +       u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
> +
>        const struct trace_print_flags *exit_reasons_str;
>  };
>
> --- .before/arch/x86/kvm/vmx.c  2011-08-02 15:51:02.0 +0300
> +++ .after/arch/x86/kvm/vmx.c   2011-08-02 15:51:02.0 +0300
> @@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void)
>  }
>
>  /*
> + * Like guest_read_tsc, but always returns L1's notion of the timestamp
> + * counter, even if a nested guest (L2) is currently running.
> + */
> +u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +       u64 host_tsc, tsc_offset;
> +
> +       rdtscll(host_tsc);
> +       tsc_offset = is_guest_mode(vcpu) ?
> +               to_vmx(vcpu)->nested.vmcs01_tsc_offset :
> +               vmcs_read64(TSC_OFFSET);
> +       return host_tsc + tsc_offset;
> +}
> +
> +/*
>  * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
>  * ioctl. In this case the call-back should update internal vmx state to make
>  * the changes effective.
> @@ -7059,6 +7074,8 @@ static struct kvm_x86_ops vmx_x86_ops =
>        .set_tdp_cr3 = vmx_set_cr3,
>
>        .check_intercept = vmx_check_intercept,
> +
> +       .read_l1_tsc = vmx_read_l1_tsc,
>  };
>
>  static int __init vmx_init(void)
> --- .before/arch/x86/kvm/svm.c  2011-08-02 15:51:02.0 +0300
> +++ .after/arch/x86/kvm/svm.c   2011-08-02 15:51:02.0 +0300
> @@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct
>        return 0;
>  }
>
> +u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +       struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
> +       return vmcb->control.tsc_offset +
> +               svm_scale_tsc(vcpu, native_read_tsc());
> +}
> +
>  static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
>  {
>        struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops =
>        .set_tdp_cr3 = set_tdp_cr3,
>
>        .check_intercept = svm_check_intercept,
> +
> +       .read_l1_tsc = svm_read_l1_tsc,
>  };
>
>  static int __init svm_init(void)
> --- .before/arch/x86/kvm/x86.c  2011-08-02 15:51:02.0 +0300
> +++ .after/arch/x86/kvm/x86.c   2011-08-02 15:51:02.0 +0300
> @@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct
>
>        /* Keep irq disabled to prevent changes to the clock */
>        local_irq_save(flags);
> -       kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> +       tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
>        kernel_ns = get_kernel_ns();
>        this_tsc_khz = vcpu_tsc_khz(v);
>        if (unlikely(this_tsc_khz == 0)) {
> @@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>                s64 tsc_delta;
>                u64 tsc;
>
> -               kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
> +               tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>                tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
>                             tsc - vcpu->arch.last_guest_tsc;
>
> @@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
>  {
>        kvm_x86_ops->vcpu_put(vcpu

Re: Nested VMX - L1 hangs on running L2

2011-07-31 Thread Zachary Amsden
This patch looks good, with one comment noted inline below.  Are there
any other call sites for kvm_get_msr(), or functions that alias
kvm_get_msr(MSR_IA32_TSC)?

Did I miss the first patch?

Zach

On Sun, Jul 31, 2011 at 6:48 AM, Nadav Har'El  wrote:
> On Fri, Jul 29, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs 
> on running L2":
>>...
>> In that case, you need to distinguish between reads of the TSC MSR by
>> the guest and reads by the host (as done internally to track drift and
>>...
>> Unfortunately, the layering currently doesn't seem to allow for this,
>> and it looks like both vendor specific variants currently get this
>> wrong.
>>...
>> 1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
>> transform current uses of the code which does TSC compensation to use
>> this new API.  *Bonus* - no need to do double indirection through the
>> generic MSR code.
>
> Thank you so much for this analysis!
>
> I propose the following two patches. The first one just fixes two small
> bugs in the correct emulation in the VMX case, and the second patch solves
> the originally-reported bug as you outlined above:
>
> The code in x86.c which used to call *_get_msr() to get the TSC no longer
> does it - because *_get_msr() should only be called with the intention of
> reading the msr of the current guest (L1 or L2) and returning it to that
> guest.
>
> Instead, I added a new x86_op read_l1_tsc(vcpu), whose intention is to
> always return L1's notion of the current TSC - even if the current guest
> is L2. All calls in x86.c now use this new read_l1_tsc() function instead
> of reading the MSR - does this look correct to you?
>
>> read via RDMSR which is mis-virtualized (this bug exists today in the
>> SVM implementation if I am reading it correctly - cc'd Joerg to notify
>
> I believe that you're right. I created (in the patch below) svm.c's
> read_l1_tsc() with the same code you had in svm_get_msr(), but I think
> that in svm_get_msr() the code should be different in the nested case -
> I think you should use svm->vmcb, not get_host_vmcb()? I didn't add this svm.c
> fix to the patch, because I'm not sure at all that my understanding here is
> correct.
>
> Zachary, Marcelo, do these patches look right to you?
> Bandan, and others who can reproduce this bug - do these patches solve the
> bug?
>
>> So you are right, this is still wrong for the case in which L1 does
>> not trap TSC MSR reads.  Note however, the RDTSC instruction is still
>> virtualized properly, it is only the relatively rare actual TSC MSR
>> read via RDMSR which is mis-virtualized (this bug exists today in the
>> SVM implementation if I am reading it correctly - cc'd Joerg to notify
>> him of that).  That, combined with the relative importance of
>> supporting a guest which does not trap on these MSR reads suggest this
>> is a low priority design issue however (RDTSC still works even if the
>> MSR is trapped, correct?)
>
> Yes, RDTSC_EXITING is separate from the MSR bitmap. I agree that these
> are indeed obscure corner cases, but I think that it's still really confusing
> that a function called kvm_get_msr deliberately works incorrectly (returning
> always the L1 TSC) just so that it can fit some other uses. The SVM code 
> worked
> in the common case, but was still confusing why svm_get_msr() deliberately
> goes to great lengths to use L1's TSC_OFFSET even when L2 is running - when
> this is not how this MSR is really supposed to behave. Not to mention that one
> day, somebody *will* want to use these obscure settings and be surprise that
> they don't work ;-)
>
> Subject: [PATCH 1/2] nVMX: Fix nested TSC emulation
>
> This patch fixes two corner cases in nested (L2) handling of TSC-related
> exits:
>
> 1. Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR to
> the TSC MSR without an exit, then this should set L1's TSC value itself - not
> offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code).
>
> 2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore
> the vmcs12.TSC_OFFSET.
>
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |   31 +--
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> --- .before/arch/x86/kvm/vmx.c  2011-07-31 16:17:30.0 +0300
> +++ .after/arch/x86/kvm/vmx.c   2011-07-31 16:17:30.0 +0300
> @@ -1762,15 +1762,23 @@ static void vmx_set_tsc_khz(struct kvm_v
>  */
>  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
> -       vmcs_wr

Re: Nested VMX - L1 hangs on running L2

2011-07-29 Thread Zachary Amsden
2011/7/27 Nadav Har'El :
> On Wed, Jul 20, 2011, Zachary Amsden wrote about "Re: Nested VMX - L1 hangs 
> on running L2":
>> > > No, both patches are wrong.
>> >
>>
>> kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC,
>> regardless of the setting of any MSR bitmap. The reason why is that it
>> is being called by the L0 hypervisor kernel, which handles only
>> interactions with the L1 MSRs.
>
> guest_read_tsc() (called by the above get_msr) currently does this:
>
>        static u64 guest_read_tsc(void)
>        {
>                u64 host_tsc, tsc_offset;
>
>                rdtscll(host_tsc);
>                tsc_offset = vmcs_read64(TSC_OFFSET);
>                return host_tsc + tsc_offset;
>        }

That's wrong.  You should NEVER believe the offset written into the
hardware VMCS to be the current, real L1 TSC offset, as that is not an
invariant.

Instead, you should ALWAYS return the host TSC + the L1 TSC offset.
Sometimes, this may be the hardware value.

> I guess you'd want this to change to something like:
>
>                tsc_offset = is_guest_mode(vcpu) ?
>                        vmx->nested.vmcs01_tsc_offset :
>                        vmcs_read64(TSC_OFFSET);
>
> But I still am not convinced that that would be right

I believe this is correct.  But might it be cheaper to read from the
in-memory structure than the actual hardware VMCS?

> E.g., imagine the case where L1 uses TSC_OFFSETING and but doesn't
> trap TSC MSR read. The SDM says (if I understand it correctly) that this TSC
> MSR read will not exit (because L1 doesn't trap it) but *will* do the extra
> offsetting. In this case, the original code (using vmcs02's TSC_OFFSET which
> is the sum of that of vmcs01 and vmcs12), is correct, and the new code will
> be incorrect. Or am I misunderstanding the SDM?

In that case, you need to distinguish between reads of the TSC MSR by
the guest and reads by the host (as done internally to track drift and
compensation).  The code that needs to change isn't guest_read_tsc(),
that code must respect the invariant of only returning the L1 guest
TSC (in fact, that may be a good name change for the function).  What
needs to change is the actual code involved in the MSR read.  If it
determines that something other than the L1 guest is running, it needs
to ignore the hardware TSC offset and return the TSC as if read by the
L1 guest.
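
As a sketch of that distinction - invented names, for illustration only:

#include <cstdint>

enum class Reader { kGuest, kHost };

// The guest path virtualizes whatever level is running; the host path
// (drift tracking, compensation) must always see L1's TSC.
uint64_t ReadTscMsr(Reader who, bool l2_running, uint64_t host_tsc,
                    int64_t l1_offset, int64_t l2_offset) {
  if (who == Reader::kHost || !l2_running)
    return host_tsc + l1_offset;
  return host_tsc + l1_offset + l2_offset;
}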

Unfortunately, the layering currently doesn't seem to allow for this,
and it looks like both vendor specific variants currently get this
wrong.

The call stack:

kvm_get_msr()
kvm_x86_ops->get_msr()
vendor_get_msr()
vendor_guest_read_tsc()

offers no insight as to the intention of the caller.  Is it trying to
get the guest TSC to return to the guest, or is it trying to get the
guest TSC to calibrate / measure and compensate for TSC effects?

So you are right, this is still wrong for the case in which L1 does
not trap TSC MSR reads.  Note however, the RDTSC instruction is still
virtualized properly, it is only the relatively rare actual TSC MSR
read via RDMSR which is mis-virtualized (this bug exists today in the
SVM implementation if I am reading it correctly - cc'd Joerg to notify
him of that).  That, combined with the relative importance of
supporting a guest which does not trap on these MSR reads suggest this
is a low priority design issue however (RDTSC still works even if the
MSR is trapped, correct?)

If you want to go the extra mile to support such guests, the only
fully correct approach then is to do one of the following:

1) Add a new vendor-specific API call, vmx_x86_ops->get_L1_TSC(), and
transform current uses of the code which does TSC compensation to use
this new API.  *Bonus* - no need to do double indirection through the
generic MSR code.

or, alternatively:

2) Do not trap MSR reads of the TSC if the current L1 guest is not
trapping MSR reads of the TSC.  This is not possible if you cannot
enforce specific read vs. write permission in hardware - it may be
possible however, if you can trap all MSR writes regardless of the
permission bitmap.

> Can you tell me in which case the original code would returen incorrect
> results to a guest (L1 or L2) doing anything MSR-related?

It never returns incorrect values to a guest.  It does however return
incorrect values to the L0 hypervisor, which is expecting to do
arithmetic based on the L1 TSC value, and this fails catastrophically
when it receives values for other nested guests.

> I'm assuming that some code in KVM also uses kvm_read_msr and assumes it
> gets the TSC value for L1, not for the guest currently running (L2 or L1).

Exactly.

> I don't understand why it needs to assume that... Why would it be wrong to
> return L2's TSC, and just remember that *changing

Re: Nested VMX - L1 hangs on running L2

2011-07-20 Thread Zachary Amsden
On Wed, Jul 20, 2011 at 12:52 PM, Nadav Har'El  wrote:
>
> > No, both patches are wrong.
>
> Guys, thanks for looking into this bug. I'm afraid I'm still at a loss at
> why a TSC bug would even cause a guest lockup :(
>
> When Avi Kivity saw my nested TSC handling code he remarked "this is
> probably wrong". When I asked him where it was wrong, he basically said
> that he didn't know where, but TSC handling code is always wrong ;-)
> And it turns out he was right.
>
> > The correct fix is to make kvm_get_msr() return the L1 guest TSC at all
> times.
> >  We are serving the L1 guest in this hypervisor, not the L2 guest, and so
>
> > should never read its TSC for any reason.
> ...
> > allows the L2 guest to overwrite the L1 guest TSC, which at first seems
> wrong,
> > but is in fact the correct virtualization of a security hole in the L1
> guest.
>
> I think I'm beginning to see the error in my ways...
>
> When L1 lets L2 (using the MSR bitmap) direct read/write access to the TSC,
> it doesn't want L0 to be "clever" and give L2 its own separate TSC (like
> I do now), but rather gives it full control over L1's TSC - so reading or
> writing it should actually return L1's TSC, and the TSC_OFFSET in vmcs12
> is to be ignored.


>
> So basically, if I understand correctly, what I need to change is
> in prepare_vmcs02(), if the MSR_IA32_TSC is on the MSR bitmap (read?
> write?), instead of doing
>        vmcs_write64(TSC_OFFSET,
>                vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> I just need to do
>        vmcs_write64(TSC_OFFSET,
>                vmx->nested.vmcs01_tsc_offset);
> thereby giving L2 exactly the same TSC that L1 had.
> Bandan, if I remember correctly you once tried this sort of fix and
> it actually worked?
>
> Then, guest_read_tsc() will return (without need to change this code)
> the correct L1 TSC.
>

Not quite.

kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc) should always return the L1 TSC,
regardless of the setting of any MSR bitmap. The reason why is that it
is being called by the L0 hypervisor kernel, which handles only
interactions with the L1 MSRs.

If L1 wants to configure L2 to have its own private TSC (99.99% of use
cases), it should mask off access to the TSC MSR in the permission
bitmap. Then
when L2 reads or writes the MSR (not the TSC, which is protected by
the RDTSC instruction, and a different trap), an exit will be
generated to L0, which will read the permission bitmaps, determine that the
access is denied, and forward the exit to L1, which will deal with
reading or writing the L2 MSR. L0 NEVER deals directly with reading or
writing the L2 MSR in this scenario, how that is handled is supposed
to be determined by L1.

If L1 wants to configure L2 to be able to read and write the L1 TSC
instead, it should enable access to the TSC MSR in the MSR permission
bitmap.  Then when L2 reads or write the MSR, an exit will be
generated to L0, which will read the permission bitmaps, determine that the
access is allowed, and directly read or write the L1 TSC.  This
requires changing both the offset of the running L2 (in the active L0
VMCS) and the saved offset of the inactive L1.
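
As a decision sketch - pseudocode in spirit, far simpler than the real KVM
control flow:

enum class MsrExitAction { kForwardToL1, kAccessL1TscDirectly };

// L0's choice on an L2 exit for a TSC MSR access, per the two scenarios:
// masked means the exit is forwarded to L1, which emulates a private TSC
// for L2 however it likes; permitted means L0 itself reads or writes L1's
// TSC on L2's behalf (updating both the live and the saved L1 offset).
MsrExitAction HandleL2TscMsrExit(bool l1_bitmap_permits_l2) {
  return l1_bitmap_permits_l2 ? MsrExitAction::kAccessL1TscDirectly
                              : MsrExitAction::kForwardToL1;
}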

However, when computing the L2 tsc offset field to use for a running
guest, BOTH the L1 and L2 offsets should be added.  This is exactly
what nested hardware would do.  This implies that to be correct in the
second scenario, you need to forget the change you made to the offset
of the running L2 (in the active L0 VMCS) when the L2 guest stops
running.  You want it to be live on hardware, but not stored in any
saved TSC_OFFSET field for the L2 guest - it seems illogical, but it
is not.  The reason for that is that the L2 guest was updating ONLY
the L1 TSC, not the L2 TSC_OFFSET (which is still controlled by L1).

Note the following invariants: For N>1, L0 will never under any
circumstance set or adjust the L{N} TSC virtual VMCS field.  That is
only for the L{N-1} guest to do, and any attempt to do so by any other
than L{N-1} is not a correct virtualization.  L0 will only set the
hardware VMCS field to the offset required for virtualizing each level
L1-L{N}.

Also, the invariant for nested hardware:  the observed TSC of an L{N}
guest is equal to the hardware TSC, plus the TSC offset field for all
L1-L{N}.  If the hardware does not support nested virtualization, this
invariant must be guaranteed by the L0 hypervisor, which is simulating
nested hardware by adding up all of the nested offset fields.
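
Stated as a computation - a self-contained sketch of the invariant, not
KVM code:

#include <cstdint>
#include <vector>

// The TSC observed by an L{N} guest is the hardware TSC plus the offset
// of every enclosing level L1..L{N}.  On hardware without nested support,
// L0 simulates this by writing the whole sum into the one real TSC_OFFSET.
uint64_t ObservedTsc(uint64_t hw_tsc, const std::vector<int64_t>& offsets) {
  uint64_t tsc = hw_tsc;
  for (int64_t off : offsets)  // offsets[0] = L1's field, offsets[1] = L2's, ...
    tsc += static_cast<uint64_t>(off);
  return tsc;
}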

Neither of these facts involves checking the status of permission bits
in the MSR map.  That is only done for determining success or failure
of an MSR read or write, which may be forwarded to some L{N} as a
virtual exit or #GP fault.

Yes, it is tricky, but once you start thinking of it in simple
hardware terms, you will realize there is no need to be clever, just
precise.  If it seems counterintuitive to think about the second case,
consider the scenario when L2 and L1 are cooperative, and L1 sets the
L2 TSC_OFFSET to zero

Fwd: [KVM TSC emulation 9/9] Add software TSC emulation

2011-06-21 Thread Zachary Amsden



 Original Message 
Subject:[KVM TSC emulation 9/9] Add software TSC emulation
Date:   Mon, 20 Jun 2011 16:59:37 -0700
From:   Zachary Amsden 
To: 	Avi Kivity , Marcelo Tosatti , 
Glauber Costa , Frank Arnold , 
Joerg Roedel , Jan Kiszka 
, linux-...@vger.kernel.org, 
linux-ker...@vger.kernel.org, Zachary Amsden , Avi 
Kivity , Marcelo Tosatti , Glauber 
Costa , Frank Arnold , Joerg 
Roedel , Jan Kiszka , 
linux-...@vger.kernel.org
CC: 	Zachary Amsden , Zachary Amsden 





When hardware assistance is unavailable to scale the TSC, or it is
not possible to keep in sync, add a software virtualization mode
where the TSC is trapped and thus guaranteed to always have perfect
synchronization.

Currently this behavior defaults to on; how and when the decision to
use trapping is made is likely to be a matter of debate.  For now,
just make it possible.

Signed-off-by: Zachary Amsden
---
 arch/x86/kvm/svm.c |   26 +-
 arch/x86/kvm/vmx.c |   28 +++-
 arch/x86/kvm/x86.c |   34 +++---
 arch/x86/kvm/x86.h |5 +
 4 files changed, 80 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index dcab00e..fc4583d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -185,6 +185,7 @@ module_param(nested, int, S_IRUGO);

 static void svm_flush_tlb(struct kvm_vcpu *vcpu);
 static void svm_complete_interrupts(struct vcpu_svm *svm);
+static void svm_set_tsc_trapping(struct kvm_vcpu *vcpu, bool trap);

 static int nested_svm_exit_handled(struct vcpu_svm *svm);
 static int nested_svm_intercept(struct vcpu_svm *svm);
@@ -912,13 +913,18 @@ static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 
user_tsc_khz, bool scale)
u64 khz;

/* Guest TSC same frequency as host TSC? */
-   if (!scale) {
+   if (!scale && !check_tsc_unstable()) {
svm->tsc_ratio = TSC_RATIO_DEFAULT;
return;
}

/* TSC scaling supported? */
if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+   if (kvm_software_tsc) {
+   pr_debug("kvm: using TSC trapping\n");
+   svm_set_tsc_trapping(vcpu, true);
+   return;
+   }
	if (user_tsc_khz > tsc_khz) {
vcpu->arch.tsc_catchup = 1;
vcpu->arch.tsc_always_catchup = 1;
@@ -1184,6 +1190,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, 
unsigned int id)
	svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
svm->asid_generation = 0;
init_vmcb(svm);
+   kvm_set_tsc_khz(&svm->vcpu, kvm_max_tsc_khz);
kvm_write_tsc(&svm->vcpu, 0);

err = fx_init(&svm->vcpu);
@@ -1303,6 +1310,15 @@ static void svm_clear_vintr(struct vcpu_svm *svm)
clr_intercept(svm, INTERCEPT_VINTR);
 }

+static void svm_set_tsc_trapping(struct kvm_vcpu *vcpu, bool trap)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+   if (trap)
+   set_intercept(svm, INTERCEPT_RDTSC);
+   else
+   clr_intercept(svm, INTERCEPT_RDTSC);
+}
+
 static struct vmcb_seg *svm_seg(struct kvm_vcpu *vcpu, int seg)
 {
	struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save;
@@ -2732,6 +2748,13 @@ static int task_switch_interception(struct vcpu_svm *svm)
return 1;
 }

+static int rdtsc_interception(struct vcpu_svm *svm)
+{
+   svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
+   kvm_read_tsc(&svm->vcpu);
+   return 1;
+}
+
 static int cpuid_interception(struct vcpu_svm *svm)
 {
svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
@@ -3178,6 +3201,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm) = 
{
[SVM_EXIT_SMI]  = nop_on_interception,
[SVM_EXIT_INIT] = nop_on_interception,
[SVM_EXIT_VINTR]= interrupt_window_interception,
+   [SVM_EXIT_RDTSC]= rdtsc_interception,
[SVM_EXIT_CPUID]= cpuid_interception,
[SVM_EXIT_IRET] = iret_interception,
[SVM_EXIT_INVD] = emulate_on_interception,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 780fe12..65066b4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -606,6 +606,7 @@ static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
 static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
+static void vmx_set_tsc_trapping(struct kvm_vcpu *vcpu, bool trap);

 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
 static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -1756,9 +1757,14 @@ static u64 guest_read_tsc(void)
  */
 static void vmx_set_tsc

Fwd: [KVM TSC emulation 8/9] Track TSC synchronization in generations

2011-06-21 Thread Zachary Amsden



 Original Message 
Subject:[KVM TSC emulation 8/9] Track TSC synchronization in generations
Date:   Mon, 20 Jun 2011 16:59:36 -0700
From:   Zachary Amsden 
To: 	Avi Kivity , Marcelo Tosatti , 
Glauber Costa , Frank Arnold , 
Joerg Roedel , Jan Kiszka 
, linux-...@vger.kernel.org, 
linux-ker...@vger.kernel.org, Zachary Amsden , Avi 
Kivity , Marcelo Tosatti , Glauber 
Costa , Frank Arnold , Joerg 
Roedel , Jan Kiszka , 
linux-...@vger.kernel.org
CC: 	Zachary Amsden , Zachary Amsden 





This allows us to track the original nanosecond and counter values
at each phase of TSC writing by the guest.  This gets us perfect
offset matching for stable TSC systems, and perfect software
computed TSC matching for machines with unstable TSC.

Signed-off-by: Zachary Amsden
---
 arch/x86/include/asm/kvm_host.h |   10 ++--
 arch/x86/kvm/x86.c  |   41 +++---
 2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 226897f..198ce8b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -391,10 +391,11 @@ struct kvm_vcpu_arch {
struct page *time_page;
u64 last_guest_tsc;
u64 last_kernel_ns;
-   u64 last_tsc_nsec;
-   u64 last_tsc_write;
u64 last_host_tsc;
u64 tsc_offset_adjustment;
+   u64 this_tsc_nsec;
+   u64 this_tsc_write;
+   u8  this_tsc_generation;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
@@ -468,9 +469,12 @@ struct kvm_arch {
s64 kvmclock_offset;
raw_spinlock_t tsc_write_lock;
u64 last_tsc_nsec;
-   u64 last_tsc_offset;
u64 last_tsc_write;
u32 last_tsc_khz;
+   u64 cur_tsc_nsec;
+   u64 cur_tsc_write;
+   u64 cur_tsc_offset;
+   u8  cur_tsc_generation;

struct kvm_xen_hvm_config xen_hvm_config;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cb8876b..09e67fb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1029,10 +1029,10 @@ static void kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 
this_tsc_khz)

 static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 {
-   u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.last_tsc_nsec,
+   u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
  vcpu->arch.virtual_tsc_mult,
  vcpu->arch.virtual_tsc_shift);
-   tsc += vcpu->arch.last_tsc_write;
+   tsc += vcpu->arch.this_tsc_write;
return tsc;
 }

@@ -1068,7 +1068,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
	if (nsdiff < NSEC_PER_SEC &&
vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
if (!check_tsc_unstable()) {
-   offset = kvm->arch.last_tsc_offset;
+   offset = kvm->arch.cur_tsc_offset;
pr_debug("kvm: matched tsc offset for %llu\n", data);
} else {
u64 delta = nsec_to_cycles(vcpu, elapsed);
@@ -1076,20 +1076,45 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
+   } else {
+   /*
+* We split periods of matched TSC writes into generations.
+* For each generation, we track the original measured
+* nanosecond time, offset, and write, so if TSCs are in
+* sync, we can match exact offset, and if not, we can match
+* exact software computation in compute_guest_tsc()
+*
+* These values are tracked in kvm->arch.cur_xxx variables.
+*/
+   kvm->arch.cur_tsc_generation++;
+   kvm->arch.cur_tsc_nsec = ns;
+   kvm->arch.cur_tsc_write = data;
+   kvm->arch.cur_tsc_offset = offset;
+   pr_debug("kvm: new tsc generation %u, clock %llu\n",
+kvm->arch.cur_tsc_generation, data);
}
+
+   /*
+* We also track the most recent recorded KHZ, write, and time to
+* allow the matching interval to be extended at each write.
+*/
kvm->arch.last_tsc_nsec = ns;
kvm->arch.last_tsc_write = data;
-   kvm->arch.last_tsc_offset = offset;
kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
-   kvm_x86_ops->write_tsc_offset(vcpu, offset);
-   raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);

/* Reset of TSC must disable overshoot protection below */
vcpu->arch.hv_clock.tsc_timestamp = 0;
-

Fwd: [KVM TSC emulation 7/9] Don't mark TSC unstable due to S4 suspend

2011-06-21 Thread Zachary Amsden



 Original Message 
Subject:[KVM TSC emulation 7/9] Don't mark TSC unstable due to S4 
suspend
Date:   Mon, 20 Jun 2011 16:59:35 -0700
From:   Zachary Amsden 
To: 	Avi Kivity , Marcelo Tosatti , 
Glauber Costa , Frank Arnold , 
Joerg Roedel , Jan Kiszka 
, linux-...@vger.kernel.org, 
linux-ker...@vger.kernel.org, Zachary Amsden , Avi 
Kivity , Marcelo Tosatti , Glauber 
Costa , Frank Arnold , Joerg 
Roedel , Jan Kiszka , 
linux-...@vger.kernel.org
CC: 	Zachary Amsden , Zachary Amsden 





During a host suspend, the TSC may go backwards, which KVM interprets
as an unstable TSC.  Technically, KVM should not be marking the
TSC unstable (doing so causes the TSC clocksource to go bad); instead, we
need to be adjusting the TSC offsets in such a case.

Dealing with this issue is a little tricky as the only place we
can reliably do it is before much of the timekeeping infrastructure
is up and running.  On top of this, we are not in a KVM thread
context, so we may not be able to safely access VCPU fields.
Instead, we compute our best known hardware offset at power-up and
stash it to be applied to all VCPUs when they actually start running.
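
The shape of the mechanism, as a sketch with invented names (the real code
lives in hardware_enable and vcpu_load below):

#include <cstdint>
#include <vector>

struct VcpuState { uint64_t tsc_offset = 0; int64_t pending_adjustment = 0; };

// At power-up / resume, record one accumulated adjustment per VCPU...
void StashBackwardsTscFix(std::vector<VcpuState>& vcpus, int64_t delta) {
  for (VcpuState& v : vcpus) v.pending_adjustment += delta;
}

// ...and apply it lazily the next time each VCPU is actually loaded,
// where touching that VCPU's state is safe.
void OnVcpuLoad(VcpuState& v) {
  v.tsc_offset += static_cast<uint64_t>(v.pending_adjustment);
  v.pending_adjustment = 0;
}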

Signed-off-by: Zachary Amsden
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/x86.c  |   93 --
 2 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c2854da..226897f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -394,6 +394,7 @@ struct kvm_vcpu_arch {
u64 last_tsc_nsec;
u64 last_tsc_write;
u64 last_host_tsc;
+   u64 tsc_offset_adjustment;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 10950f7..cb8876b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2152,6 +2152,14 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
}

kvm_x86_ops->vcpu_load(vcpu, cpu);
+
+   /* Apply any externally detected TSC adjustments (due to suspend) */
+   if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
+   adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
+   vcpu->arch.tsc_offset_adjustment = 0;
+   set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+   }
+
if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
/* Make sure TSC doesn't go backwards */
s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
@@ -6221,13 +6229,88 @@ int kvm_arch_hardware_enable(void *garbage)
struct kvm *kvm;
struct kvm_vcpu *vcpu;
int i;
+   int ret;
+   u64 local_tsc;
+   u64 max_tsc = 0;
+   bool stable, backwards_tsc = false;

kvm_shared_msr_cpu_online();
-   list_for_each_entry(kvm, &vm_list, vm_list)
-   kvm_for_each_vcpu(i, vcpu, kvm)
-   if (vcpu->cpu == smp_processor_id())
-   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-   return kvm_x86_ops->hardware_enable(garbage);
+   ret = kvm_x86_ops->hardware_enable(garbage);
+   if (ret != 0)
+   return ret;
+
+   local_tsc = native_read_tsc();
+   stable = !check_tsc_unstable();
+   list_for_each_entry(kvm, &vm_list, vm_list) {
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   if (!stable && vcpu->cpu == smp_processor_id())
+   set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+   if (stable && vcpu->arch.last_host_tsc > local_tsc) {
+   backwards_tsc = true;
+   if (vcpu->arch.last_host_tsc > max_tsc)
+   max_tsc = vcpu->arch.last_host_tsc;
+   }
+   }
+   }
+
+   /*
+* Sometimes, even reliable TSCs go backwards.  This happens on
+* platforms that reset TSC during suspend or hibernate actions, but
+* maintain synchronization.  We must compensate.  Fortunately, we can
+* detect that condition here, which happens early in CPU bringup,
+* before any KVM threads can be running.  Unfortunately, we can't
+* bring the TSCs fully up to date with real time, as we aren't yet far
+* enough into CPU bringup that we know how much real time has actually
+* elapsed; our helper function, get_kernel_ns() will be using boot
+* variables that haven't been updated yet.
+*
+* So we simply find the maximum observed TSC above, then record the
+* adjustment to TSC in each VCPU.  When the VCPU later gets loaded,
+* the adjustment will be applied.  Note that we accumulate
+* adjustments, 

Fwd: [KVM TSC emulation 6/9] Allow adjust_tsc_offset to be in host or guest cycles

2011-06-21 Thread Zachary Amsden



 Original Message 
Subject: 	[KVM TSC emulation 6/9] Allow adjust_tsc_offset to be in host 
or guest cycles

Date:   Mon, 20 Jun 2011 16:59:34 -0700
From:   Zachary Amsden 
To: 	Avi Kivity , Marcelo Tosatti , 
Glauber Costa , Frank Arnold , 
Joerg Roedel , Jan Kiszka 
, linux-...@vger.kernel.org, 
linux-ker...@vger.kernel.org, Zachary Amsden , Avi 
Kivity , Marcelo Tosatti , Glauber 
Costa , Frank Arnold , Joerg 
Roedel , Jan Kiszka , 
linux-...@vger.kernel.org
CC: 	Zachary Amsden , Zachary Amsden 





Redefine the API to take a parameter indicating whether an
adjustment is in host or guest cycles.

Signed-off-by: Zachary Amsden
---
 arch/x86/include/asm/kvm_host.h |   13 -
 arch/x86/kvm/svm.c  |6 +-
 arch/x86/kvm/vmx.c  |2 +-
 arch/x86/kvm/x86.c  |2 +-
 4 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3b9fdb5..c2854da 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -599,7 +599,7 @@ struct kvm_x86_ops {
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
int (*get_lpage_level)(void);
bool (*rdtscp_supported)(void);
-   void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
+   void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment, bool 
host);

void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);

@@ -630,6 +630,17 @@ struct kvm_arch_async_pf {

 extern struct kvm_x86_ops *kvm_x86_ops;

+static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
+  s64 adjustment)
+{
+   kvm_x86_ops->adjust_tsc_offset(vcpu, adjustment, false);
+}
+
+static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 
adjustment)
+{
+   kvm_x86_ops->adjust_tsc_offset(vcpu, adjustment, true);
+}
+
 int kvm_mmu_module_init(void);
 void kvm_mmu_module_exit(void);

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 47f557e..dcab00e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -957,10 +957,14 @@ static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, 
u64 offset)
mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
 }

-static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool 
host)
 {
struct vcpu_svm *svm = to_svm(vcpu);

+   WARN_ON(adjustment < 0);
+   if (host)
+   adjustment = svm_scale_tsc(vcpu, adjustment);
+
svm->vmcb->control.tsc_offset += adjustment;
if (is_guest_mode(vcpu))
svm->nested.hsave->control.tsc_offset += adjustment;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bc3ecfd..780fe12 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1782,7 +1782,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, 
u64 offset)
to_vmx(vcpu)->nested.vmcs01_tsc_offset;
 }

-static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool 
host)
 {
u64 offset = vmcs_read64(TSC_OFFSET);
vmcs_write64(TSC_OFFSET, offset + adjustment);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8fe988a..10950f7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1125,7 +1125,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
if (vcpu->tsc_catchup) {
u64 tsc = compute_guest_tsc(v, kernel_ns);
	if (tsc > tsc_timestamp) {
-   kvm_x86_ops->adjust_tsc_offset(v, tsc - tsc_timestamp);
+   adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
tsc_timestamp = tsc;
}
}
--
1.7.1




Fwd: [KVM TSC emulation 5/9] Add last_host_tsc tracking back to KVM

2011-06-21 Thread Zachary Amsden



 Original Message 
Subject:[KVM TSC emulation 5/9] Add last_host_tsc tracking back to KVM
Date:   Mon, 20 Jun 2011 16:59:33 -0700
From:   Zachary Amsden 
To: 	Avi Kivity , Marcelo Tosatti , 
Glauber Costa , Frank Arnold , 
Joerg Roedel , Jan Kiszka 
, linux-...@vger.kernel.org, 
linux-ker...@vger.kernel.org, Zachary Amsden , Avi 
Kivity , Marcelo Tosatti , Glauber 
Costa , Frank Arnold , Joerg 
Roedel , Jan Kiszka , 
linux-...@vger.kernel.org
CC: 	Zachary Amsden , Zachary Amsden 





The variable last_host_tsc was removed from upstream code.  I am adding
it back for two reasons.  First, it is unnecessary to use guest TSC
computation to infer information about the host TSC.  The guest may
set the TSC backwards (a case handled by the previous patch), but
computing the guest TSC (and fetching an MSR) is significantly more
work and complexity than simply reading the hardware counter.  In addition,
we don't actually need the guest TSC for any part of the computation;
by always recomputing the offset, we can eliminate the need to deal with
the current offset and any scaling factors that may apply.

The second reason is that later on, we are going to be using the host
TSC value to restore TSC offsets after a host S4 suspend, so we need to
be reading the host values, not the guest values here.

Signed-off-by: Zachary Amsden
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/x86.c  |9 +++--
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 865f051..3b9fdb5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -393,6 +393,7 @@ struct kvm_vcpu_arch {
u64 last_kernel_ns;
u64 last_tsc_nsec;
u64 last_tsc_write;
+   u64 last_host_tsc;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3bd2ca3..8fe988a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2154,11 +2154,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_x86_ops->vcpu_load(vcpu, cpu);
if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
/* Make sure TSC doesn't go backwards */
-   s64 tsc_delta;
-   u64 tsc;
-
-   kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
-   tsc_delta = tsc - vcpu->arch.last_guest_tsc;
+   s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
+   native_read_tsc() - vcpu->arch.last_host_tsc;
	if (tsc_delta < 0)
mark_tsc_unstable("KVM discovered backwards TSC");
if (check_tsc_unstable()) {
@@ -2178,7 +2175,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
-   kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+   vcpu->arch.last_host_tsc = native_read_tsc();
 }

 static int is_efer_nx(void)
--
1.7.1




Fwd: [KVM TSC emulation 4/9] Fix last_guest_tsc / tsc_offset semantics

2011-06-21 Thread Zachary Amsden



 Original Message 
Subject:[KVM TSC emulation 4/9] Fix last_guest_tsc / tsc_offset 
semantics
Date:   Mon, 20 Jun 2011 16:59:32 -0700
From:   Zachary Amsden 
To: 	Avi Kivity , Marcelo Tosatti , 
Glauber Costa , Frank Arnold , 
Joerg Roedel , Jan Kiszka 
, linux-...@vger.kernel.org, 
linux-ker...@vger.kernel.org, Zachary Amsden , Avi 
Kivity , Marcelo Tosatti , Glauber 
Costa , Frank Arnold , Joerg 
Roedel , Jan Kiszka , 
linux-...@vger.kernel.org
CC: 	Zachary Amsden , Zachary Amsden 





The variable last_guest_tsc was being used as an ad-hoc indicator
that the guest TSC has been initialized and recorded correctly.  However,
it may not have been; it could be that the guest TSC was set to some
large value, then back to a small value (by, say, a software reboot).

This defeats the logic and causes KVM to falsely assume that the
guest TSC has gone backwards, marking the host TSC unstable, which
is undesirable behavior.

In addition, rather than try to compute an offset adjustment for the
TSC on unstable platforms, just recompute the whole offset.  This
allows us to get rid of one callsite for adjust_tsc_offset, which
is problematic because the units it takes are in guest units, but
here, the computation was originally being done in host units.

Doing this, and also recording last_guest_tsc when the TSC is written
allow us to remove the tricky logic which depended on last_guest_tsc
being zero to indicate a reset of uninitialized value.

Instead, we now have the guarantee that the guest TSC offset is
always at least something which will get us last_guest_tsc.

Signed-off-by: Zachary Amsden
---
 arch/x86/kvm/x86.c |   11 ++-
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2176714..3bd2ca3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1088,6 +1088,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
vcpu->arch.hv_clock.tsc_timestamp = 0;
vcpu->arch.last_tsc_write = data;
vcpu->arch.last_tsc_nsec = ns;
+   vcpu->arch.last_guest_tsc = data;
 }
 EXPORT_SYMBOL_GPL(kvm_write_tsc);

@@ -1156,7 +1157,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 * observed by the guest and ensure the new system time is greater.
 */
max_kernel_ns = 0;
-   if (vcpu->hv_clock.tsc_timestamp&&  vcpu->last_guest_tsc) {
+   if (vcpu->hv_clock.tsc_timestamp) {
max_kernel_ns = vcpu->last_guest_tsc -
vcpu->hv_clock.tsc_timestamp;
max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
@@ -2157,13 +2158,13 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
u64 tsc;

kvm_get_msr(vcpu, MSR_IA32_TSC,&tsc);
-   tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
-tsc - vcpu->arch.last_guest_tsc;
-
+   tsc_delta = tsc - vcpu->arch.last_guest_tsc;
if (tsc_delta<  0)
mark_tsc_unstable("KVM discovered backwards TSC");
if (check_tsc_unstable()) {
-   kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
+   u64 offset = kvm_x86_ops->compute_tsc_offset(vcpu,
+   vcpu->arch.last_guest_tsc);
+   kvm_x86_ops->write_tsc_offset(vcpu, offset);
vcpu->arch.tsc_catchup = 1;
}
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
--
1.7.1




Fwd: [KVM TSC emulation 3/9] Leave TSC synchronization window open with each new sync

2011-06-21 Thread Zachary Amsden



 Original Message 
Subject: 	[KVM TSC emulation 3/9] Leave TSC synchronization window open 
with each new sync

Date:   Mon, 20 Jun 2011 16:59:31 -0700
From:   Zachary Amsden 
To: 	Avi Kivity , Marcelo Tosatti , 
Glauber Costa , Frank Arnold , 
Joerg Roedel , Jan Kiszka 
, linux-...@vger.kernel.org, 
linux-ker...@vger.kernel.org, Zachary Amsden , Avi 
Kivity , Marcelo Tosatti , Glauber 
Costa , Frank Arnold , Joerg 
Roedel , Jan Kiszka , 
linux-...@vger.kernel.org
CC: 	Zachary Amsden , Zachary Amsden 





Currently, when the TSC is written by the guest, the variable
ns is updated to force the current write to appear to have taken
place at the time of the first write in this sync phase.  This
leaves a cliff at the end of the match window where updates will
fall off the end.  There are two scenarios where this can be a
problem in practice - first, on a system with a large number of
VCPUs, the sync period may last for an extended period of time.

The second way this can happen is if the VM reboots very rapidly
and we catch a VCPU TSC synchronization just around the edge.
We may be unaware of the reboot, and thus the first VCPU might
synchronize with an old setting of the timer (at, say, 0.97 seconds
ago, when first powered on).  The second VCPU can come in 0.04
seconds later to try to synchronize, but it misses the window
because it is just over the threshold.

Instead, stop doing this artificial setback of the ns variable
and just update it with every write of the TSC.

It may be observed that doing so causes values computed by
compute_guest_tsc to diverge slightly across CPUs - note that
the last_tsc_ns and last_tsc_write variables are used here, and
now last_tsc_ns will be different for each VCPU, reflecting
the actual time of the update.

However, compute_guest_tsc is used only for guests which already
have TSC stability issues, and further, note that the previous
patch has caused last_tsc_write to be incremented by the difference
in nanoseconds, converted back into guest cycles.  As such, only
boundary rounding errors should be visible, which given the
resolution in nanoseconds, is going to only be a few cycles and
only visible in cross-CPU consistency tests.  The problem can be
fixed by adding a new set of variables to track the start offset
and start write value for the current sync cycle.
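
(Hypothetically, such per-sync-cycle tracking could amount to two extra
fields in struct kvm_arch - the names below are illustrative, not part
of this patch:

        u64 cur_tsc_nsec;       /* kernel ns at the start of this sync cycle */
        u64 cur_tsc_write;      /* guest TSC value written at cycle start */

leaving last_tsc_nsec/last_tsc_write free to track the most recent
write.)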

Signed-off-by: Zachary Amsden
---
 arch/x86/kvm/x86.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 457bd79..2176714 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1076,7 +1076,6 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
-   ns = kvm->arch.last_tsc_nsec;
}
kvm->arch.last_tsc_nsec = ns;
kvm->arch.last_tsc_write = data;
--
1.7.1




Fwd: [KVM TSC emulation 2/9] Improve TSC offset matching

2011-06-21 Thread Zachary Amsden



 Original Message 
Subject:[KVM TSC emulation 2/9] Improve TSC offset matching
Date:   Mon, 20 Jun 2011 16:59:30 -0700
From:   Zachary Amsden 
To: 	Avi Kivity , Marcelo Tosatti , 
Glauber Costa , Frank Arnold , 
Joerg Roedel , Jan Kiszka 
, linux-...@vger.kernel.org, 
linux-ker...@vger.kernel.org, Zachary Amsden , Avi 
Kivity , Marcelo Tosatti , Glauber 
Costa , Frank Arnold , Joerg 
Roedel , Jan Kiszka , 
linux-...@vger.kernel.org
CC: 	Zachary Amsden , Zachary Amsden 





There are a few improvements that can be made to the TSC offset
matching code.  First, we don't need to call the 128-bit multiply
(especially on a constant number); the code is much cleaner when
the computation is done in nanosecond units.

Second, the way everything is set up with software TSC rate scaling,
we currently have per-cpu rates.  Obviously this isn't too desirable
to use in practice, but if for some reason we do change the rate of
all VCPUs at runtime, then reset the TSCs, we will only want to
match offsets for VCPUs running at the same rate.

Finally, for the case where we have an unstable host TSC, but
rate scaling is being done in hardware, we should call the platform
code to compute the TSC offset, so the math is reorganized to recompute
the base instead, then transform the base into an offset using the
existing API.

Signed-off-by: Zachary Amsden
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/x86.c  |   37 ++---
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 184cd38..865f051 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -468,6 +468,7 @@ struct kvm_arch {
u64 last_tsc_nsec;
u64 last_tsc_offset;
u64 last_tsc_write;
+   u32 last_tsc_khz;

struct kvm_xen_hvm_config xen_hvm_config;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 776895a..457bd79 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1041,33 +1041,39 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
struct kvm *kvm = vcpu->kvm;
u64 offset, ns, elapsed;
unsigned long flags;
-   s64 sdiff;
+   s64 nsdiff;

raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
ns = get_kernel_ns();
elapsed = ns - kvm->arch.last_tsc_nsec;
-   sdiff = data - kvm->arch.last_tsc_write;
-   if (sdiff<  0)
-   sdiff = -sdiff;
+
+   /* n.b - signed multiplication and division required */
+   nsdiff = data - kvm->arch.last_tsc_write;
+   nsdiff = (nsdiff * 1000) / vcpu->arch.virtual_tsc_khz;
+   nsdiff -= elapsed;
+   if (nsdiff<  0)
+   nsdiff = -nsdiff;

/*
-* Special case: close write to TSC within 5 seconds of
-* another CPU is interpreted as an attempt to synchronize
-* The 5 seconds is to accommodate host load / swapping as
-* well as any reset of TSC during the boot process.
-*
-* In that case, for a reliable TSC, we can match TSC offsets,
-* or make a best guest using elapsed value.
-*/
-   if (sdiff<  nsec_to_cycles(vcpu, 5ULL * NSEC_PER_SEC)&&
-   elapsed<  5ULL * NSEC_PER_SEC) {
+* Special case: TSC write with a small delta (1 second) of virtual
+* cycle time against real time is interpreted as an attempt to
+* synchronize the CPU.
+ *
+* For a reliable TSC, we can match TSC offsets, and for an unstable
+* TSC, we add elapsed time in this computation.  We could let the
+* compensation code attempt to catch up if we fall behind, but
+* it's better to try to match offsets from the beginning.
+ */
+   if (nsdiff<  NSEC_PER_SEC&&
+   vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
if (!check_tsc_unstable()) {
offset = kvm->arch.last_tsc_offset;
pr_debug("kvm: matched tsc offset for %llu\n", data);
} else {
u64 delta = nsec_to_cycles(vcpu, elapsed);
-   offset += delta;
+   data += delta;
+   offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
ns = kvm->arch.last_tsc_nsec;
@@ -1075,6 +1081,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
kvm->arch.last_tsc_nsec = ns;
kvm->arch.last_tsc_write = data;
kvm->arch.last_tsc_offset = offset;
+   kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
kvm_x86_ops->write_tsc_offset(vcpu, offset);
 

Fwd: [KVM TSC emulation 1/9] Infrastructure for software and hardware based TSC rate scaling

2011-06-21 Thread Zachary Amsden



 Original Message 
Subject: 	[KVM TSC emulation 1/9] Infrastructure for software and 
hardware based TSC rate scaling

Date:   Mon, 20 Jun 2011 16:59:29 -0700
From:   Zachary Amsden 
To: 	Avi Kivity , Marcelo Tosatti , 
Glauber Costa , Frank Arnold , 
Joerg Roedel , Jan Kiszka 
, linux-...@vger.kernel.org, 
linux-ker...@vger.kernel.org, Zachary Amsden , Avi 
Kivity , Marcelo Tosatti , Glauber 
Costa , Frank Arnold , Joerg 
Roedel , Jan Kiszka , 
linux-...@vger.kernel.org
CC: 	Zachary Amsden , Zachary Amsden 





This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate.  Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use it, and updates all rate variables used by
common code.  This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.

There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed.  Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used.  The default is 250ppm, which is half the
threshold for NTP adjustment, allowing for some hardware variation.
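
As an illustration of the threshold, a minimal user-space sketch of
such a tolerance test (only the tsc_tolerance_ppm name and the 250ppm
default come from this patch; the helper itself is hypothetical):

#include <stdint.h>
#include <stdbool.h>

static uint32_t tsc_tolerance_ppm = 250;   /* half the NTP threshold */

/* Scale only if the requested guest rate differs from the measured
 * host rate by more than the tolerance, so small measurement jitter
 * in tsc_khz does not needlessly provoke rate scaling. */
static bool tsc_needs_scaling(uint32_t user_tsc_khz, uint32_t host_tsc_khz)
{
        int64_t diff = (int64_t)user_tsc_khz - (int64_t)host_tsc_khz;

        if (diff < 0)
                diff = -diff;
        /* diff / host > ppm / 1e6  <=>  diff * 1e6 > host * ppm */
        return diff * 1000000 > (int64_t)host_tsc_khz * tsc_tolerance_ppm;
}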

In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.

Signed-off-by: Zachary Amsden
---
 arch/x86/include/asm/kvm_host.h |9 ++--
 arch/x86/kvm/svm.c  |   20 ++
 arch/x86/kvm/vmx.c  |   16 +--
 arch/x86/kvm/x86.c  |   82 ++
 include/linux/kvm.h |1 +
 5 files changed, 68 insertions(+), 60 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da6bbee..184cd38 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -393,10 +393,11 @@ struct kvm_vcpu_arch {
u64 last_kernel_ns;
u64 last_tsc_nsec;
u64 last_tsc_write;
-   u32 virtual_tsc_khz;
bool tsc_catchup;
-   u32  tsc_catchup_mult;
-   s8   tsc_catchup_shift;
+   bool tsc_always_catchup;
+   s8 virtual_tsc_shift;
+   u32 virtual_tsc_mult;
+   u32 virtual_tsc_khz;

bool nmi_pending;
bool nmi_injected;
@@ -604,7 +605,7 @@ struct kvm_x86_ops {

bool (*has_wbinvd_exit)(void);

-   void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz);
+   void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool 
scale);
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);

u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 475d1c9..47f557e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -905,20 +905,25 @@ static u64 svm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc)
return _tsc;
 }

-static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
+static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool 
scale)
 {
struct vcpu_svm *svm = to_svm(vcpu);
u64 ratio;
u64 khz;

-   /* TSC scaling supported? */
-   if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR))
+   /* Guest TSC same frequency as host TSC? */
+   if (!scale) {
+   svm->tsc_ratio = TSC_RATIO_DEFAULT;
return;
+   }

-   /* TSC-Scaling disabled or guest TSC same frequency as host TSC? */
-   if (user_tsc_khz == 0) {
-   vcpu->arch.virtual_tsc_khz = 0;
-   svm->tsc_ratio = TSC_RATIO_DEFAULT;
+   /* TSC scaling supported? */
+   if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+   if (user_tsc_khz>  tsc_khz) {
+   vcpu->arch.tsc_catchup = 1;
+   vcpu->arch.tsc_always_catchup = 1;
+   } else
+   WARN(1, "user requested TSC rate below hardware 
speed\n");
return;
}

@@ -933,7 +938,6 @@ static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 
user_tsc_khz)
user_tsc_khz);
return;
}
-   vcpu->arch.virtual_tsc_khz = user_tsc_khz;
svm->tsc_ratio = ratio;
 }

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f5b49c7..bc3ecfd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1751,13 +1751,19 @@ static u64 guest_read_tsc(void)
 }

 /*
- * Empty call-back. Needs to be implemented when VM

Re: KVM: Remaining body of TSC emulation work

2011-06-21 Thread Zachary Amsden

On 06/21/2011 01:50 AM, Jan Kiszka wrote:

On 2011-06-21 01:59, Zachary Amsden wrote:
   


This is the remaining bulk of work I have related to TSC emulation.
In summary, I believe this fixes all known issues with TSC.  A few
rather subtle issues are cleaned up, S4 suspend is fixed, and the
API for adjust_tsc_offset is expanded to take arguments in either
guest cycles or optionally host cycles.  The final patch adds
software TSC emulation, with a configurable default setting.

Note that TSC trapping will only be used when other methods fail
due to unstable TSC or lack of scaling hardware.  With the improved
offset matching, I was even able to get SMP guests to boot with a
TSC clock; cycle testing showed a maximum backwards leap of around
0.25 ms, which is actually fairly good.  With the last patch applied,
software TSC emulation kicks in and the backwards TSC count even on
my broken hardware dropped to zero.

Some of this code (the S4 suspend compensation) has already been
ported into RHEL to fix various bugs - however upstream had diverged
a bit from the original course I planned; I had to add a few things
that had been optimized out of upstream back in (last_host_tsc).

In the course of doing this, I think the new code looks much
cleaner, with well documented and easy to understand variables.
Yes, there are a lot of tracking variables to maintain this whole
infrastructure - and yes, some of them appear to be redundant or
easily computable from others - but in actuality, the information
they provide is easy to understand, and the resulting code is much
easier to verify than a complex system where some quantities may
be undefined or computed on the fly, causing subtle races.

Zach

 

You want this to show up on kvm@vger.kernel.org as well, no? :)
   


Yeah, oops on that.


Looks like there is still mark_tsc_unstable in KVM code, so we aren't
yet at the point of ultimate cleanness. Nevertheless, I'll test
suspend to RAM with this soon.
   


Yeah, I decided not to remove it... should it fire, we've detected 
either a bad hardware TSC or some previously unseen platform behavior.  
We're obligated to report it, and if it turns out to be another issue 
like suspend to RAM, we now have the machinery to fix it.


Zach


[PATCH x86 kvm] Fix register corruption in pvclock_scale_delta

2011-06-15 Thread Zachary Amsden
The 128-bit multiply in pvclock.h was missing an output constraint for 
EDX which caused a register corruption to appear.  Thanks to Ulrich for 
diagnosing the EDX corruption and Avi for providing this fix (which now 
I am sending back to you Avi..)


Zach
From Avi and Ulrich, fix constraints for 128-bit multiply

Signed-off-by: Zachary Amsden 

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 31d84ac..a518c0a 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -22,6 +22,8 @@ static inline u64 pvclock_scale_delta(u64 delta, u32 
mul_frac, int shift)
u64 product;
 #ifdef __i386__
u32 tmp1, tmp2;
+#else
+   ulong tmp;
 #endif
 
if (shift < 0)
@@ -42,8 +44,11 @@ static inline u64 pvclock_scale_delta(u64 delta, u32 
mul_frac, int shift)
: "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
 #elif defined(__x86_64__)
__asm__ (
-   "mul %%rdx ; shrd $32,%%rdx,%%rax"
-   : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
+   "mul %[mul_frac] ; shrd $32, %[hi], %[lo]"
+   : [lo]"=a"(product),
+ [hi]"=d"(tmp)
+   : "0"(delta),
+ [mul_frac]"rm"((u64)mul_frac));
 #else
 #error implement me!
 #endif
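
For reference, the x86_64 asm above performs a 64x64->128-bit multiply
and keeps bits [95:32] of the product. A portable sketch of the same
computation (illustrative only; the kernel uses inline asm rather than
relying on the compiler's __int128 support):

#include <stdint.h>

/* Returns (delta << shift) * mul_frac / 2^32, treating mul_frac as a
 * 32.32 fixed-point fraction - the pvclock cycles-to-ns conversion. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;
        return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

The bug itself: without the "=d" output, the compiler may assume RDX
still holds mul_frac after the asm statement, but mul clobbers RDX with
the high half of the product - hence the corruption.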


Re: [Qemu-devel] [PATCH v4 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-11 Thread Zachary Amsden

On 05/09/2011 12:03 AM, Ulrich Obergfell wrote:

Loss of periodic timer interrupts caused by delayed callbacks and by
interrupt coalescing is compensated by gradually injecting additional
interrupts during subsequent timer intervals, starting at a rate of
one additional interrupt per interval. The injection of additional
interrupts is based on a backlog of unaccounted HPET clock periods
(new HPETTimer field 'ticks_not_accounted'). The backlog increases
due to delayed callbacks and coalesced interrupts, and it decreases
if an interrupt was injected successfully. If the backlog increases
while compensation is still in progress, the rate at which additional
interrupts are injected is increased too. A limit is imposed on the
backlog and on the rate.

Injecting additional timer interrupts to compensate lost interrupts
can alleviate long term time drift. However, on a short time scale,
this method can have the side effect of making virtual machine time
intermittently pass slower and faster than real time (depending on
the guest's time keeping algorithm). Compensation is disabled by
default and can be enabled for guests where this behaviour may be
acceptable.

Signed-off-by: Ulrich Obergfell
---
  hw/hpet.c |   70 +++-
  1 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index e57c654..519fc6b 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -31,6 +31,7 @@
  #include "hpet_emul.h"
  #include "sysbus.h"
  #include "mc146818rtc.h"
+#include <assert.h>

  //#define HPET_DEBUG
  #ifdef HPET_DEBUG
@@ -41,6 +42,9 @@

  #define HPET_MSI_SUPPORT        0

+#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */
+#define MAX_IRQ_RATE            (uint32_t)10
+
  struct HPETState;
  typedef struct HPETTimer {  /* timers */
  uint8_t tn; /*timer number*/
@@ -324,14 +328,35 @@ static const VMStateDescription vmstate_hpet = {
  }
  };

+static void hpet_timer_driftfix_reset(HPETTimer *t)
+{
+if (t->state->driftfix&&  timer_is_periodic(t)) {
+t->ticks_not_accounted = t->prev_period = t->period;
   


This is rather confusing.  Clearly, ticks_not_accounted isn't actually 
ticks not accounted, it's a different variable entirely which is based 
on the current period.


If it were actually ticks_not_accounted, it should be reset to zero.


+t->irq_rate = 1;
+t->divisor = 1;
+}
+}
+
+static bool hpet_timer_has_tick_backlog(HPETTimer *t)
+{
+uint64_t backlog = 0;
+
+if (t->ticks_not_accounted>= t->period + t->prev_period) {
+backlog = t->ticks_not_accounted - (t->period + t->prev_period);
+}
+return (backlog>= t->period);
+}
+
  /*
   * timer expiration callback
   */
  static void hpet_timer(void *opaque)
  {
  HPETTimer *t = opaque;
+HPETState *s = t->state;
  uint64_t diff;
-
+int irq_delivered = 0;
+uint32_t period_count = 0;
  uint64_t period = t->period;
  uint64_t cur_tick = hpet_get_ticks(t->state);

@@ -339,13 +364,37 @@ static void hpet_timer(void *opaque)
  if (t->config&  HPET_TN_32BIT) {
  while (hpet_time_after(cur_tick, t->cmp)) {
  t->cmp = (uint32_t)(t->cmp + t->period);
+t->ticks_not_accounted += t->period;
+period_count++;
  }
  } else {
  while (hpet_time_after64(cur_tick, t->cmp)) {
  t->cmp += period;
+t->ticks_not_accounted += period;
+period_count++;
  }
  }
  diff = hpet_calculate_diff(t, cur_tick);
+if (s->driftfix) {
+if (t->ticks_not_accounted>  MAX_TICKS_NOT_ACCOUNTED) {
+t->ticks_not_accounted = t->period + t->prev_period;
+}
+if (hpet_timer_has_tick_backlog(t)) {
+if (t->irq_rate == 1 || period_count>  1) {
+t->irq_rate++;
+t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
+}
+if (t->divisor == 0) {
+assert(period_count);
+}
+if (period_count) {
+t->divisor = t->irq_rate;
+}
+diff /= t->divisor--;
   


Why subtracting from the divisor?  Shouldn't the divisor and irq_rate 
change in lockstep?



+} else {
+t->irq_rate = 1;
+}
+}
  qemu_mod_timer(t->qemu_timer,
 qemu_get_clock_ns(vm_clock) + 
(int64_t)ticks_to_ns(diff));
  } else if (t->config&  HPET_TN_32BIT&&  !timer_is_periodic(t)) {
@@ -356,7 +405,22 @@ static void hpet_timer(void *opaque)
  t->wrap_flag = 0;
  }
  }
-update_irq(t, 1);
+if (s->driftfix&&  timer_is_periodic(t)&&  period != 0) {
+if (t->ticks_not_accounted>= t->period + t->prev_period) {
+irq_delivered = update_irq(t, 1);
+if (irq_del


Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)

2011-05-09 Thread Zachary Amsden

On 05/09/2011 11:25 AM, Nikola Ciprich wrote:

The guest, because latest kernels do not suffer this problem; I'd like to
find a fix so it can be pushed to -stable (we're using 2.6.32.x)
host is currently 2.6.37 (and i'm currently testing 2.6.38 as well)
n.


That's a pretty wide range to be bisecting, and I think we know for a 
fact there were some kvmclock related bugs in that range.


If you are looking for something causing problems with tcpdump, I'd 
suggest getting rid of kvmclock in your testing and using TSC instead; 
if you're looking to verify that kvmclock related changes have been 
backported to -stable, rather than bisect and run into bugs, it would 
probably be faster to check the commit logs for arch/x86/kvm/x86.c and 
make sure you're not missing anything from me or Glauber that has been 
applied to the most recent branch.
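
For example (branch names and the author pattern here are hypothetical;
adjust to the trees being compared):

        git log v2.6.32..linus/master --author='Glauber\|Amsden' \
                -- arch/x86/kvm/x86.c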


Zach


Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)

2011-05-09 Thread Zachary Amsden

On 05/08/2011 12:06 PM, Nikola Ciprich wrote:

OK,
I see.. the problem is, that I'm trying to hunt down bug causing hangs
when 2.6.32 guests try to run tcpdump - this seems to be reproducible even on 
latest 2.6.32.x, and seems like it depends on kvm-clock..
So I was thinking about bisecting between 2.6.32 and latest git which doesn't 
seem to suffer this problem but hitting another (different) problem in 2.6.32 
complicates things a bit :(
If somebody has some hint on how to proceed, I'd be more than grateful..
cheers
n.
   


What are you bisecting, the host kernel or the guest kernel, and what 
version is the host kernel?



Re: [PATCH 27/30] nVMX: Additional TSC-offset handling

2011-05-09 Thread Zachary Amsden

On 05/08/2011 01:29 AM, Nadav Har'El wrote:

In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to
set vmcs12.tsc_offset, for this change to survive the next nested entry (see
prepare_vmcs02()).
   


Both changes look correct to me.

Zach


Re: Bug in KVM clock backwards compensation

2011-04-29 Thread Zachary Amsden

On 04/29/2011 01:40 AM, Joerg Roedel wrote:

On Thu, Apr 28, 2011 at 08:00:57PM -0700, Zachary Amsden wrote:
   

On 04/28/2011 01:20 PM, Joerg Roedel wrote:
 
   

This code checks how many guest tsc cycles have passed since this vCPU
was de-scheduled last time (and before it is running again). So since
the vCPU hasn't run in the meantime it had no chance to change its TSC.
Further, the other parameters like the TSC offset and the scaling
multiplier havn't changed too, so the only variable in the guest-tsc
calculation is the host-tsc.
So this calculation using the guest-tsc can detect backwards going
host-tsc as good as the old one. The benefit here is that we can feed
consistent values into adjust_tsc_offset().

   

While true, this is more complex than the original code.  The original
code here doesn't try to actually compensate for the guest TSC
difference - instead what it does is NULL any discovered host TSC delta:
 

Why should KVM care about the host-tsc going backwards when the
guest-tsc moves forward? I think the bottom-line of this code is to make
sure the guest-tsc does not jump around (or even jumps backwards).

I also don't agree that this is additional complexity. With these
changes we were able to get rid of the last_host_tsc variable which is
actually a simplification.

As I see it, there are two situations here:

1) On hosts without tsc-scaling the value of tsc_delta calculated from
the guest-tsc values is the same as when calculated with host-tsc
values, because the guest-tsc only differs by an offset from the
host-tsc.

2) On hosts with tsc-scaling these two tsc_delta values may be
different. If the guest has a different tsc-freq as the host, we
can't simply adjust the tsc by an offset calculated from the host-tsc
values. So tsc_delta has to be calculated using the guest-tsc.
   

 if (tsc_delta<  0)
 mark_tsc_unstable("KVM discovered backwards TSC");
 if (check_tsc_unstable()) {
 kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
 vcpu->arch.tsc_catchup = 1;
 }
 kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

Erasing that delta also erases elapsed time since the VCPU has last been
run, which isn't desirable, so it then sets tsc_catchup mode, which will
restore the proper TSC.  The request here triggers code which later
updates the TSC offset again.
 

I just looked again at the tsc_catchup thing. Since the clock-update is
forced in the code above, and the clock-update-code adjusts the tsc
itself if necessary, is it really necessary to do this here at all?
   


Unfortunately, yes, it is.  The code in the second or catchup phase can 
only correct an under adjusted clock, or we risk setting the guest clock 
backwards from what has been observed.  So to guarantee there is not a 
huge jump due to over-adjustment, we must eliminate the TSC delta when 
switching CPUs, and that needs to happen in the preempt notifier, not 
the clock update handler.


I agree it is simpler to do everything in terms of guest clock rate, but 
there is one variable which is thrown in to complicate things here - the 
host clock rate may have changed during this whole process.


Let me look at the code again and see if it is possible to get rid of 
the first, host based adjustment entirely.  Ideally we would just track 
things in terms of the guest clock and do all the computation inside the 
clock update request handler instead of having one adjustment in the 
preempt notifier and a separate one later when the clock update happens.


I had originally tried something like that but ran into issues where the 
rate computation got exceedingly difficult.  Maybe the code has been 
restructured enough now that it will work (the clock update didn't use 
to happen unless you had KVM clock...)


Zach


Re: Bug in KVM clock backwards compensation

2011-04-28 Thread Zachary Amsden

On 04/28/2011 01:20 PM, Joerg Roedel wrote:

On Thu, Apr 28, 2011 at 11:34:44AM -0700, Zachary Amsden wrote:
   

On 04/28/2011 12:13 AM, Roedel, Joerg wrote:
 
   

I see it different. This code wants to check if the _guest_ tsc moves
forwared (or at least not backwards). So it is fully legitimate to just
do this by reading the guest-tsc and compare it to the last one the
guest had.
   

That wasn't the intention when I wrote that code.  It's simply there to
detect backwards motion of the host TSC.  The guest TSC can legally go
backwards whenever the guest decides to change it, so checking the guest
TSC doesn't make sense here.
 

This code checks how many guest tsc cycles have passed since this vCPU
was de-scheduled last time (and before it is running again). So since
the vCPU hasn't run in the meantime it had no chance to change its TSC.
Further, the other parameters like the TSC offset and the scaling
multiplier haven't changed either, so the only variable in the guest-tsc
calculation is the host-tsc.
So this calculation using the guest-tsc can detect backwards going
host-tsc as good as the old one. The benefit here is that we can feed
consistent values into adjust_tsc_offset().
   


While true, this is more complex than the original code.  The original 
code here doesn't try to actually compensate for the guest TSC 
difference - instead what it does is NULL any discovered host TSC delta:


if (tsc_delta < 0)
mark_tsc_unstable("KVM discovered backwards TSC");
if (check_tsc_unstable()) {
kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
vcpu->arch.tsc_catchup = 1;
}
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

Erasing that delta also erases elapsed time since the VCPU has last been 
run, which isn't desirable, so it then sets tsc_catchup mode, which will 
restore the proper TSC.  The request here triggers code which later 
updates the TSC offset again.


To avoid complexity, I think it's simplest to do the first computation 
in terms of the host TSC.



Yes, with tsc-scaling, the machines already have stable TSCs - the above
test is for older hardware which could have problems, and can be
reverted back to the original code without worrying about switching
units.
 

This is the case practically. But architecturally the tsc-scaling feature
does not depend on a constant tsc, so we can make no such assumptions.
Additionally, it may happen that Linux mis-detects an unstable tsc for
some reason (broken BIOS, bug in the code, ...).  Therefore I think it
is dangerous to assume that this code will never run on tsc-scaling
capable hosts. And if it does and we don't manage the tsc-offset units
right, we may see very weird behavior.
   


I agree, it is best to handle this case - hardware can and will change - 
but the TSC adjustment in terms of guest rate should be done under the 
atomic protection right before entering hardware virtualized mode - here:


I left compute_guest_tsc in place to recompute time in guest units here, 
even if the underlying hardware rate changes.


/*
 * We may have to catch up the TSC to match elapsed wall clock
 * time for two reasons, even if kvmclock is used.
 *   1) CPU could have been running below the maximum TSC rate
 *   2) Broken TSC compensation resets the base at each VCPU
 *  entry to avoid unknown leaps of TSC even when running
 *  again on the same CPU.  This may cause apparent elapsed
 *  time to disappear, and the guest to stand still or run
 *  very slowly.
 */
if (vcpu->tsc_catchup) {
u64 tsc = compute_guest_tsc(v, kernel_ns);
if (tsc > tsc_timestamp) {
kvm_x86_ops->adjust_tsc_offset(v, tsc - 
tsc_timestamp);

tsc_timestamp = tsc;
}
}

So yeah, the code is getting pretty complex but we'd like to avoid that 
as much as possible - so I would prefer to have the hardware backwards 
compensation separate from the guest rate computation by doing this:


step 1) remove any backwards hardware TSC delta (in hardware units)
step 2) recompute guest TSC from a stable clock (gotten from kernel_ns) 
and apply adjustment (in guest units)


So it appears you can just share most of the logic of guest TSC catchup 
mode.
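
A sketch of that split, reusing names from the fragments quoted above
(the helper functions and their exact call sites are illustrative, not
an actual patch):

/* Step 1, at vcpu_load / preempt-notifier time: null any backwards
 * host TSC delta, working purely in host cycle units. */
static void null_backwards_host_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
{
        s64 tsc_delta = host_tsc - vcpu->arch.last_host_tsc;

        if (tsc_delta < 0)
                mark_tsc_unstable("KVM discovered backwards TSC");
        if (check_tsc_unstable()) {
                kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
                vcpu->arch.tsc_catchup = 1;
        }
}

/* Step 2, in the clock update handler: recompute the guest TSC from a
 * stable kernel clock and apply the adjustment in guest cycle units. */
static void catchup_guest_tsc(struct kvm_vcpu *v, s64 kernel_ns,
                              u64 tsc_timestamp)
{
        u64 tsc = compute_guest_tsc(v, kernel_ns);

        if (tsc > tsc_timestamp)
                kvm_x86_ops->adjust_tsc_offset(v, tsc - tsc_timestamp);
}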


What do you think?

Zach


Re: Bug in KVM clock backwards compensation

2011-04-28 Thread Zachary Amsden

On 04/28/2011 12:22 AM, Roedel, Joerg wrote:

On Thu, Apr 28, 2011 at 03:06:01AM -0400, Jan Kiszka wrote:
   

And /me still wonders (like I did when this first popped up) if the
proper place of determining TSC stability really has to be KVM.

If the Linux core fails to detect some instability and KVM has to jump
in, shouldn't we better improve the core's detection abilities and make
use of them in KVM? Conceptually this looks like we are currently just
working around a core deficit in KVM.
 

Yes, good question. Has this ever triggered on a real machine (not
counting the suspend/resume issue in)?
   


Yes... some platforms don't go haywire until you start using power 
management, TSC is stable before that, but not afterwards, and depending 
on the version of the kernel, KVM might detect this before the kernel does.


Honestly, the code is obsolete, but still useful for those who build KVM 
as an external module on older kernels using the kvm-kmod system.


Zach


Re: Bug in KVM clock backwards compensation

2011-04-28 Thread Zachary Amsden

On 04/28/2011 12:13 AM, Roedel, Joerg wrote:

On Thu, Apr 28, 2011 at 02:59:57AM -0400, Zachary Amsden wrote:
   

So I've been going over the new code changes to the TSC related code and
I don't like one particular set of changes.  In particular, here:

  kvm_x86_ops->vcpu_load(vcpu, cpu);
  if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
  /* Make sure TSC doesn't go backwards */
  s64 tsc_delta;
  u64 tsc;

  kvm_get_msr(vcpu, MSR_IA32_TSC,&tsc);
  tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
   tsc - vcpu->arch.last_guest_tsc;

  if (tsc_delta<  0)
  mark_tsc_unstable("KVM discovered backwards TSC");
  if (check_tsc_unstable()) {
  kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
  vcpu->arch.tsc_catchup = 1;
  }


The point of this code fragment is to test the host clock to see if it
is stable, because we may have just come back from an idle phase which
stopped the TSC, switched CPUs, or come back from a deep sleep state
which reset the host TSC.
 

I see it different. This code wants to check if the _guest_ tsc moves
forwared (or at least not backwards). So it is fully legitimate to just
do this by reading the guest-tsc and compare it to the last one the
guest had.
   


That wasn't the intention when I wrote that code.  It's simply there to 
detect backwards motion of the host TSC.  The guest TSC can legally go 
backwards whenever the guest decides to change it, so checking the guest 
TSC doesn't make sense here.



I saw a patch floating around that touched this code recently, but I
think there's a definite issue here that needs addressing.
 

In fact, this change was done to address one of your concerns. You
mentioned that the values passed to adjust_tsc_offset() were in
unconsistent units in my first version of tsc-scaling. This was a right
objection because one call-site used guest-tsc-units while the other
used host-tsc-units. This change intended to fix that by using
guest-tsc-units always for adjust_tsc_offset().

Note that the guest and the host tsc have the same units on current
machines. But with tsc-scaling these units are different.
   


Yes, with tsc-scaling, the machines already have stable TSCs - the above 
test is for older hardware which could have problems, and can be 
reverted back to the original code without worrying about switching units.


Zach


Re: Bug in KVM clock backwards compensation

2011-04-28 Thread Zachary Amsden

On 04/28/2011 12:06 AM, Jan Kiszka wrote:

On 2011-04-28 08:59, Zachary Amsden wrote:
   

So I've been going over the new code changes to the TSC related code and
I don't like one particular set of changes.  In particular, here:

 kvm_x86_ops->vcpu_load(vcpu, cpu);
 if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
 /* Make sure TSC doesn't go backwards */
 s64 tsc_delta;
 u64 tsc;

 kvm_get_msr(vcpu, MSR_IA32_TSC,&tsc);
 tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
  tsc - vcpu->arch.last_guest_tsc;

 if (tsc_delta<  0)
 mark_tsc_unstable("KVM discovered backwards TSC");
 if (check_tsc_unstable()) {
 kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
 vcpu->arch.tsc_catchup = 1;
 }


The point of this code fragment is to test the host clock to see if it
is stable, because we may have just come back from an idle phase which
stopped the TSC, switched CPUs, or come back from a deep sleep state
which reset the host TSC.

However, the above code is fetching the guest TSC instead of the host
TSC, which isn't the way it is supposed to work.

I saw a patch floating around that touched this code recently, but I
think there's a definite issue here that needs addressing.
 

And /me still wonders (like I did when this first popped up) if the
proper place of determining TSC stability really has to be KVM.
   


No, it's not.  I have a patch which fixes that.  Was in the process of 
merging it into my local tree when I found this.



If the Linux core fails to detect some instability and KVM has to jump
in, shouldn't we better improve the core's detection abilities and make
use of them in KVM? Conceptually this looks like we are currently just
working around a core deficit in KVM.
   


100% correct you are.

Zach


Bug in KVM clock backwards compensation

2011-04-28 Thread Zachary Amsden
So I've been going over the new code changes to the TSC related code and 
I don't like one particular set of changes.  In particular, here:


kvm_x86_ops->vcpu_load(vcpu, cpu);
if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
        /* Make sure TSC doesn't go backwards */
        s64 tsc_delta;
        u64 tsc;

        kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
        tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
                    tsc - vcpu->arch.last_guest_tsc;

        if (tsc_delta < 0)
                mark_tsc_unstable("KVM discovered backwards TSC");
        if (check_tsc_unstable()) {
                kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
                vcpu->arch.tsc_catchup = 1;
        }


The point of this code fragment is to test the host clock to see if it 
is stable, because we may have just come back from an idle phase which 
stopped the TSC, switched CPUs, or come back from a deep sleep state 
which reset the host TSC.


However, the above code is fetching the guest TSC instead of the host 
TSC, which isn't the way it is supposed to work.


I saw a patch floating around that touched this code recently, but I 
think there's a definite issue here that needs addressing.
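
(The direction the later TSC emulation series takes - see the
last_host_tsc diff fragment near the top of this archive - is to record
the host TSC at vcpu_put time and compare host readings only; condensed:

        vcpu->arch.last_host_tsc = native_read_tsc();   /* vcpu_put */

        tsc_delta = !vcpu->arch.last_host_tsc ? 0 :     /* vcpu_load */
                    native_read_tsc() - vcpu->arch.last_host_tsc;

so the backwards check never involves the guest TSC at all.)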


Zach


Re: [PATCH 3/6] KVM: X86: Make tsc_delta calculation a function of guest tsc

2011-04-19 Thread Zachary Amsden


On 2011-04-19 08:46, Roedel, Joerg wrote:
   

On Mon, Apr 18, 2011 at 08:02:35PM -0400, Zachary Amsden wrote:
 

On Sat, Apr 16, 2011 at 06:09:17PM +0200, Jan Kiszka wrote:

 

On 2011-03-25 09:44, Joerg Roedel wrote:

   


 

+   tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
+tsc - vcpu->arch.last_guest_tsc;

 


 

This patch appears to cause troubles to Linux guests on TSC clocksource
and APIC highres timer. The first boot after qemu start is always fine,
but after a reboot the guest timer appears to fire incorrectly or even
not at all.

Was this patch tested with a guest reboot scenario as well? Does it
account for the TSC being reset to 0 on reboot?

   

Hmm, probably the last_guest_tsc is not updated correctly in this
scenario. I will have a look tomorrow.

Joerg


 

To avoid this problem, when the TSC is reset, the overshoot protection
where last_guest_tsc is used is specifically disabled:

  /* Reset of TSC must disable overshoot protection below */
  vcpu->arch.hv_clock.tsc_timestamp = 0;
  vcpu->arch.last_tsc_write = data;
  vcpu->arch.last_tsc_nsec = ns;

You can probably use the same test - last_guest_tsc is only valid if
tsc_timestamp above != 0.
   

Yes, this should also work. But the other way we get rid of
last_host_tsc which is also a good thing :)
 

That reminds me: I think we still have the bug that KVM overeagerly
declares the host's TSC unstable after resume from S3 because the TSC
appears to go backward when KVM checks. Or should this have been fixed now?

   


That should have been fixed... let me check if the patch has made it 
upstream.



Re: [PATCH 3/6] KVM: X86: Make tsc_delta calculation a function of guest tsc

2011-04-18 Thread Zachary Amsden


On Sat, Apr 16, 2011 at 06:09:17PM +0200, Jan Kiszka wrote:
   

On 2011-03-25 09:44, Joerg Roedel wrote:
 
   

+   tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
+tsc - vcpu->arch.last_guest_tsc;
   
   

This patch appears to cause troubles to Linux guests on TSC clocksource
and APIC highres timer. The first boot after qemu start is always fine,
but after a reboot the guest timer appears to fire incorrectly or even
not at all.

Was this patch tested with a guest reboot scenario as well? Does it
account for the TSC being reset to 0 on reboot?
 

Hmm, probably the last_guest_tsc is not updated correctly in this
scenario. I will have a look tomorrow.

Joerg

   


To avoid this problem, when the TSC is reset, the overshoot protection 
where last_guest_tsc is used is specifically disabled:


/* Reset of TSC must disable overshoot protection below */
vcpu->arch.hv_clock.tsc_timestamp = 0;
vcpu->arch.last_tsc_write = data;
vcpu->arch.last_tsc_nsec = ns;

You can probably use the same test - last_guest_tsc is only valid if 
tsc_timestamp above != 0.
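
A minimal sketch of that guard (placement and surrounding code are
illustrative):

        /* last_guest_tsc is only meaningful while overshoot protection
         * is armed; a guest TSC write clears tsc_timestamp. */
        if (vcpu->arch.hv_clock.tsc_timestamp) {
                s64 tsc_delta = tsc - vcpu->arch.last_guest_tsc;
                if (tsc_delta < 0)
                        mark_tsc_unstable("KVM discovered backwards TSC");
        }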


Zach


Re: [PATCHv2] fix regression caused by e48672fa25e879f7ae21785c7efd187738139593

2011-04-12 Thread Zachary Amsden

On 04/11/2011 12:12 PM, Nikola Ciprich wrote:

Hello Zachary,
what is the current status, are You going to post this patch to Avi?
I'd like to see one (or both) in stable eventually, I think it's good 
candidate..
BR
nik
   


I think for upstream the newer patch is the way to go, but I would like 
to see it get some exposure first, as I think it did change locking in
one place.


I'm travelling right now with limited time and network access, so it's difficult 
to resend that patch, but Avi, can you apply it to your tree?


Thanks,

Zach


Re: [PATCHv2] fix regression caused by e48672fa25e879f7ae21785c7efd187738139593

2011-03-25 Thread Zachary Amsden

On 03/09/2011 05:36 PM, Nikola Ciprich wrote:

commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved 
kvm_request_guest_time_update(vcpu),
breaking 32bit SMP guests using kvm-clock. Fix this by moving the (new) clock
update function to the proper place.

Signed-off-by: Nikola Ciprich
---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4c27144..ba3f76f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2101,8 +2101,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (check_tsc_unstable()) {
kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
vcpu->arch.tsc_catchup = 1;
-   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
}
+   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
if (vcpu->cpu != cpu)
kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;

   



So something bothers me still about this bug.  What you did correctly 
restores the old behavior - but it shouldn't be fixing a bug.


The only reason you need to schedule an update for the KVM clock area is 
if a new VCPU has been created, you have an unstable TSC.. or something 
changes the VM's kvmclock offset.


So this change could in fact be hiding an underlying bug - either an 
unstable TSC is not being properly reported, the KVM clock offset is 
being changed, we are missing a KVM clock update for secondary VCPUs - 
or something else we don't yet understand is going on.


Nikola, can you try the patch below, which reverts your change and 
attempts to fix other possible sources of the problem, and see if it 
still reproduces?


Thanks,

Zach
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 58f517b..42618fb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2127,8 +2127,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (check_tsc_unstable()) {
kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
vcpu->arch.tsc_catchup = 1;
+   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
}
-   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+   if (vcpu->cpu == -1)
+   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
if (vcpu->cpu != cpu)
kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;
@@ -3534,6 +3536,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
struct kvm_clock_data user_ns;
u64 now_ns;
s64 delta;
+   struct kvm_vcpu *vcpu;
+   int i;
 
r = -EFAULT;
if (copy_from_user(&user_ns, argp, sizeof(user_ns)))
@@ -3549,6 +3553,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
delta = user_ns.clock - now_ns;
local_irq_enable();
kvm->arch.kvmclock_offset = delta;
+   kvm_for_each_vcpu(i, vcpu, kvm)
+   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
break;
}
case KVM_GET_CLOCK: {


Re: [PATCHv2] fix regression caused by e48672fa25e879f7ae21785c7efd187738139593

2011-03-09 Thread Zachary Amsden

On 03/09/2011 05:36 PM, Nikola Ciprich wrote:

commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved 
kvm_request_guest_time_update(vcpu),
breaking 32bit SMP guests using kvm-clock. Fix this by moving the (new) clock
update function to the proper place.

Signed-off-by: Nikola Ciprich
---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4c27144..ba3f76f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2101,8 +2101,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (check_tsc_unstable()) {
kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
vcpu->arch.tsc_catchup = 1;
-   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
}
+   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
if (vcpu->cpu != cpu)
kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;

   

ACK


Re: [PATCH] add missing guest clock update removed by e48672fa25e879f7ae21785c7efd187738139593

2011-03-09 Thread Zachary Amsden

On 03/09/2011 02:30 PM, Nikola Ciprich wrote:

Can you try moving the kvm_make_request() inside the if conditional and
see if that also fixes it?
 

yes, changing to:
if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
/* Make sure TSC doesn't go backwards */
s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :

is also OK.

what about changing:
if (check_tsc_unstable()) {
kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
vcpu->arch.tsc_catchup = 1;
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
}

to:

if (check_tsc_unstable()) {
kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
vcpu->arch.tsc_catchup = 1;
}
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

this fixes things for me as well..
n.
   


Can you send a patch which does that?  I think this is the correct fix.

Thanks,

Zach


Re: [PATCH] add missing guest clock update removed by e48672fa25e879f7ae21785c7efd187738139593

2011-03-08 Thread Zachary Amsden

On 03/07/2011 05:18 AM, Nikola Ciprich wrote:

e48672fa25e879f7ae21785c7efd187738139593 removed 
kvm_request_guest_time_update(vcpu);
this breaks 32bit SMP guests using virtio-clock.
thus add unconditional call to kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) to 
fix the
problem.

Signed-off-by: Nikola Ciprich
---
  arch/x86/kvm/x86.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bcc0efc..4c27144 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2091,6 +2091,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
}

kvm_x86_ops->vcpu_load(vcpu, cpu);
+   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
/* Make sure TSC doesn't go backwards */
s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :


   


Can you try moving the kvm_make_request() inside the if conditional and 
see if that also fixes it?



Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-03-06 Thread Zachary Amsden

On 03/05/2011 02:21 AM, Nikola Ciprich wrote:
   

Can you try this patch to see if it fixes the problem?
 

You haven't read my replies, have you? ;-)
kvm_request_guest_time_update seems to have been
removed, and kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu)
seems to be used instead, adding it fixes the problem.
That's what I was going to use in the patch... :)
   


I did read your mail, but I was working on an old tree... because of 
that transformation, this fix will unfortunately have to be back and 
forward ported by hand.


Did you try just that change applied right on top of the patch 
(e48672fa25e879f7ae21785c7efd187738139593) implicated by bisect?


It will be great to know if that change alone fixes the problem, if so, 
the fix you propose is probably the right one for upstream.


Thanks,

Zach


Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-03-04 Thread Zachary Amsden

On 03/04/2011 05:36 PM, Nikola Ciprich wrote:

I think although the long term plan is to just do this update once in
your case (stable tsc), this update is needed.

Why don't you send a patch to re-include it ?

 

Yes, I'll gladly submit a patch; one question: is it OK
to just add a call to kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) before
the conditional (as I did in my test), or should it go somewhere into the
else {..} section? It's called inside the conditional again, which will
cause it to be called twice in some cases - is that OK?
n.

   


Can you try this patch to see if it fixes the problem?

Thanks,

Zach
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 468fafa..ba05303 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1866,6 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
}
 
kvm_x86_ops->vcpu_load(vcpu, cpu);
+   kvm_request_guest_time_update(vcpu);
if (unlikely(vcpu->cpu != cpu)) {
/* Make sure TSC doesn't go backwards */
s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :


Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-03-04 Thread Zachary Amsden

On 03/04/2011 05:36 PM, Nikola Ciprich wrote:

I think although the long term plan is to just do this update once in
your case (stable tsc), this update is needed.

Why don't you send a patch to re-include it ?

 

Yes, I'll gladly submit a patch. One question: is it OK
to just add a call to kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) before
the conditional (as I did in my test), or should it go into an else {..}
section? It's called inside the conditional again, which will cause it
to be called twice in some cases; is that OK?
n.
   


Let me write a patch to fix this..


Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-03-04 Thread Zachary Amsden

On 03/04/2011 02:09 PM, Glauber Costa wrote:

On Fri, 2011-03-04 at 19:27 +0100, Nikola Ciprich wrote:
   

Hello Zachary,

 

You don't see any messages about TSC being unstable or switching
clocksource after loading the KVM module?  And you are not suspending
the host or anything?
   

no messages, no suspending, nothing.


 

Can you try using "processor.max_cstate=1" on the host as a kernel
parameter and see if it makes a difference?
   

I tried it, no change..
n.
 

Zach,

I don't understand 100% of the logic behind all your tsc changes.
But kvm-clock-wise, most of the problems we had in the past were related
to the difference in resolution between the tsc and the host clocksource
(hpet, acpi_pm, etc.), which in his case is a non-issue.

It does seem to me like some compensation logic kicked in, dismantling
an otherwise good tsc. He does have nonstop_tsc, which means it can't
get any better.

One thing I noticed when reading the culprit patch in bisect is that in
vcpu_load(), there was previously a call to

  kvm_request_guest_time_update(vcpu)

that was removed without a counterpart addition. Any idea about why it
was done?
   


That's probably the source of the bug... I've been looking for that 
exact line, though, and I can't find it missing.



Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-03-04 Thread Zachary Amsden

On 03/03/2011 05:01 PM, Nikola Ciprich wrote:

That sounds like a kernel which will be vulnerable to broken KVM clock
on 32-bit.  There's a kernel side fix that is needed, but why the server
side change triggers the problem needs more investigation.
 

OK, the important thing for me is that I can fix this with a kernel parameter,
but if I can help somehow with debugging, please let me know.
Thanks for your time!
nik
   


You don't see any messages about TSC being unstable or switching 
clocksource after loading the KVM module?  And you are not suspending 
the host or anything?


Can you try using "processor.max_cstate=1" on the host as a kernel 
parameter and see if it makes a difference?



Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-03-03 Thread Zachary Amsden

On 03/03/2011 04:06 PM, Nikola Ciprich wrote:

What is the exact kernel version you are using in the guest?
 

It's the latest CentOS (2.6.18-194.32.1.el5), so I guess there are a lot
of fixes, but it's possible the kvm-clock is broken in it.
I can't influence what kernel is used there (at least not on customers'
guests), but I guess asking for a clocksource kernel parameter to be
added is not a problem.

   


That sounds like a kernel which will be vulnerable to broken KVM clock 
on 32-bit.  There's a kernel side fix that is needed, but why the server 
side change triggers the problem needs more investigation.


Zach


Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-03-03 Thread Zachary Amsden

On 03/03/2011 02:06 AM, Nikola Ciprich wrote:

No worries.  What mess?
 

sending the same mail twice, never mind :)

   

I have two things you can try:

first is running a single VCPU guest, if you have not done so already.
 

yup, UP guest is fine, just SMP doesn't work.

   

Second is adding the bootparameter "clocksource=acpi_pm" to your guest
kernel.
 

Yes, this makes SMP work too! I just realized that when you were asking about
the current clocksource, I told you only the host source, not the guest's. So I
checked now, and (at least for UP; I guess for SMP it's the same) the
clocksource is kvm-clock! So it seems like it got broken with the TSC changes?
   


What is the exact kernel version you are using in the guest?

It appears that some earlier 32-bit kvm-clock-enabled kernels are still 
missing the required atomic check for backwards-time protection which 
would be needed on SMP.  This explains why 64-bit is fine and 32-bit is 
not.


Why this change triggers that problem is still a slight mystery; 
logically it should only affect the system if you have an unstable TSC.


Zach


Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-03-02 Thread Zachary Amsden


(resend, sorry for the mess)
   


No worries.  What mess?

I have two things you can try:

first is running a single VCPU guest, if you have not done so already.

Second is adding the bootparameter "clocksource=acpi_pm" to your guest 
kernel.


If either of those fixes the problem, it may very well have to do with this 
change itself and not with missing later dependent patches.  This 
change should be nearly a 1-1 transformation, and if it is not, 
something is wrong.


What branch are you bisecting on, the kvm branch or the kernel tree 
itself?  It would be helpful to see the exact code in case any 
surrounding logic changed.


Thanks,

Zach


Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-02-28 Thread Zachary Amsden


On Mon, Feb 28, 2011 at 10:17:24AM -0500, Zachary Amsden wrote:
   

On 02/28/2011 09:32 AM, Nikola Ciprich wrote:
 

Does the bug you are hitting manifest on both Intel and AMD platforms?

 

I don't have any AMD box here, I'll try this out at my home box.


   

Further, do the systems you are hitting this on have stable or unstable
TSCs?

 

How do I find this out? I don't see any warning about the TSC in the guest,
but I've only just started it..
n.
   

Before worrying about the guest, is the host TSC stable?  What is the
host clocksource?
 

Not sure; I'm not setting anything specifically. Is this snippet of dmesg 
relevant:

[1.148829] HPET: 8 timers in total, 5 timers will be used for per-cpu timer
[1.148934] hpet0: at MMIO 0xfed0, IRQs 2, 8, 40, 41, 42, 43, 44, 0
[1.149331] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
[1.151831] hpet: hpet2 irq 40 for MSI
[1.151962] hpet: hpet3 irq 41 for MSI
[1.155930] hpet: hpet4 irq 42 for MSI
[1.159937] hpet: hpet5 irq 43 for MSI
[1.163943] hpet: hpet6 irq 44 for MSI
[1.175955] Switching to clocksource tsc

so I guess I'm using hpet?
n.


   
Looks like you are using tsc, based on the last line.  Can you please 
send us the output of:


cat /proc/cpuinfo
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

and grep dmesg output (case-insensitively) for these keywords: TSC, clock, hpet, stable, khz, kvm


Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-02-28 Thread Zachary Amsden

On 02/28/2011 09:32 AM, Nikola Ciprich wrote:

Does the bug you are hitting manifest on both Intel and AMD platforms?
 

I don't have any AMD box here, I'll try this out at my home box.

   

Further, do the systems you are hitting this on have stable or unstable
TSCs?
 

How do I find this out? I don't see any warning about the TSC in the guest,
but I've only just started it..
n.


Before worrying about the guest, is the host TSC stable?  What is the 
host clocksource?



Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-02-28 Thread Zachary Amsden

On 02/27/2011 12:20 PM, Nikola Ciprich wrote:

I was not aware of the thread.  Please cc me directly, or add a keyword
I track - timekeeping, TSC..
 

Hello Zachary, thanks for Your time looking at this!
   

That change alone may not bisect well; without further fixes on top of
it, you may end up with a hang or stall, which is likely to manifest in
a vendor-specific way.
 

I'm not sure I really understand you here, but this change is exactly what
I got while bisecting. With later revisions, including this one,
32-bit SMP guests don't boot; before it, they do..
   


Does the bug you are hitting manifest on both Intel and AMD platforms?

Further, do the systems you are hitting this on have stable or unstable 
TSCs?


Thanks,

Zach


Re: regression - 2.6.36 -> 2.6.37 - kvm - 32bit SMP guests don't boot

2011-02-25 Thread Zachary Amsden

On 02/25/2011 05:48 AM, Nikola Ciprich wrote:

(CC: Zachary)

Hello,
Zachary, in case you haven't noticed the thread, we're trying
to find out why 32-bit SMP guests stopped working
in 2.6.37.
Bisect shows this as the culprit:
   


I was not aware of the thread.  Please cc me directly, or add a keyword 
I track - timekeeping, TSC..



e48672fa25e879f7ae21785c7efd187738139593 is first bad commit
commit e48672fa25e879f7ae21785c7efd187738139593
Author: Zachary Amsden
Date:   Thu Aug 19 22:07:23 2010 -1000

 KVM: x86: Unify TSC logic

 Move the TSC control logic from the vendor backends into x86.c
 by adding adjust_tsc_offset to x86 ops.  Now all TSC decisions
 can be done in one place.

 Signed-off-by: Zachary Amsden
 Signed-off-by: Marcelo Tosatti
   


That change alone may not bisect well; without further fixes on top of 
it, you may end up with a hang or stall, which is likely to manifest in 
a vendor-specific way.


Basically, there were a few differences in the platform code in how the 
TSC was dealt with on systems which did not have stable clocks.  This 
change brought the logic into one location, but there was a slight change to 
the logic here.


Note very carefully, the logic on SVM is gated by a condition before 
this change:


if (unlikely(cpu != vcpu->cpu)) {
-   u64 delta;
-
-   if (check_tsc_unstable()) {
-   /*
-* Make sure that the guest sees a monotonically
-* increasing TSC.
-*/
-   delta = vcpu->arch.host_tsc - native_read_tsc();
-   svm->vmcb->control.tsc_offset += delta;
-   if (is_nested(svm))
-   svm->nested.hsave->control.tsc_offset += 
delta;

-   }
-   vcpu->cpu = cpu;
-   kvm_migrate_timers(vcpu);


So this only happens with a system which reports TSC as unstable.  After 
the change, KVM itself may report the TSC as unstable:


+   if (unlikely(vcpu->cpu != cpu)) {
+   /* Make sure TSC doesn't go backwards */
+   s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
+   native_read_tsc() - 
vcpu->arch.last_host_tsc;

+   if (tsc_delta < 0)
+   mark_tsc_unstable("KVM discovered backwards TSC");
+   if (check_tsc_unstable())
+   kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
+   kvm_migrate_timers(vcpu);
+   vcpu->cpu = cpu;
+   }

If the platform has very small TSC deltas across CPUs, but indicates the 
TSC is stable, this could result in KVM marking the TSC unstable.  If 
that is the case, this compensation logic will kick in to avoid 
backwards TSCs.


Note, however, that the logic is not perfect; time which passes while the 
vcpu is not running on any CPU will be erased, as the delta compensation 
removes not just the backwards step, but all elapsed time from the TSC.  
For example, if a vcpu is scheduled out for 2,000,000 cycles and lands on 
a CPU whose TSC is 5,000 cycles behind, the full ~1,995,000-cycle delta 
is subtracted, so the guest TSC resumes exactly where it left off.  In 
extreme cases, this could result in time appearing to stand still, with 
guests failing to boot.


This was addressed with a later change, which catches up the missing time:

commit c285545f813d7b0ce989fd34e42ad1fe785dc65d
Author: Zachary Amsden 
Date:   Sat Sep 18 14:38:15 2010 -1000

KVM: x86: TSC catchup mode

Negate the effects of AN TYM spell while kvm thread is preempted by
tracking conversion factor to the highest TSC rate and catching the TSC
up when it has fallen behind the kernel view of time.  Note that once
triggered, we don't turn off catchup mode.

A slightly more clever version of this is possible, which only does
catchup when the TSC rate drops, and which specifically targets only
CPUs with broken TSC, but since these all are considered unstable_tsc(),
this patch covers all necessary cases.

Signed-off-by: Zachary Amsden 
Signed-off-by: Marcelo Tosatti 



Re: [PATCH 0/6] KVM support for TSC scaling

2011-02-21 Thread Zachary Amsden

On 02/21/2011 12:28 PM, Roedel, Joerg wrote:

On Sun, Feb 13, 2011 at 10:19:19AM -0500, Avi Kivity wrote:
   

On 02/09/2011 07:29 PM, Joerg Roedel wrote:
 

Hi Avi, Marcelo,

here is the patch-set to implement the TSC-scaling feature of upcoming
AMD CPUs. When this feature is supported the CPU provides a new MSR
which holds a multiplier for the hardware TSC which is applied on the
value rdtsc[p] and reads of MSR 0x10. This feature can be used to
emulate a given tsc frequency for the guest.
Patch 1 is not directly related to this patch-set because it only fixes
a bug which prevented me from testing these patches. In fact it fixes
the same bug Andre sent a patch for. But after the discussion about his
patch he told me to just post my patch and thus here it is.

   

Questions:
- the tsc multiplier really is a multiplier, right?  Not an addend that
is added every cycle.
 

Yes, it is a real multiplier. But writes to the TSC-MSR will change the
unscaled TSC value.

   

So

  wrmsr(TSC, 1e9)
  wrmsr(TSC_MULT, 2.)
  t = rdtsc()

will return about 2e9, not 1e9 + 2*(time to execute the code snippet) ?
 

Right. And if you exchange the two wrmsr calls it will still give you
the same result.
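
As a toy model of these semantics (all names invented for the sketch; this
is not the real MSR interface), the multiplier is applied at read time and
a TSC write rebases the unscaled counter, which is why the two writes
commute:

    #include <stdint.h>

    static uint64_t raw_tsc;          /* free-running hardware counter */
    static double   tsc_mult = 1.0;   /* TSC_MULT: applied on every read */
    static int64_t  tsc_adjust;       /* effect of writes to the TSC MSR */

    static void wrmsr_tsc(uint64_t val)
    {
        /* writing TSC rebases the *unscaled* counter to 'val' */
        tsc_adjust = (int64_t)val - (int64_t)raw_tsc;
    }

    static void wrmsr_tsc_mult(double mult)
    {
        tsc_mult = mult;
    }

    static uint64_t rdtsc_scaled(void)
    {
        /* scaling is applied to the unscaled value at read time */
        return (uint64_t)((double)((int64_t)raw_tsc + tsc_adjust) * tsc_mult);
    }

In this model, wrmsr_tsc(1000000000) followed by wrmsr_tsc_mult(2.0) makes
rdtsc_scaled() return about 2e9, and swapping the two writes gives the same
result, matching the description above.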

   

- what's the cost of wrmsr(TSC_MULT)?
 

Hard to tell by now because I only have numbers for pre-production
hardware.

   

There are really two ways to implement this feature.  One is fully
generic, like you did.  The other is to implement it at the host level -
have a sysfs file and/or kernel parameter for the desired tsc frequency,
write it once, and forget about it.  Trust management to set the host
tsc frequency to the same value on all hosts in a migration cluster.
 

The motivation here is mostly flexibility. Scaling the TSC for the
whole migration cluster only makes sense if all hosts there support the
feature. But the most likely scenario is that existing migration
clusters will be extended with new machines and guests will be migrated
there. And these guests should be able to see the same TSC frequency on
the new host as they had on the old one. The older machines in the
cluster may even have different TSC frequencies. With this flexible
implementation those scenarios are possible. A host-wide setting for the
scaling would make the feature useless in those (common) scenarios.
   


It's also possible to scale the TSCs of the cluster to be matching 
outside of the framework of KVM.  In that case, the VCPU client (qemu) 
simply needs to be smart enough to not request the TSC rate be scaled.  
That approach is completely compatible with this implementation.


If you do indeed want to have mixed speed VMs running on a single host, 
that can also be done with the approach here.


Combining the two - supporting a standard cluster rate via host scaling, 
plus a variable rate for martian VMs (those not conforming to the 
standard cluster rate) would require some more work, as the multiplier 
written back on exit from a martian would not be 1.0, rather something 
else.  Everything else should work as long as tsc_khz still expresses 
the natural rate of the TSC, even when scaled to a standard cluster 
rate.  In that case, you can also pursue Avi's suggestion of skipping 
the MSR loads for VMs where the rate matches the host rate.
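
A sketch of that skip, assuming the AMD TSC ratio MSR
(MSR_AMD64_TSC_RATIO) and a hypothetical build_tsc_ratio() helper for the
fixed-point ratio; this is not code from the series, just the shape of
the idea:

    static void svm_load_tsc_ratio(struct kvm_vcpu *vcpu)
    {
        u32 guest_khz = vcpu->kvm->arch.virtual_tsc_khz;

        /* rate matches the host's natural rate: skip the MSR load */
        if (!guest_khz || guest_khz == tsc_khz)
            return;

        wrmsrl(MSR_AMD64_TSC_RATIO, build_tsc_ratio(tsc_khz, guest_khz));
    }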


Adding an export to the kernel indicating the currently applied scaling 
rate may not be a bad idea if you want to support such an implementation 
in the future.


I did have one slight concern about scaling in general.  What happens 
when the CPU kHz rate is not uniformly detected across machines or 
clusters?  In general, it does vary a bit; I see differences out to the 
5th digit of precision on the same machine.  This is close enough to be 
within the range of NTP correction (500 ppm), but also small enough to 
represent real clock differences (and of course, there is some 
measurement error).


If you are within the threshold where NTP can correct the time, you may 
not want to apply a multiplier to the TSC at all.  Again, this decision 
can be made in the userspace component, but it's an important 
consideration to bring up for the qemu patches that will be required to 
support this.
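
That userspace check is simple arithmetic; a sketch, with the threshold
and function names purely illustrative:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define NTP_MAX_PPM 500   /* NTP's correction range */

    static bool should_scale_tsc(uint64_t host_khz, uint64_t guest_khz)
    {
        int64_t diff = (int64_t)guest_khz - (int64_t)host_khz;
        /* parts-per-million difference between the two rates */
        int64_t ppm = llabs(diff) * 1000000 / (int64_t)host_khz;
        return ppm > NTP_MAX_PPM;
    }

For example, a desired rate of 2666000 kHz on a 2666700 kHz host is about
262 ppm apart, within NTP's range, so no multiplier would be applied.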


Zach


Re: [PATCH 5/6] KVM: X86: Delegate tsc-offset calculation to architecture code

2011-02-11 Thread Zachary Amsden

On 02/09/2011 12:29 PM, Joerg Roedel wrote:

With TSC scaling in SVM the tsc-offset needs to be
calculated differently. This patch propagates this
calculation into the architecture specific modules so that
this complexity can be handled there.

Signed-off-by: Joerg Roedel
---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/svm.c  |   11 ++-
  arch/x86/kvm/vmx.c  |6 ++
  arch/x86/kvm/x86.c  |   10 +-
  4 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9686950..8c40425 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -593,6 +593,7 @@ struct kvm_x86_ops {
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);

bool (*use_virtual_tsc_khz)(struct kvm_vcpu *vcpu);
+   u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
   


So I've gone over this series and the only issue I see so far is with 
this patch, and it doesn't have to do with upstream code, rather with 
code I was about to send.


Logically, the compensation done by adjust_tsc_offset should also be 
scaled; currently, this happens only in two cases, both of which are 
meant to deal with unstable TSCs; since TSC scaling won't happen on 
those processors with unstable TSCs, we don't need to worry about it there.


I have an upcoming patch which does complicate things a bit, which deals 
with host suspend.  In that case, the host TSC goes backwards and the 
offsets need to be recomputed.  However, there is no convenient time to 
set the offset again; on VMX, the hardware will not yet be set up, so we 
can't easily write the offset field in the VMCS.  We also can't put a 
synchronization barrier on all the VCPUs to write the offset before they 
start running without getting into a difficult situation with locking.


So instead, the approach I took was to re-use the adjust_tsc_offset 
function and accumulate offsets to apply.


For SVM with TSC scaling, the offset to apply as an adjustment in this 
case needs to be scaled.  Setting the guest TSC (gtsc) equal to the new 
guest TSC (gtsc'), we have:

gtsc  = htsc * mult + offset
gtsc' = htsc' * mult + offset'
gtsc' = gtsc
offset' = htsc * mult + offset - htsc' * mult
offset' = (htsc - htsc') * mult + offset

so the delta offset needs to be (htsc - htsc') * mult

We will instead be passing (htsc - htsc') as the adjustment value; the 
solution seems simple: we have to scale it up as well in the 
adjust_tsc_offset function.


However, the problem is that we need a new architecture-specific 
function or an API change, because not all call sites for adjust_tsc want 
to have adjustments scaled - the call site dealing with tsc_catchup is 
actually working in guest cycles already, so it should not be scaled again.


We could have separate functions to adjust TSC cycles by either guest or 
host cycle amounts, or pass a flag to adjust_tsc_offset indicating 
whether the adjustment is to be applied in guest cycles or host cycles.
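
A sketch of the flag variant (names hypothetical; scale_host_to_guest()
stands in for whatever fixed-point multiply the hardware ratio requires):

    /* Adjustments arrive in either host or guest cycles; only host-cycle
     * amounts need scaling by the guest/host ratio, since the tsc_catchup
     * path already works in guest cycles. */
    static void kvm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment,
                                      bool host_cycles)
    {
        if (host_cycles)
            adjustment = scale_host_to_guest(vcpu, adjustment);
        kvm_x86_ops->adjust_tsc_offset(vcpu, adjustment);
    }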


The resulting API will be slightly asymmetric, as compute_tsc_offset 
lets the generic code compute in terms of hardware offsets, but in the 
adjustment case, there isn't an easy way to expose the ability to 
compute in hardware offset terms.


One slight pity is that we won't be able to reuse 
svm_compute_tsc_offset, as the applied delta won't be based off a read 
of the tsc.  I can't really find a better API, though; in case offsets 
are computed differently on different hardware (such as multiplying 
after the offset), we need a function to convert guest cycles back 
to hardware cycles.


As usual, with the TSC code, it is going to require a lot of commenting 
to explain this.


Your code in general looks good.

Cheers,

Zach


Re: [PATCH] KVM: x86: Convert tsc_write_lock to raw_spinlock

2011-02-07 Thread Zachary Amsden

On 02/07/2011 10:00 AM, Jan Kiszka wrote:

On 2011-02-07 15:11, Zachary Amsden wrote:
   

On 02/07/2011 06:35 AM, Jan Kiszka wrote:
 

On 2011-02-04 22:03, Zachary Amsden wrote:

   

On 02/04/2011 04:49 AM, Jan Kiszka wrote:

 

Code under this lock requires non-preemptibility. Ensure this also over
-rt by converting it to raw spinlock.


   

Oh dear, I had forgotten about that.  I believe kvm_lock might have the
same assumption in a few places regarding clock.

 

I only found a problematic section in kvmclock_cpufreq_notifier. Didn't
see this during my tests as I have CPUFREQ disabled in my .config.

We may need something like this as converting kvm_lock would likely be
overkill:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 36f54fb..971ee0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4530,7 +4530,7 @@ static int kvmclock_cpufreq_notifier(struct 
notifier_block *nb, unsigned long va
struct cpufreq_freqs *freq = data;
struct kvm *kvm;
struct kvm_vcpu *vcpu;
-   int i, send_ipi = 0;
+   int i, me, send_ipi = 0;

/*
 * We allow guests to temporarily run on slowing clocks,
@@ -4583,9 +4583,11 @@ static int kvmclock_cpufreq_notifier(struct 
notifier_block *nb, unsigned long va
kvm_for_each_vcpu(i, vcpu, kvm) {
if (vcpu->cpu != freq->cpu)
continue;
+   me = get_cpu();
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-   if (vcpu->cpu != smp_processor_id())
+   if (vcpu->cpu != me)
send_ipi = 1;
+   put_cpu();
}
}
spin_unlock(&kvm_lock);

Jan


   

That looks like a good solution, and I do believe that is the only place
the lock is used in that fashion - please add a comment though in the
giant comment block above that preemption protection is needed for RT.
Also, gcc should catch this, but moving the me variable into the
kvm_for_each_vcpu loop should allow for better register allocation.
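
Roughly, the loop would then read (a sketch of the same hunk with the
suggested tweak, not a tested patch):

    kvm_for_each_vcpu(i, vcpu, kvm) {
        int me;

        if (vcpu->cpu != freq->cpu)
            continue;
        /* get_cpu() disables preemption, so the comparison below
         * cannot race with this task migrating to another CPU */
        me = get_cpu();
        kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
        if (vcpu->cpu != me)
            send_ipi = 1;
        put_cpu();
    }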

The only other thing I can think of is that RT lock preemption may break
some of the CPU initialization semantics enforced by kvm_lock if you
happen to get a hotplug event just as the module is loading.  That
should be rare, but if it is indeed a bug, it would be nice to fix; it
would be a panic for sure not to initialize VMX.
 

Hmm, is a cpu hotplug notifier allowed to run code that sleeps? I can't
imagine it is. So we already have a strong reason to convert kvm_lock to a
raw_spinlock, which obsoletes the above workaround.
   


I don't know whether it is allowed to sleep; it doesn't call any sleeping 
functions to my knowledge.  What worries me in the RT case is that a task 
holding the spinlock acquired for hardware_enable might be preempted and 
resumed on another CPU, which obviously isn't what you want.


Zach


Re: [PATCH] KVM: x86: Convert tsc_write_lock to raw_spinlock

2011-02-07 Thread Zachary Amsden

On 02/07/2011 06:35 AM, Jan Kiszka wrote:

On 2011-02-04 22:03, Zachary Amsden wrote:
   

On 02/04/2011 04:49 AM, Jan Kiszka wrote:
 

Code under this lock requires non-preemptibility. Ensure this also over
-rt by converting it to raw spinlock.

   

Oh dear, I had forgotten about that.  I believe kvm_lock might have the
same assumption in a few places regarding clock.
 

I only found a problematic section in kvmclock_cpufreq_notifier. Didn't
see this during my tests as I have CPUFREQ disabled in my .config.

We may need something like this as converting kvm_lock would likely be
overkill:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 36f54fb..971ee0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4530,7 +4530,7 @@ static int kvmclock_cpufreq_notifier(struct 
notifier_block *nb, unsigned long va
struct cpufreq_freqs *freq = data;
struct kvm *kvm;
struct kvm_vcpu *vcpu;
-   int i, send_ipi = 0;
+   int i, me, send_ipi = 0;

/*
 * We allow guests to temporarily run on slowing clocks,
@@ -4583,9 +4583,11 @@ static int kvmclock_cpufreq_notifier(struct 
notifier_block *nb, unsigned long va
kvm_for_each_vcpu(i, vcpu, kvm) {
if (vcpu->cpu != freq->cpu)
continue;
+   me = get_cpu();
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-   if (vcpu->cpu != smp_processor_id())
+   if (vcpu->cpu != me)
send_ipi = 1;
+   put_cpu();
}
}
spin_unlock(&kvm_lock);

Jan

   


That looks like a good solution, and I do believe that is the only place 
the lock is used in that fashion - please add a comment though in the 
giant comment block above that preemption protection is needed for RT.  
Also, gcc should catch this, but moving the me variable into the 
kvm_for_each_vcpu loop should allow for better register allocation.


The only other thing I can think of is that RT lock preemption may break 
some of the CPU initialization semantics enforced by kvm_lock if you 
happen to get a hotplug event just as the module is loading.  That 
should be rare, but if it is indeed a bug, it would be nice to fix; it 
would be a panic for sure not to initialize VMX.


Cheers,

Zach


Re: [PATCH] KVM: x86: Convert tsc_write_lock to raw_spinlock

2011-02-04 Thread Zachary Amsden

On 02/04/2011 04:49 AM, Jan Kiszka wrote:

Code under this lock requires non-preemptibility. Ensure this also over
-rt by converting it to raw spinlock.
   


Oh dear, I had forgotten about that.  I believe kvm_lock might have the 
same assumption in a few places regarding clock.
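
For reference, the shape of the conversion (a sketch; the real patch
touches kvm's tsc_write_lock, and the names here are illustrative): on
PREEMPT_RT, spinlock_t becomes a sleeping, preemptible lock, while
raw_spinlock_t stays a true spinning lock, so sections that rely on
running non-preempted must use the latter.

    static DEFINE_RAW_SPINLOCK(tsc_write_lock);   /* was DEFINE_SPINLOCK() */

    static void write_tsc_locked(struct kvm_vcpu *vcpu, u64 offset)
    {
        unsigned long flags;

        raw_spin_lock_irqsave(&tsc_write_lock, flags);
        /* ... TSC offset computation that must not be preempted,
         * even on -rt ... */
        raw_spin_unlock_irqrestore(&tsc_write_lock, flags);
    }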



Re: [KVM TSC trapping / migration 2/2] Add TSC KHZ MSR

2011-01-18 Thread Zachary Amsden

On 01/14/2011 06:00 AM, Juan Quintela wrote:

Marcelo Tosatti  wrote:
   

On Fri, Jan 07, 2011 at 10:44:20AM -1000, Zachary Amsden wrote:
 

On 01/07/2011 12:48 AM, Marcelo Tosatti wrote:
   

On Thu, Jan 06, 2011 at 12:10:45AM -1000, Zachary Amsden wrote:
 

Use an MSR to allow "soft" migration to hosts which do not support
TSC trapping.  Rather than make this a required element of any
migration protocol, we allow the TSC rate to be exported as a data
field (useful in its own right), but we also allow a one time write
of the MSR during VM creation.  The result is that for the common
use case, no protocol change is required to communicate TSC rate
to the receiving host.
   

Migration to hosts which do not support the feature can be achieved by
saving/restoring the TSC rate + flags in a subsection. A subsection
seems more appropriate than an MSR for this.
 

Yes, I looked at that, but it looked to me like a subsection was
intended for an optional feature which MUST be present on the
destination if the source is using the feature.  This way, newer
hosts without the feature enabled can migrate to older hosts which
do not support the feature.
   

Right. But you can use a subsection to achieve the same effect. Just
consider that the source is not using the feature if you want to migrate
to an older host without support for it. Juan, is there a problem to
use subsections in this fashion?

With the MSR scheme, there is no way for management to enforce support
of the feature on the destination (at least not that i can see). And
you create an MSR that does not exist on real hardware.

 

The TSC rate migration is slightly different; we may wish to migrate
from a host with the TSC rate feature enabled to a host which does
not support the TSC rate feature.  This is exactly the current
behavior, the TSC rate will change on that migration, and I wanted
to preserve that behavior.  I don't advise that mode of usage, but
there may be use cases for it and it should be decided by policy,
not dictated by our feature set.

That said, I'm happy to remove the MSR if we truly don't want to
support that mode of usage.
   

OK, I chime in late.

We are adding a new MSR to the communication with userspace.  So far so
good, but this new field needs to be transmitted to the "other end" of
the migration.  This means a new field for migration (notice that this
is independent of whether this is an MSR or not).

 VMSTATE_UINT64_V(system_time_msr, CPUState, 11),
 VMSTATE_UINT64_V(wall_clock_msr, CPUState, 11),
   


Oh, wow.  I thought the MSRs were sent automatically by qemu based on 
what MSRs the kvm module told it were available.  It looks like EFER and 
STAR and friends are all special-cased as part of CPUState.


So my approach has been doomed from the beginning.


These are the values that we are sending now.
We are now getting a new value; the problem is how to migrate it.

Solutions:
- create a new field in CPUState, and bump the version.
   That would make backward migration impossible.
- create a new field, and send it only if it has been used, via a
   subsection.  This makes migration backwards-compatible if the feature
   was not used.

- but, it appears that if this feature is not "known" on the
   destination, we can use the old way to migrate the information.

   BIG WARNING HERE: I don't claim to understand how clocks work at all

   There is no way to convince old qemu/kernels to ignore new fields, for
   good reason.  So the only solution here is to encode this new
   vcpu->kvm->arch.virtual_tsc_khz in the two previous fields, in a way
   that is understandable for old qemu/new qemu.  Old qemu will use the old
   method, new qemu will use a new method.

   If there is no common encoding that will work for both the old and new
   methods, I can't really think of a way to make things work here :(

And as per the warning, I can't think of a way to encode
"virtual_tsc_khz" into "system_time_msr" and "wall_clock_msr" off-hand.

To make things clearer about optional features: after "lots" of
discussions, it was decided that the target of a migration will never
ignore anything sent; that means that the only one that can decide not to
send a feature/value is the "source" of the migration.  There is no way
to express:

- try this method, if you don't know
- try this other second best, ...
   


That decides it then: the feature is migrated in the state it is set, if 
it is enabled.


This makes things much simpler all around.
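
For concreteness, the subsection route Marcelo and Juan describe would
look roughly like this in qemu's VMState style (a sketch; the field name
and the exact subsection plumbing are illustrative and varied across
qemu versions):

    static bool tsc_khz_needed(void *opaque)
    {
        CPUState *env = opaque;
        /* only ship the subsection when the feature is in use, so
         * migration to older hosts keeps working */
        return env->tsc_khz != 0;
    }

    static const VMStateDescription vmstate_tsc_khz = {
        .name = "cpu/tsc_khz",
        .version_id = 1,
        .minimum_version_id = 1,
        .fields = (VMStateField[]) {
            VMSTATE_INT64(tsc_khz, CPUState),
            VMSTATE_END_OF_LIST()
        }
    };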

Cheers,

Zach



Re: [KVM TSC trapping / migration 1/2] Add TSC trapping for SVM and VMX

2011-01-09 Thread Zachary Amsden

On 01/07/2011 01:23 AM, Marcelo Tosatti wrote:

On Thu, Jan 06, 2011 at 12:10:44AM -1000, Zachary Amsden wrote:
   

Reasons to trap the TSC are numerous, but we want to avoid it as much
as possible for performance reasons.

We provide two conservative modes via modules parameters and userspace
hinting.  First, the module can be loaded with "tsc_auto=1" as a module
parameter, which turns on conservative TSC trapping only when it is
required (when unstable TSC or faster KHZ CPU is detected).

For userspace hinting, we enable trapping only if necessary.  Userspace
can hint that a VM needs a fixed frequency TSC, and also that SMP
stability will be required.  In that case, we conservatively turn on
trapping when it is needed.  In addition, users may now specify the
desired TSC rate at which to run.  If this rate differs significantly
from the host rate, trapping will be enabled.

There is also an override control to allow TSC trapping to be turned on
or off unconditionally for testing.

We indicate to pvclock users that the TSC is being trapped, to allow
avoiding overhead and directly using RDTSCP (only for SVM).  This
optimization is not yet implemented.

Signed-off-by: Zachary Amsden
---
  arch/x86/include/asm/kvm_host.h|6 +-
  arch/x86/include/asm/pvclock-abi.h |1 +
  arch/x86/kvm/svm.c |   20 ++
  arch/x86/kvm/vmx.c |   21 +++
  arch/x86/kvm/x86.c |  113 +---
  arch/x86/kvm/x86.h |2 +
  include/linux/kvm.h|   15 +
  7 files changed, 168 insertions(+), 10 deletions(-)
 

- Docs / test case please.
   


Yes, will do.


- KVM_TSC_CONTROL ioctl ignores flags field.
   


Oops.


- What is the purpose of PVCLOCK_TSC_TRAPPED_BIT?
   


To allow RDTSCP optimizations for KVM clock when TSC is trapped (because 
a userspace application requires strict TSC).



- Fail to see purpose of module parameters. Configuration from qemu
   should be enough?
   


For users with older versions of qemu who wish to take advantage of the 
feature, or for performance / bug testing.  And oops here, the tsc_trap 
should not default to on.



Re: [KVM TSC trapping / migration 2/2] Add TSC KHZ MSR

2011-01-07 Thread Zachary Amsden

On 01/07/2011 12:48 AM, Marcelo Tosatti wrote:

On Thu, Jan 06, 2011 at 12:10:45AM -1000, Zachary Amsden wrote:
   

Use an MSR to allow "soft" migration to hosts which do not support
TSC trapping.  Rather than make this a required element of any
migration protocol, we allow the TSC rate to be exported as a data
field (useful in its own right), but we also allow a one time write
of the MSR during VM creation.  The result is that for the common
use case, no protocol change is required to communicate TSC rate
to the receiving host.
 

Migration to hosts which do not support the feature can be achieved by
saving/restoring the TSC rate + flags in a subsection. A subsection
seems more appropriate than an MSR for this.
   


Yes, I looked at that, but it looked to me like a subsection was 
intended for an optional feature which MUST be present on the 
destination if the source is using the feature.  This way, newer hosts 
without the feature enabled can migrate to older hosts which do not 
support the feature.


The TSC rate migration is slightly different; we may wish to migrate 
from a host with the TSC rate feature enabled to a host which does not 
support the TSC rate feature.  This is exactly the current behavior, the 
TSC rate will change on that migration, and I wanted to preserve that 
behavior.  I don't advise that mode of usage, but there may be use cases 
for it and it should be decided by policy, not dictated by our feature set.


That said, I'm happy to remove the MSR if we truly don't want to support 
that mode of usage.


Zach


Re: [KVM TSC trapping / migration 1/2] Add TSC trapping for SVM and VMX

2011-01-06 Thread Zachary Amsden

On 01/06/2011 12:38 PM, Alexander Graf wrote:




Sure, I'm not saying your patch is bad or goes in the wrong direction. I'd just 
think it'd be awesome to have an easy way for the guest OS to know that 
something as crucial as TSC reading speed got changed, hopefully even TSC 
frequency. Having any form of notification leaves open doors for someone to 
implement something (think proprietary OSs or out-of-service OSs here). Having 
no notification leaves us with no choice but to take the penalty and keep the 
guest less informed than it has to be.
   


We do - register kvmclock and check the version before and after time 
computations to be sure the frequency hasn't changed.


This doesn't even require an interrupt.
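
The check Zach describes follows the pvclock shared-page protocol: the
host makes the version odd while it updates the fields and even again
when done, so the guest retries until it reads a consistent snapshot.  A
minimal guest-side sketch (field layout modeled on
pvclock_vcpu_time_info; __rdtsc is the compiler intrinsic):

    #include <stdint.h>
    #include <x86intrin.h>

    struct pvclock_time {
        uint32_t version;
        uint64_t tsc_timestamp;
        uint64_t system_time;
        uint32_t tsc_to_system_mul;
        int8_t   tsc_shift;
    };

    uint64_t read_pvclock(volatile struct pvclock_time *src)
    {
        uint32_t version;
        uint64_t delta, time;

        do {
            version = src->version;
            __atomic_thread_fence(__ATOMIC_ACQUIRE);
            delta = __rdtsc() - src->tsc_timestamp;
            if (src->tsc_shift >= 0)
                delta <<= src->tsc_shift;
            else
                delta >>= -src->tsc_shift;
            /* sketch only: the real code widens this multiply
             * to avoid overflow on large deltas */
            time = src->system_time +
                   ((delta * src->tsc_to_system_mul) >> 32);
            __atomic_thread_fence(__ATOMIC_ACQUIRE);
        /* retry on an in-flight update (odd) or a version change */
        } while ((version & 1) || version != src->version);

        return time;
    }

A frequency change is published the same way, so the guest never computes
time with a stale tsc_to_system_mul.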

   
 


   

Would it make sense to add a kvmclock interrupt to notify the guest of such a 
change?

   

kvmclock is immune to frequency changes, so it needs no interrupt, it just has 
a version controlled shared area, which is reset.

 


   

We indicate to pvclock users that the TSC is being trapped, to allow
avoiding overhead and directly using RDTSCP (only for SVM).  This
optimization is not yet implemented.

 


   

That doesn't sound to me like they're unaffected?

   

On Intel RDTSCP traps along with RDTSC.  This means that you can't have a 
trapping, constant rate TSC for userspace without also paying the overhead for 
reading the TSC for kvmclock.  This is not true on SVM, where RDTSCP is a 
separate trap, allowing optimization.
 

So how does the guest know that something changed when it's migrated from an 
AMD machine to an Intel machine?
   


That can happen, but it never should.  Simply too much state in the guest 
depends on the CPU type: different workarounds are enabled for things, and 
even different instruction sets are activated.


There is no reward for the kind of complexity involved.

Zach


Re: [KVM TSC trapping / migration 2/2] Add TSC KHZ MSR

2011-01-06 Thread Zachary Amsden

On 01/06/2011 01:40 AM, Alexander Graf wrote:

On 06.01.2011, at 12:27, Zachary Amsden wrote:

   

On 01/06/2011 12:34 AM, Alexander Graf wrote:
 

Am 06.01.2011 um 11:10 schrieb Zachary Amsden:


   

Use an MSR to allow "soft" migration to hosts which do not support
TSC trapping.  Rather than make this a required element of any
migration protocol, we allow the TSC rate to be exported as a data
field (useful in its own right), but we also allow a one time write
of the MSR during VM creation.  The result is that for the common
use case, no protocol change is required to communicate TSC rate
to the receiving host.

This allows administrative tools to configure migration policy
as they see appropriate.  Rather than dictate this policy with the
KVM implementation, we properly allow migration to hosts which both
do and do not support setting of the TSC rate on the receiving end.
If it is wished to not support migration to a host which lacks
support for the TSC rate feature, that can be coordinated externally.

 

Isn't there a real hw equivalent of such a register? It might make more sense 
to just implement that then.


   

Unfortunately, no.
 

Bleks. I couldn't find anything in AMD documentation either. Intel 
documentation is usually hard to find and incomplete anyway, so maybe 
something's hiding there - but if it's hidden so well, it's no use 
implementing it either.
   


While it makes perfect logical sense to us software people, from a 
hardware perspective it is ridiculous.  There is no signal input to the 
processor which can be counted and returned by a "CPU frequency MSR".  
The CPU frequency is simply what the CPU is clocked at.  And you can't 
burn it onto the die (think of overclocking), unless you have your own 
internal reference oscillator.


No, a CPU frequency MSR is impossible to implement sensibly in hardware 
because


1) if you have an internal clock, you have cross-CPU drift and can't 
easily share a common bus.
2) if you have an external clock, you have no way to measure its 
frequency without an internal reference, which again will vary, 
leading to a different measured frequency on each CPU.


The only thing that sort of makes sense is a one-time writable MSR that 
is programmed by the BIOS to return the CPU frequency, as retrieved from 
the chipset.  This is merely a convenience for software people who can't 
be bothered to query the chipset directly or measure against a reference 
clock.


So as much as we'd like it, it really does make little sense from a 
hardware perspective.


Zach


Re: [KVM TSC trapping / migration 1/2] Add TSC trapping for SVM and VMX

2011-01-06 Thread Zachary Amsden

On 01/06/2011 01:38 AM, Alexander Graf wrote:

On 06.01.2011, at 12:30, Zachary Amsden wrote:

   

On 01/06/2011 12:41 AM, Alexander Graf wrote:
 

Am 06.01.2011 um 11:10 schrieb Zachary Amsden:


   

Reasons to trap the TSC are numerous, but we want to avoid it as much
as possible for performance reasons.

We provide two conservative modes via modules parameters and userspace
hinting.  First, the module can be loaded with "tsc_auto=1" as a module
parameter, which turns on conservative TSC trapping only when it is
required (when unstable TSC or faster KHZ CPU is detected).

For userspace hinting, we enable trapping only if necessary.  Userspace
can hint that a VM needs a fixed frequency TSC, and also that SMP
stability will be required.  In that case, we conservatively turn on
trapping when it is needed.  In addition, users may now specify the
desired TSC rate at which to run.  If this rate differs significantly
from the host rate, trapping will be enabled.

There is also an override control to allow TSC trapping to be turned on
or off unconditionally for testing.

We indicate to pvclock users that the TSC is being trapped, to allow
avoiding overhead and directly using RDTSCP (only for SVM).  This
optimization is not yet implemented.

 

When migrating, the implementation could switch from non-trapped to trapped, 
making it less attractive. The guest however does not get notified about this 
change. Same for the other way around.

   

That's a policy decision to be made by the userspace agent.  It's better than 
the current situation, where there is no control at all of TSC rate.  Here, 
we're flexible either way.

Also note that when moving to a faster processor, trapping kicks in... but the processor 
is faster, so no actual loss is noticed, and the problem corrects itself when the VM 
is power-cycled.
 

Hrm. But even then the guest should be notified so it can act accordingly 
and just recalibrate instead of reboot, no? I'm not saying this is particularly 
interesting for kvmclock-enabled guests, but think of all the < 2.6.2x Linux, 
*BSD, Solaris, Windows etc. VMs out there that might have an easy means of 
triggering recalibration (or at least could introduce it), while writing a new 
clock source is a lot of work.
   


That's why I implemented trapping.  So they can migrate and we don't 
need to change the OS.



Of course, sending the notification through a userspace agent would also work. 
That one would have to be notified about the change too though.
   


It's far too complex and far too small a use case to be worth the 
effort.  Windows doesn't particularly care, and most HALs can be 
switched into a mode where the TSC is not used.


Linux actually does support CPU frequency recalibration, but it is 
triggered differently based on the particular form of CPU frequency 
switching supported by the platform / chipset.  Since that isn't 
universal, and we pass through many features of the hardware (CPUID and 
such), there is no reliable way I know of to emulate CPU frequency 
switching for the guest without kernel modifications.  The best bet 
there would be a kernel module providing a KVM cpufreq driver, which 
could be ported to the relevant non-clocksource kernels.


This amount of effort, however, raises the question - if you are going to 
all this trouble, why not port kvmclock support to those kernels?


Solaris 10 and later do have some better virtualization-friendly clock 
support.  BSD - we'd probably have to trap.


Again, if the overhead is significant, so be it.  Today you have no choice 
but to accept sloppy timekeeping.  You lose nothing with this patch, but 
do gain the flexibility to choose either correct TSC timekeeping or 
native-speed TSC.  There are scenarios where both of those can be met 
(uniform-speed deployment / virt-friendly guest), there are scenarios 
where sloppy timekeeping is appropriate (KVM clock used), and there are 
scenarios where correct timekeeping is appropriate (BSD, earlier 
TSC-based Linux, or user-space TSC required).


   

Would it make sense to add a kvmclock interrupt to notify the guest of such a 
change?
   

kvmclock is immune to frequency changes, so it needs no interrupt, it just has 
a version controlled shared area, which is reset.
 


   

We indicate to pvclock users that the TSC is being trapped, to allow
avoiding overhead and directly using RDTSCP (only for SVM).  This
optimization is not yet implemented.
 
   

That doesn't sound to me like they're unaffected?
   


On Intel RDTSCP traps along with RDTSC.  This means that you can't have 
a trapping, constant rate TSC for userspace without also paying the 
overhead for reading the TSC for kvmclock.  This is not true on SVM, 
where RDTSCP is a separate trap, allowing optimization.
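
For reference, the SVM side of that optimization would only need the
intercept mask (a sketch along the lines of svm_set_tsc_trapping from
the patch; the RDTSCP line is the hypothetical addition):

    static void svm_set_tsc_trapping(struct kvm_vcpu *vcpu, bool trap)
    {
        struct vcpu_svm *svm = to_svm(vcpu);

        if (trap) {
            svm->vmcb->control.intercept |= 1ULL << INTERCEPT_RDTSC;
            /* leave RDTSCP untrapped so pvclock users keep a cheap
             * time source (not possible on VMX, where the two
             * instructions share an intercept) */
            svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_RDTSCP);
        } else {
            svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_RDTSC);
        }
    }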


Zach

Re: [KVM TSC trapping / migration 1/2] Add TSC trapping for SVM and VMX

2011-01-06 Thread Zachary Amsden

On 01/06/2011 01:32 AM, Avi Kivity wrote:

On 01/06/2011 12:10 PM, Zachary Amsden wrote:

Reasons to trap the TSC are numerous, but we want to avoid it as much
as possible for performance reasons.

We provide two conservative modes via modules parameters and userspace
hinting.  First, the module can be loaded with "tsc_auto=1" as a module
parameter, which turns on conservative TSC trapping only when it is
required (when unstable TSC or faster KHZ CPU is detected).

For userspace hinting, we enable trapping only if necessary.  Userspace
can hint that a VM needs a fixed frequency TSC, and also that SMP
stability will be required.  In that case, we conservatively turn on
trapping when it is needed.  In addition, users may now specify the
desired TSC rate at which to run.  If this rate differs significantly
from the host rate, trapping will be enabled.

There is also an override control to allow TSC trapping to be turned on
or off unconditionally for testing.

We indicate to pvclock users that the TSC is being trapped, to allow
avoiding overhead and directly using RDTSCP (only for SVM).  This
optimization is not yet implemented.

Signed-off-by: Zachary Amsden
---
  arch/x86/include/asm/kvm_host.h|6 +-
  arch/x86/include/asm/pvclock-abi.h |1 +
  arch/x86/kvm/svm.c |   20 ++
  arch/x86/kvm/vmx.c |   21 +++
  arch/x86/kvm/x86.c |  113 
+---

  arch/x86/kvm/x86.h |2 +
  include/linux/kvm.h|   15 +
  7 files changed, 168 insertions(+), 10 deletions(-)



Haven't reviewed yet, but Documentation/kvm/api.txt is missing here.



That will be included when I port to upstream head.  When dealing with 
software documentation, too much is never enough.


Zach


Re: [KVM TSC trapping / migration 1/2] Add TSC trapping for SVM and VMX

2011-01-06 Thread Zachary Amsden

On 01/06/2011 12:41 AM, Alexander Graf wrote:

Am 06.01.2011 um 11:10 schrieb Zachary Amsden:

   

Reasons to trap the TSC are numerous, but we want to avoid it as much
as possible for performance reasons.

We provide two conservative modes via modules parameters and userspace
hinting.  First, the module can be loaded with "tsc_auto=1" as a module
parameter, which turns on conservative TSC trapping only when it is
required (when unstable TSC or faster KHZ CPU is detected).

For userspace hinting, we enable trapping only if necessary.  Userspace
can hint that a VM needs a fixed frequency TSC, and also that SMP
stability will be required.  In that case, we conservatively turn on
trapping when it is needed.  In addition, users may now specify the
desired TSC rate at which to run.  If this rate differs significantly
from the host rate, trapping will be enabled.

There is also an override control to allow TSC trapping to be turned on
or off unconditionally for testing.

We indicate to pvclock users that the TSC is being trapped, to allow
avoiding overhead and directly using RDTSCP (only for SVM).  This
optimization is not yet implemented.
 

When migrating, the implementation could switch from non-trapped to trapped, 
making it less attractive. The guest however does not get notified about this 
change. Same for the other way around.
   


That's a policy decision to be made by the userspace agent.  It's better 
than the current situation, where there is no control at all of TSC 
rate.  Here, we're flexible either way.


Also note that when moving to a faster processor, trapping kicks in... but the 
processor is faster, so no actual loss is noticed, and the problem 
corrects itself when the VM is power-cycled.



Would it make sense to add a kvmclock interrupt to notify the guest of such a 
change?


kvmclock is immune to frequency changes, so it needs no interrupt; it 
just has a version-controlled shared area, which is reset.



Re: [KVM TSC trapping / migration 2/2] Add TSC KHZ MSR

2011-01-06 Thread Zachary Amsden

On 01/06/2011 12:34 AM, Alexander Graf wrote:

Am 06.01.2011 um 11:10 schrieb Zachary Amsden:

   

Use an MSR to allow "soft" migration to hosts which do not support
TSC trapping.  Rather than make this a required element of any
migration protocol, we allow the TSC rate to be exported as a data
field (useful in its own right), but we also allow a one time write
of the MSR during VM creation.  The result is that for the common
use case, no protocol change is required to communicate TSC rate
to the receiving host.

This allows administrative tools to configure migration policy
as they see appropriate.  Rather than dictate this policy with the
KVM implementation, we properly allow migration to hosts which both
do and do not support setting of the TSC rate on the receiving end.
If it is wished to not support migration to a host which lacks
support for the TSC rate feature, that can be coordinated externally.
 

Isn't there a real hw equivalent of such a register? It might make more sense 
to just implement that then.

   


Unfortunately, no.


[KVM TSC trapping / migration 1/2] Add TSC trapping for SVM and VMX

2011-01-06 Thread Zachary Amsden
Reasons to trap the TSC are numerous, but we want to avoid it as much
as possible for performance reasons.

We provide two conservative modes via modules parameters and userspace
hinting.  First, the module can be loaded with "tsc_auto=1" as a module
parameter, which turns on conservative TSC trapping only when it is
required (when unstable TSC or faster KHZ CPU is detected).

For userspace hinting, we enable trapping only if necessary.  Userspace
can hint that a VM needs a fixed frequency TSC, and also that SMP
stability will be required.  In that case, we conservatively turn on
trapping when it is needed.  In addition, users may now specify the
desired TSC rate at which to run.  If this rate differs significantly
from the host rate, trapping will be enabled.

There is also an override control to allow TSC trapping to be turned on
or off unconditionally for testing.

We indicate to pvclock users that the TSC is being trapped, to allow
avoiding overhead and directly using RDTSCP (only for SVM).  This
optimization is not yet implemented.

Signed-off-by: Zachary Amsden 
---
 arch/x86/include/asm/kvm_host.h|6 +-
 arch/x86/include/asm/pvclock-abi.h |1 +
 arch/x86/kvm/svm.c |   20 ++
 arch/x86/kvm/vmx.c |   21 +++
 arch/x86/kvm/x86.c |  113 +---
 arch/x86/kvm/x86.h |2 +
 include/linux/kvm.h|   15 +
 7 files changed, 168 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ff651b7..6cce67a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -452,6 +452,8 @@ struct kvm_arch {
u32 virtual_tsc_khz;
u32 virtual_tsc_mult;
s8 virtual_tsc_shift;
+   bool tsc_trapping;
+   u32 tsc_flags;
 
struct kvm_xen_hvm_config xen_hvm_config;
 
@@ -575,6 +577,8 @@ struct kvm_x86_ops {
int (*get_lpage_level)(void);
bool (*rdtscp_supported)(void);
void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
+   void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
+   void (*set_tsc_trapping)(struct kvm_vcpu *vcpu, bool trap);
 
void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
 
@@ -582,8 +586,6 @@ struct kvm_x86_ops {
 
bool (*has_wbinvd_exit)(void);
 
-   void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
-
const struct trace_print_flags *exit_reasons_str;
 };
 
diff --git a/arch/x86/include/asm/pvclock-abi.h 
b/arch/x86/include/asm/pvclock-abi.h
index 35f2d19..315ead5 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -40,5 +40,6 @@ struct pvclock_wall_clock {
 } __attribute__((__packed__));
 
 #define PVCLOCK_TSC_STABLE_BIT (1 << 0)
+#define PVCLOCK_TSC_TRAPPED_BIT (1 << 1)
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PVCLOCK_ABI_H */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c929d00..af48be9 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -806,6 +806,8 @@ static void init_vmcb(struct vcpu_svm *svm)
(1ULL << INTERCEPT_MONITOR) |
(1ULL << INTERCEPT_MWAIT);
 
+   kvm_setup_tsc_trapping(&svm->vcpu);
+
control->iopm_base_pa = iopm_base;
control->msrpm_base_pa = __pa(svm->msrpm);
control->int_ctl = V_INTR_MASKING_MASK;
@@ -1038,6 +1040,15 @@ static void svm_clear_vintr(struct vcpu_svm *svm)
svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_VINTR);
 }
 
+static void svm_set_tsc_trapping(struct kvm_vcpu *vcpu, bool trap)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+   if (trap)
+   svm->vmcb->control.intercept |= 1ULL << INTERCEPT_RDTSC;
+   else
+   svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_RDTSC);
+}
+
 static struct vmcb_seg *svm_seg(struct kvm_vcpu *vcpu, int seg)
 {
struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save;
@@ -2497,6 +2508,13 @@ static int task_switch_interception(struct vcpu_svm *svm)
return 1;
 }
 
+static int rdtsc_interception(struct vcpu_svm *svm)
+{
+   svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
+   kvm_read_tsc(&svm->vcpu);
+   return 1;
+}
+
 static int cpuid_interception(struct vcpu_svm *svm)
 {
svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
@@ -2833,6 +2851,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm) = 
{
[SVM_EXIT_SMI]  = nop_on_interception,
[SVM_EXIT_INIT] = nop_on_interception,
[SVM_EXIT_VINTR]= interrupt_window_interception,
+   [SVM_EXIT_RDTSC]= rdtsc_interception,
[SVM_EXIT_CPUID]   

[KVM TSC trapping / migration 2/2] Add TSC KHZ MSR

2011-01-06 Thread Zachary Amsden
Use an MSR to allow "soft" migration to hosts which do not support
TSC trapping.  Rather than make this a required element of any
migration protocol, we allow the TSC rate to be exported as a data
field (useful in its own right), but we also allow a one time write
of the MSR during VM creation.  The result is that for the common
use case, no protocol change is required to communicate TSC rate
to the receiving host.

This allows administrative tools to configure migration policy
as they see fit.  Rather than dictate this policy in the KVM
implementation, we allow migration both to hosts which support
setting the TSC rate on the receiving end and to hosts which do not.
If migration to a host lacking the TSC rate feature should be
disallowed, that can be coordinated externally.
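
For illustration, the one-time write could be issued by the receiving
VMM roughly as follows (a sketch; error handling and the vcpu_fd
plumbing are assumed, and the helper name is hypothetical):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MSR_KVM_TSC_KHZ 0x4b564d02	/* added by this patch */

/* Write the source host's TSC rate into the guest once, at creation. */
static int set_guest_tsc_khz(int vcpu_fd, __u64 tsc_khz)
{
	struct {
		struct kvm_msrs info;
		struct kvm_msr_entry entry;
	} msrs;

	memset(&msrs, 0, sizeof(msrs));
	msrs.info.nmsrs = 1;
	msrs.entry.index = MSR_KVM_TSC_KHZ;
	msrs.entry.data = tsc_khz;

	return ioctl(vcpu_fd, KVM_SET_MSRS, &msrs.info);
}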

Signed-off-by: Zachary Amsden 
---
 arch/x86/include/asm/kvm_para.h |1 +
 arch/x86/kvm/x86.c  |   28 ++--
 include/linux/kvm.h |1 +
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 7b562b6..723e60d 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -32,6 +32,7 @@
 /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
 #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
 #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
+#define MSR_KVM_TSC_KHZ 0x4b564d02
 
 #define KVM_MAX_MMU_OP_BATCH   32
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bbcd582..ff5addc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -786,10 +786,11 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN 7
+#define KVM_SAVE_MSRS_BEGIN 8
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
+   MSR_KVM_TSC_KHZ,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1608,6 +1609,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
}
break;
}
+   case MSR_KVM_TSC_KHZ: {
+   struct kvm_arch *arch = &vcpu->kvm->arch;
+   /*
+* If userspace indicates a fixed rate TSC, accept attempts
+* to set the TSC rate to support incoming migrations.
+* This is a write once MSR, to prevent guest tampering.
+*/
+   if (arch->tsc_flags & KVM_TSC_FLAG_KHZ_MSR_WRITABLE) {
+   if (data > KVM_TSC_MAX_KHZ || data < KVM_TSC_MIN_KHZ)
+   return 1;
+
+   spin_lock(&arch->clock_lock);
+   kvm_arch_set_tsc_khz(vcpu->kvm, data);
+   kvm_setup_tsc_trapping(vcpu);
+   arch->tsc_flags &= ~KVM_TSC_FLAG_KHZ_MSR_WRITABLE;
+   spin_unlock(&arch->clock_lock);
+   }
+   break;
+   }
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1884,6 +1904,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case MSR_KVM_SYSTEM_TIME_NEW:
data = vcpu->arch.time;
break;
+   case MSR_KVM_TSC_KHZ:
+   data = vcpu->kvm->arch.virtual_tsc_khz;
+   break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -3615,7 +3638,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = -EINVAL;
if (user_tsc.flags &
~(KVM_TSC_FLAG_FIXED_RATE |
- KVM_TSC_FLAG_SMP_COHERENCY))
+ KVM_TSC_FLAG_SMP_COHERENCY |
+ KVM_TSC_FLAG_KHZ_MSR_WRITABLE))
goto out;
 
if (user_tsc.tsc_khz &&
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index cb97e53..0e316d8 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -628,6 +628,7 @@ struct kvm_tsc_control {
 
 #define KVM_TSC_FLAG_FIXED_RATE (1 << 0)
 #define KVM_TSC_FLAG_SMP_COHERENCY (1 << 1)
+#define KVM_TSC_FLAG_KHZ_MSR_WRITABLE  (1 << 2)
 
 #define KVM_TSC_MIN_KHZ 16000  /* 16 MHz, slower than first Pentium */
 #define KVM_TSC_MAX_KHZ 100000000  /* 100 GHz, good for a few years */
-- 
1.7.1



KVM TSC trapping

2011-01-06 Thread Zachary Amsden
On top of my last patchset, I now implement TSC trapping and
a flexible migration scheme for maintaining stable TSC across
migration.  Since it is administratively configured, it can
be selectively enabled only for VMs which require it.  In
particular, VMs which use KVM clock probably do not want it.

We will need some administrative controls in qemu and in libvirt
to make full use of this, but the design allows flexible and
fairly simple control.  In addition, users can override these
settings with module parameters to forcibly disable or enable
TSC trapping globally, for testing, workarounds, or performance.
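
The global override could be wired up roughly like this (a sketch only;
the parameter name and tri-state encoding are illustrative, not taken
from the patches):

#include <linux/module.h>

/* Illustrative tri-state override for TSC trapping:
 * -1 = follow per-VM policy, 0 = force off, 1 = force on.
 */
static int tsc_trap_override = -1;
module_param(tsc_trap_override, int, S_IRUGO);
MODULE_PARM_DESC(tsc_trap_override,
		 "Force TSC trapping on (1) or off (0); -1 follows per-VM policy");

static bool want_tsc_trapping(struct kvm *kvm)
{
	if (tsc_trap_override >= 0)
		return tsc_trap_override != 0;
	return kvm->arch.tsc_trapping;
}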

This may be slightly out of date, but I would like to solicit
feedback on the patches, especially #2.

--Zach



Re: [KVM Clock Synchronization 2/4] Keep TSC synchronized across host suspend

2011-01-04 Thread Zachary Amsden

On 01/04/2011 05:36 AM, Marcelo Tosatti wrote:

On Tue, Dec 28, 2010 at 07:38:18PM -1000, Zachary Amsden wrote:
   

During a host suspend, TSC may go backwards, which KVM interprets
as an unstable TSC.  Technically, KVM should not be marking the
TSC unstable, which causes the TSC clocksource to go bad, but
should be adjusting the TSC offsets in such a case.

Signed-off-by: Zachary Amsden
---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/x86.c  |   66 +++---
  2 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9e6fe39..63a82b0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -386,6 +386,7 @@ struct kvm_vcpu_arch {
u64 last_kernel_ns;
u64 last_tsc_nsec;
u64 last_tsc_write;
+   u64 tsc_offset_adjustment;
bool tsc_catchup;

bool nmi_pending;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b9118f4..b509c01 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2042,12 +2042,20 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
}

kvm_x86_ops->vcpu_load(vcpu, cpu);
+
+   /* Apply any externally detected TSC adjustments (due to suspend) */
+   if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
+   kvm_x86_ops->adjust_tsc_offset(vcpu,
+   vcpu->arch.tsc_offset_adjustment);
+   vcpu->arch.tsc_offset_adjustment = 0;
+   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+   }
if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
/* Make sure TSC doesn't go backwards */
s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
native_read_tsc() - vcpu->arch.last_host_tsc;
if (tsc_delta < 0)
-   mark_tsc_unstable("KVM discovered backwards TSC");
+   WARN_ON(!check_tsc_unstable());
if (check_tsc_unstable()) {
kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
vcpu->arch.tsc_catchup = 1;
@@ -5778,14 +5786,60 @@ int kvm_arch_hardware_enable(void *garbage)
  {
struct kvm *kvm;
struct kvm_vcpu *vcpu;
-   int i;
+   int i, ret;
+   u64 local_tsc;
+   u64 max_tsc = 0;
+   bool stable, backwards_tsc = false;

kvm_shared_msr_cpu_online();
-   list_for_each_entry(kvm, &vm_list, vm_list)
-   kvm_for_each_vcpu(i, vcpu, kvm)
-   if (vcpu->cpu == smp_processor_id())
+   local_tsc = native_read_tsc();
+   stable = !check_tsc_unstable();
+   ret = kvm_x86_ops->hardware_enable(garbage);
+   if (ret)
+   return ret;
+
+   list_for_each_entry(kvm, &vm_list, vm_list) {
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   if (!stable && vcpu->cpu == smp_processor_id())
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-   return kvm_x86_ops->hardware_enable(garbage);
+   if (stable && vcpu->arch.last_host_tsc > local_tsc) {
+   backwards_tsc = true;
+   if (vcpu->arch.last_host_tsc > max_tsc)
+   max_tsc = vcpu->arch.last_host_tsc;
+   }
+   }
+   }
+
+   /*
+* Sometimes, reliable TSCs go backwards.  This happens
+* on platforms that reset TSC during suspend or hibernate
+* actions, but maintain synchronization.  We must compensate.
+* Unfortunately, we can't bring the TSCs fully up to date
+* with real time.  The reason is that we aren't yet far
+* enough into CPU bringup that we know how much real time
+* has actually elapsed; our helper function get_kernel_ns()
+* will be using boot variables that haven't been updated yet.
+* We simply find the maximum observed TSC above, then record
+* the adjustment to TSC in each VCPU.  When the VCPU later
+* gets loaded, the adjustment will be applied.  Note that we
+* accumulate adjustments, in case multiple suspend cycles
+* happen before the VCPU gets a chance to run again.
+*
+* Note that unreliable TSCs will be compensated already by
+* the logic in vcpu_load, which sets the TSC to catchup mode.
+* This will catch up all VCPUs to real time, but cannot
+* guarantee synchronization.
+*/
+   if (backwards_tsc) {
+   u64 delta_cyc = max_tsc - local_tsc;
+   list_for_each_entry(kvm, &vm_list, vm_list)
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+

Re: [KVM Clock Synchronization 4/4] Add master clock for KVM clock

2011-01-04 Thread Zachary Amsden

On 01/04/2011 08:20 AM, Marcelo Tosatti wrote:

On Tue, Dec 28, 2010 at 07:38:20PM -1000, Zachary Amsden wrote:
   

On systems with synchronized TSCs, we still have VCPU individual
KVM clocks, each with their own computed offset.  As this all happens
at different times, the computed KVM clock offset can vary, causing a
globally visible backwards clock.  Currently this is protected against
by using an atomic compare to ensure it does not happen.

This change should remove that requirement.

Signed-off-by: Zachary Amsden
---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/x86.c  |   42 ++-
  2 files changed, 42 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8d829b8..ff651b7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -445,6 +445,7 @@ struct kvm_arch {
unsigned long irq_sources_bitmap;
s64 kvmclock_offset;
spinlock_t clock_lock;
+   struct pvclock_vcpu_time_info master_clock;
u64 last_tsc_nsec;
u64 last_tsc_offset;
u64 last_tsc_write;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59d5999..a339e50 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1116,6 +1116,38 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
return 0;

/*
+* If there is a stable TSC, we use a master reference clock for
+* the KVM clock; otherwise, individual computations for each VCPU
+* would exhibit slight drift relative to each other, which could
+* cause global time to go backwards.
+*
+* If the master clock has no TSC timestamp, we must recompute the
+* clock, because either some real time has elapsed during a suspend
+* cycle, or we are measuring the clock for the first time during VM
+* creation (or following a migration).  Master clock changes should
+* happen only on rare occasions, so we can ignore the precautions
+* below.
+*/
+   if (!check_tsc_unstable()) {
+   struct pvclock_vcpu_time_info *master =
+   &v->kvm->arch.master_clock;
+   if (vcpu->hv_clock.version != master->version) {
+   spin_lock(&v->kvm->arch.clock_lock);
+   WARN_ON(master->version < vcpu->hv_clock.version);
+   if (!master->tsc_timestamp) {
+   pr_debug("KVM: computing new master clock\n");
+   update_pvclock(v, master, tsc_timestamp,
+   kernel_ns, tsc_khz);
+   }
+   memcpy(&vcpu->hv_clock, master, sizeof(*master));
+   spin_unlock(&v->kvm->arch.clock_lock);
+   update_user_kvmclock(v, &vcpu->hv_clock);
+   } else
+   pr_debug("ignoring spurious KVM clock update");
+   return 0;
+   }
 

This assumes guest TSC is synchronized across vcpus... Is this always
true?
   


By the kernel definition of stable TSC, yes.


Also, for stable TSC hosts, kvmclock update is performed only on VM
creation / host resume these days... Can you describe the problem in
more detail?
   


The problem is that even though it is done only once, all the VCPUs
perform the kvmclock update at different times, so they measure
different kernel_ns and hardware TSC values.  The measurement spread
is only +/- a few hundred cycles, but because there is a difference,
the global view of the KVM clock across multiple VCPUs can go
backwards.
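
A concrete illustration with made-up numbers: assume a 1 GHz TSC, so
one cycle is one nanosecond.  VCPU0 records (tsc_timestamp = 1,000,000,
system_time = 500,000 ns); VCPU1's measurement lands 50 ns low within
the spread, recording (1,000,200, 500,150 ns).  A guest read on VCPU0
at TSC 1,001,000 computes 500,000 + 1,000 = 501,000 ns; a read 20
cycles later on VCPU1 computes 500,150 + 820 = 500,970 ns, a 30 ns
step backwards even though real time moved forward.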


Zach


[KVM Clock Synchronization 3/4] Refactor KVM clock update code

2010-12-28 Thread Zachary Amsden
Refactor this to make upcoming steps easier to follow.
This should be 100% code motion and renaming.

Signed-off-by: Zachary Amsden 
---
 arch/x86/include/asm/kvm_host.h |2 +-
 arch/x86/kvm/x86.c  |   77 +++
 2 files changed, 46 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 63a82b0..8d829b8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -444,7 +444,7 @@ struct kvm_arch {
 
unsigned long irq_sources_bitmap;
s64 kvmclock_offset;
-   spinlock_t tsc_write_lock;
+   spinlock_t clock_lock;
u64 last_tsc_nsec;
u64 last_tsc_offset;
u64 last_tsc_write;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b509c01..59d5999 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -996,7 +996,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
s64 sdiff;
u64 delta;
 
-   spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
+   spin_lock_irqsave(&kvm->arch.clock_lock, flags);
offset = data - native_read_tsc();
ns = get_kernel_ns();
elapsed = ns - kvm->arch.last_tsc_nsec;
@@ -1034,7 +1034,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
kvm->arch.last_tsc_offset = offset;
}
kvm_x86_ops->write_tsc_offset(vcpu, offset);
-   spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+   spin_unlock_irqrestore(&kvm->arch.clock_lock, flags);
 
/* Reset of TSC must disable overshoot protection below */
vcpu->arch.hv_clock.tsc_timestamp = 0;
@@ -1043,22 +1043,50 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 }
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+static void update_pvclock(struct kvm_vcpu *v,
+  struct pvclock_vcpu_time_info *pvclock,
+  u64 tsc_timestamp,
+  s64 kernel_ns,
+  unsigned long tsc_khz)
+{
+   if (unlikely(v->arch.hw_tsc_khz != tsc_khz)) {
+   kvm_get_time_scale(NSEC_PER_SEC / 1000, tsc_khz,
+  &pvclock->tsc_shift,
+  &pvclock->tsc_to_system_mul);
+   v->arch.hw_tsc_khz = tsc_khz;
+   }
+   pvclock->tsc_timestamp = tsc_timestamp;
+   pvclock->system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
+   pvclock->flags = 0;
+}
+
+static void update_user_kvmclock(struct kvm_vcpu *v,
+struct pvclock_vcpu_time_info *pvclock)
+{
+   struct kvm_vcpu_arch *vcpu = &v->arch;
+   void *shared_kaddr;
+
+   shared_kaddr = kmap_atomic(vcpu->time_page, KM_USER0);
+   memcpy(shared_kaddr + vcpu->time_offset, pvclock, sizeof(*pvclock));
+   kunmap_atomic(shared_kaddr, KM_USER0);
+   mark_page_dirty(v->kvm, vcpu->time >> PAGE_SHIFT);
+}
+
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
unsigned long flags;
struct kvm_vcpu_arch *vcpu = &v->arch;
-   void *shared_kaddr;
-   unsigned long this_tsc_khz;
-   s64 kernel_ns, max_kernel_ns;
+   unsigned long tsc_khz;
+   s64 kernel_ns;
u64 tsc_timestamp;
 
/* Keep irq disabled to prevent changes to the clock */
local_irq_save(flags);
kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
kernel_ns = get_kernel_ns();
-   this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
+   tsc_khz = __get_cpu_var(cpu_tsc_khz);
 
-   if (unlikely(this_tsc_khz == 0)) {
+   if (unlikely(tsc_khz == 0)) {
local_irq_restore(flags);
kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
return 1;
@@ -1108,32 +1136,23 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 * guest.  To protect against this, we must compute the system time as
 * observed by the guest and ensure the new system time is greater.
 */
-   max_kernel_ns = 0;
if (vcpu->hv_clock.tsc_timestamp && vcpu->last_guest_tsc) {
-   max_kernel_ns = vcpu->last_guest_tsc -
-   vcpu->hv_clock.tsc_timestamp;
+   s64 max_kernel_ns = vcpu->last_guest_tsc -
+   vcpu->hv_clock.tsc_timestamp;
max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
vcpu->hv_clock.tsc_to_system_mul,
vcpu->hv_clock.tsc_shift);
max_kernel_ns += vcpu->last_kernel_ns;
+   if (max_kernel_ns > kernel_ns)
+   kernel_ns = max_kernel_ns;
}
 
-   if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
-   kvm_get_time_scale(NSEC_PER_SEC / 1000, this_tsc_khz,
-

[KVM Clock Synchronization 4/4] Add master clock for KVM clock

2010-12-28 Thread Zachary Amsden
On systems with synchronized TSCs, we still have VCPU individual
KVM clocks, each with their own computed offset.  As this all happens
at different times, the computed KVM clock offset can vary, causing a
globally visible backwards clock.  Currently this is protected against
by using an atomic compare to ensure it does not happen.

This change should remove that requirement.
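
To see why a single shared record suffices: each guest read computes
(field names as in struct pvclock_vcpu_time_info; the shift shorthand
glosses over pvclock_scale_delta's 64-bit intermediate handling)

    time = system_time + ((tsc - tsc_timestamp) * tsc_to_system_mul) >> tsc_shift

With one master record, every VCPU uses identical system_time,
tsc_timestamp, and scale factors, so for two reads at tsc1 <= tsc2 on
a host with synchronized, monotonic TSCs:

    time2 - time1 = ((tsc2 - tsc1) * tsc_to_system_mul) >> tsc_shift >= 0

The backwards case caused by per-VCPU offsets can no longer occur.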

Signed-off-by: Zachary Amsden 
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/x86.c  |   42 ++-
 2 files changed, 42 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8d829b8..ff651b7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -445,6 +445,7 @@ struct kvm_arch {
unsigned long irq_sources_bitmap;
s64 kvmclock_offset;
spinlock_t clock_lock;
+   struct pvclock_vcpu_time_info master_clock;
u64 last_tsc_nsec;
u64 last_tsc_offset;
u64 last_tsc_write;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59d5999..a339e50 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1116,6 +1116,38 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
return 0;
 
/*
+* If there is a stable TSC, we use a master reference clock for
+* the KVM clock; otherwise, individual computations for each VCPU
+* would exhibit slight drift relative to each other, which could
+* cause global time to go backwards.
+*
+* If the master clock has no TSC timestamp, we must recompute the
+* clock, because either some real time has elapsed during a suspend
+* cycle, or we are measuring the clock for the first time during VM
+* creation (or following a migration).  Master clock changes should
+* happen only on rare occasions, so we can ignore the precautions
+* below.
+*/
+   if (!check_tsc_unstable()) {
+   struct pvclock_vcpu_time_info *master =
+   &v->kvm->arch.master_clock;
+   if (vcpu->hv_clock.version != master->version) {
+   spin_lock(&v->kvm->arch.clock_lock);
+   WARN_ON(master->version < vcpu->hv_clock.version);
+   if (!master->tsc_timestamp) {
+   pr_debug("KVM: computing new master clock\n");
+   update_pvclock(v, master, tsc_timestamp,
+   kernel_ns, tsc_khz);
+   }
+   memcpy(&vcpu->hv_clock, master, sizeof(*master));
+   spin_unlock(&v->kvm->arch.clock_lock);
+   update_user_kvmclock(v, &vcpu->hv_clock);
+   } else
+   pr_debug("ignoring spurious KVM clock update");
+   return 0;
+   }
+
+   /*
 * Time as measured by the TSC may go backwards when resetting the base
 * tsc_timestamp.  The reason for this is that the TSC resolution is
 * higher than the resolution of the other clock scales.  Thus, many
@@ -3482,7 +3514,11 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = 0;
now_ns = get_kernel_ns();
delta = user_ns.clock - now_ns;
+   spin_lock(&kvm->arch.clock_lock);
+   kvm->arch.master_clock.version += 2;
+   kvm->arch.master_clock.tsc_timestamp = 0;
kvm->arch.kvmclock_offset = delta;
+   spin_unlock(&kvm->arch.clock_lock);
break;
}
case KVM_GET_CLOCK: {
@@ -5845,11 +5881,14 @@ int kvm_arch_hardware_enable(void *garbage)
 */
if (backwards_tsc) {
u64 delta_cyc = max_tsc - local_tsc;
-   list_for_each_entry(kvm, &vm_list, vm_list)
+   list_for_each_entry(kvm, &vm_list, vm_list) {
kvm_for_each_vcpu(i, vcpu, kvm) {
vcpu->arch.tsc_offset_adjustment += delta_cyc;
vcpu->arch.last_host_tsc = 0;
}
+   kvm->arch.master_clock.tsc_timestamp = 0;
+   kvm->arch.master_clock.version += 2;
+   }
}
 
return 0;
@@ -5965,6 +6004,7 @@ struct  kvm *kvm_arch_create_vm(void)
set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
 
spin_lock_init(&kvm->arch.clock_lock);
+   kvm->arch.master_clock.version = 1000;
 
return kvm;
 }
-- 
1.7.1


