[ 
https://issues.apache.org/jira/browse/SENTRY-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407239#comment-16407239
 ] 

Na Li commented on SENTRY-2184:
-------------------------------

The code to test the performance is
{code:java}
/**
 * Verifies that a complete snapshot of HMS paths can be persisted and
 * retrieved, and compares retrieval time with and without the fix.
 */
@Test
public void testRetrieveFullPathsImageUpdatePerformance() throws Exception {

  // Save a full path snapshot: 1000 tables with 33 paths each.
  Map<String, Collection<String>> authzPaths = new HashMap<>();
  String[] prefixes = {"/user/hive/warehouse"};
  String dbName = "db1";
  String tablePrefix = "table_";
  int tableCount = 1000;
  int pathCountPerTable = 33;

  String tableDirectory = prefixes[0] + "/" + dbName + "/";
  for (int tableIndex = 0; tableIndex < tableCount; tableIndex++) {
    String tableName = tablePrefix + tableIndex;
    HashSet<String> paths = new HashSet<>(pathCountPerTable);
    for (int pathIndex = 0; pathIndex < pathCountPerTable; pathIndex++) {
      paths.add(tableDirectory + tableName + "." + pathIndex);
    }

    authzPaths.put(dbName + "." + tableName, paths);
  }

  long notificationID = 11;
  sentryStore.persistFullPathsImage(authzPaths, notificationID);

  // Measure retrieval time without the fix.
  long before_noFix = System.nanoTime();
  PathsUpdate pathsUpdate_noFix =
      sentryStore.retrieveFullPathsImageUpdateOld(prefixes);
  long after_noFix = System.nanoTime();
  long delta_noFix = after_noFix - before_noFix;
  // Scaled values are per-table averages, not totals.
  long delta_noFix_scaled = delta_noFix / tableCount;

  // Measure retrieval time with the fix.
  long before_withFix = System.nanoTime();
  PathsUpdate pathsUpdate_withFix =
      sentryStore.retrieveFullPathsImageUpdate(prefixes);
  long after_withFix = System.nanoTime();
  long delta_withFix = after_withFix - before_withFix;
  long delta_withFix_scaled = delta_withFix / tableCount;

  // Arguments follow the placeholder order: no fix first, then with fix.
  LOGGER.info("Total time for retrieveFullPathsImageUpdate is {} (no fix) versus "
      + "{} (with fix) nanoseconds", delta_noFix_scaled, delta_withFix_scaled);
  LOGGER.info("retrieveFullPathsImageUpdate normalized change is {} nanoseconds",
      delta_noFix_scaled - delta_withFix_scaled);
  LOGGER.info("With fix, retrieveFullPathsImageUpdate spends {}% of time without "
      + "fix", ((double) delta_withFix * 100) / delta_noFix);

}{code}
 

The output on my machine is
{code:java}
2018-03-20 18:28:00,609 (main) [INFO - org.apache.sentry.hdfs.HMSPaths.<init>(HMSPaths.java:687)] HMSPaths:[/user/hive/warehouse] Initialized
2018-03-20 18:28:02,312 (main) [INFO - org.apache.sentry.hdfs.HMSPathsDumper.createPathsDump(HMSPathsDumper.java:81)] Paths Dump created. 33005 total path strings, 0 duplicate strings found, compacted to 0 unique strings.
2018-03-20 18:28:02,549 (main) [INFO - org.apache.sentry.hdfs.HMSPaths.<init>(HMSPaths.java:687)] HMSPaths:[/user/hive/warehouse] Initialized
2018-03-20 18:28:03,279 (main) [INFO - org.apache.sentry.hdfs.HMSPathsDumper.createPathsDump(HMSPathsDumper.java:81)] Paths Dump created. 33005 total path strings, 0 duplicate strings found, compacted to 0 unique strings.
2018-03-20 18:28:03,379 (main) [INFO - org.apache.sentry.provider.db.service.persistent.TestSentryStore.testRetrieveFullPathsImageUpdatePerformance(TestSentryStore.java:3769)] Total time for retrieveFullPathsImageUpdate is 833457 (no fix) versus 2097712 (with fix) nanoseconds
2018-03-20 18:28:03,380 (main) [INFO - org.apache.sentry.provider.db.service.persistent.TestSentryStore.testRetrieveFullPathsImageUpdatePerformance(TestSentryStore.java:3770)] retrieveFullPathsImageUpdate normalized change is 1264255 nanoseconds
2018-03-20 18:28:03,380 (main) [INFO - org.apache.sentry.provider.db.service.persistent.TestSentryStore.testRetrieveFullPathsImageUpdatePerformance(TestSentryStore.java:3771)] With fix, retrieveFullPathsImageUpdate spends 39.731749393862906% of time without fix
{code}
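In my local test, where there is no network latency between the Sentry 
server and the DB, getting the full path snapshot with the fix takes only 
about 40% of the time it takes without the fix, an improvement of about 60%. 
Note that the labels in the first log line of the output are swapped: 833457 
ns is the per-table time with the fix and 2097712 ns is the per-table time 
without it.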

In a real setup, network latency is usually more than 1 millisecond. The logged 
values are per-table averages (elapsed time divided by tableCount = 1000), so 
the measured totals are about 2098 ms without the fix and about 833 ms with it. 
If the network latency between the Sentry server and the DB is 1 ms,
 * Without the fix, each of the 1000 MPath queries takes at least 1 ms longer. 
The total will be at least (1000 + 2098 = 3098) ms for this example (1000 tables).
 * With the fix, the total will be around (1 + 833 = 834) ms for this 
example.
 * Getting the full path snapshot without the fix therefore takes about 3.7 
times as long as getting it with the fix, and the gap widens as latency grows.
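
The latency estimate can be reproduced with a short calculation. The sketch below is only illustrative: the class and method names are hypothetical, the two per-table timings are taken from the log output, and it assumes that the logged values are per-table averages (the test divides each elapsed time by tableCount), that the run without the fix issues one MPath query per table, and that the run with the fix needs a single round trip.

```java
public class SnapshotLatencyEstimate {

    /**
     * Estimated total retrieval time in milliseconds: the locally measured
     * total (per-table average times table count) plus one network latency
     * for every database round trip.
     */
    static double estimateTotalMs(long perTableNanos, int tableCount,
                                  long roundTrips, double latencyMs) {
        double measuredTotalMs = perTableNanos * (double) tableCount / 1_000_000.0;
        return measuredTotalMs + roundTrips * latencyMs;
    }

    public static void main(String[] args) {
        int tableCount = 1000;
        double latencyMs = 1.0; // assumed latency between Sentry server and DB

        // Without the fix: assumed one MPath query per table.
        double withoutFix = estimateTotalMs(2_097_712L, tableCount, tableCount, latencyMs);
        // With the fix: assumed a single batched round trip.
        double withFix = estimateTotalMs(833_457L, tableCount, 1, latencyMs);

        System.out.printf("without fix: %.1f ms, with fix: %.1f ms, ratio: %.2f%n",
                withoutFix, withFix, withoutFix / withFix);
    }
}
```

Changing latencyMs shows how quickly the per-table queries come to dominate the total as the network gets slower.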

 

> Performance Issue: MPath is queried for each MAuthzPathsMapping in full 
> snapshot
> --------------------------------------------------------------------------------
>
>                 Key: SENTRY-2184
>                 URL: https://issues.apache.org/jira/browse/SENTRY-2184
>             Project: Sentry
>          Issue Type: Bug
>          Components: Sentry
>    Affects Versions: 2.1.0
>            Reporter: Na Li
>            Assignee: Na Li
>            Priority: Critical
>         Attachments: SENTRY-2184.001.patch
>
>
> MAuthzPathsMapping contains a list of MPath instances. The log messages show 
> that, when getting the full path snapshot in 
> SentryStore.retrieveFullPathsImageCore(), DataNucleus issues a separate query 
> for the MPath instances associated with each MAuthzPathsMapping. Therefore, 
> getting the full path image may take a very long time.
> The solution is to get the MPath instances in a batch when getting the full 
> path image.
> Log messages when DataNucleus issues a query for the MPath instances 
> associated with each MAuthzPathsMapping:
> {code:java}
> 1) Initially, all MAuthzPathsMapping entries for the current snapshot are queried.
> 2018-03-14 11:51:23,999 (main) [DEBUG - 
> org.datanucleus.util.Log4JLogger.debug(Log4JLogger.java:58)] SELECT 
> 'org.apache.sentry.provider.db.service.model.MAuthzPathsMapping' AS 
> NUCLEUS_TYPE,A0.AUTHZ_OBJ_NAME,A0.AUTHZ_SNAPSHOT_ID,A0.CREATE_TIME_MS,A0.AUTHZ_OBJ_ID
>  FROM AUTHZ_PATHS_MAPPING A0 WHERE A0.AUTHZ_SNAPSHOT_ID = <1>
> 2) Calling authzToPaths.getPathStrings() causes MPath to be queried for each 
> AUTHZ_OBJ_ID.
> 2018-03-14 11:52:27,700 (main) [DEBUG - 
> org.datanucleus.util.Log4JLogger.debug(Log4JLogger.java:58)] SELECT 
> 'org.apache.sentry.provider.db.service.model.MPath' AS 
> NUCLEUS_TYPE,A0.PATH_NAME,A0.PATH_ID FROM AUTHZ_PATH A0 WHERE A0.AUTHZ_OBJ_ID 
> = <1>{code}
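
To illustrate the difference between querying MPath once per MAuthzPathsMapping and fetching the paths in a batch, here is a minimal, self-contained sketch. It is not Sentry or DataNucleus code: FakePathTable is a hypothetical in-memory stand-in for the AUTHZ_PATH table that only counts how many queries are issued, and the batch variant simply reads every row, whereas a real query would restrict the rows to the current snapshot.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchFetchSketch {

    /** One row of the hypothetical table: (AUTHZ_OBJ_ID, PATH_NAME). */
    static final class PathRow {
        final long authzObjId;
        final String pathName;

        PathRow(long authzObjId, String pathName) {
            this.authzObjId = authzObjId;
            this.pathName = pathName;
        }
    }

    /** Hypothetical in-memory table that counts the queries issued against it. */
    static final class FakePathTable {
        private final List<PathRow> rows = new ArrayList<>();
        int queryCount = 0;

        void add(long authzObjId, String pathName) {
            rows.add(new PathRow(authzObjId, pathName));
        }

        /** Models "SELECT ... WHERE AUTHZ_OBJ_ID = ?": one query per object. */
        List<String> selectByObjId(long authzObjId) {
            queryCount++;
            List<String> result = new ArrayList<>();
            for (PathRow row : rows) {
                if (row.authzObjId == authzObjId) {
                    result.add(row.pathName);
                }
            }
            return result;
        }

        /** Models a single query that returns all rows at once. */
        List<PathRow> selectAll() {
            queryCount++;
            return new ArrayList<>(rows);
        }
    }

    /** One query per object: the number of queries grows with the table count. */
    static Map<Long, List<String>> fetchPerObject(FakePathTable table, List<Long> ids) {
        Map<Long, List<String>> result = new HashMap<>();
        for (Long id : ids) {
            result.put(id, table.selectByObjId(id));
        }
        return result;
    }

    /** One query in total: rows are grouped by object id in memory. */
    static Map<Long, List<String>> fetchInBatch(FakePathTable table) {
        Map<Long, List<String>> result = new HashMap<>();
        for (PathRow row : table.selectAll()) {
            result.computeIfAbsent(row.authzObjId, k -> new ArrayList<>()).add(row.pathName);
        }
        return result;
    }

    public static void main(String[] args) {
        FakePathTable table = new FakePathTable();
        List<Long> ids = new ArrayList<>();
        for (long id = 0; id < 1000; id++) {
            ids.add(id);
            table.add(id, "/user/hive/warehouse/db1/table_" + id);
        }

        fetchPerObject(table, ids);
        int perObjectQueries = table.queryCount;

        table.queryCount = 0;
        fetchInBatch(table);
        int batchQueries = table.queryCount;

        System.out.println("per-object queries: " + perObjectQueries
                + ", batch queries: " + batchQueries);
    }
}
```

Both variants build the same map when every object has at least one path; the batch variant trades many round trips for one larger result set that is grouped in memory.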



