[
https://issues.apache.org/jira/browse/SENTRY-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407239#comment-16407239
]
Na Li commented on SENTRY-2184:
-------------------------------
The code to test the performance is
{code:java}
/**
 * Verifies complete snapshot of HMS Paths can be persisted and retrieved
 * properly.
 */
@Test
public void testRetrieveFullPathsImageUpdatePerformance() throws Exception {
  // Save a full path snapshot.
  Map<String, Collection<String>> authzPaths = new HashMap<>();
  String[] prefixes = {"/user/hive/warehouse"};
  String dbName = "db1";
  String tablePrefix = "table_";
  int tableCount = 1000;
  int pathCountPerTable = 33;
  String tableDirectory = prefixes[0] + "/" + dbName + "/";
  for (int tableIndex = 0; tableIndex < tableCount; tableIndex++) {
    String tableName = tablePrefix + tableIndex;
    HashSet<String> paths = new HashSet<>(pathCountPerTable);
    for (int pathIndex = 0; pathIndex < pathCountPerTable; pathIndex++) {
      paths.add(tableDirectory + tableName + "." + pathIndex);
    }
    authzPaths.put(dbName + "." + tableName, paths);
  }
  long notificationID = 11;
  sentryStore.persistFullPathsImage(authzPaths, notificationID);

  // Measure retrieval time without the fix.
  long before_noFix = System.nanoTime();
  PathsUpdate pathsUpdate_noFix =
      sentryStore.retrieveFullPathsImageUpdateOld(prefixes);
  long after_noFix = System.nanoTime();
  long delta_noFix = after_noFix - before_noFix;
  // Scaled values are per-table averages (elapsed time / tableCount).
  long delta_noFix_scaled = delta_noFix / tableCount;

  // Measure retrieval time with the fix.
  long before_withFix = System.nanoTime();
  PathsUpdate pathsUpdate_withfix =
      sentryStore.retrieveFullPathsImageUpdate(prefixes);
  long after_withFix = System.nanoTime();
  long delta_withFix = after_withFix - before_withFix;
  long delta_withFix_scaled = delta_withFix / tableCount;

  long diff = delta_noFix - delta_withFix;
  double change = (double) diff * 100 / delta_noFix;
  // Arguments follow the order of the labels: no fix first, then with fix.
  LOGGER.info("Total time for retrieveFullPathsImageUpdate is {} (no fix) versus "
      + "{} (with fix) nanoseconds", delta_noFix_scaled, delta_withFix_scaled);
  LOGGER.info("retrieveFullPathsImageUpdate normalized change is {} nanoseconds",
      delta_noFix_scaled - delta_withFix_scaled);
  LOGGER.info("With fix, retrieveFullPathsImageUpdate spends {}% of time without "
      + "fix", ((double) delta_withFix * 100) / delta_noFix);
}
{code}
The output on my machine is
{code:java}
2018-03-20 18:28:00,609 (main) [INFO - org.apache.sentry.hdfs.HMSPaths.<init>(HMSPaths.java:687)] HMSPaths:[/user/hive/warehouse] Initialized
2018-03-20 18:28:02,312 (main) [INFO - org.apache.sentry.hdfs.HMSPathsDumper.createPathsDump(HMSPathsDumper.java:81)] Paths Dump created. 33005 total path strings, 0 duplicate strings found, compacted to 0 unique strings.
2018-03-20 18:28:02,549 (main) [INFO - org.apache.sentry.hdfs.HMSPaths.<init>(HMSPaths.java:687)] HMSPaths:[/user/hive/warehouse] Initialized
2018-03-20 18:28:03,279 (main) [INFO - org.apache.sentry.hdfs.HMSPathsDumper.createPathsDump(HMSPathsDumper.java:81)] Paths Dump created. 33005 total path strings, 0 duplicate strings found, compacted to 0 unique strings.
2018-03-20 18:28:03,379 (main) [INFO - org.apache.sentry.provider.db.service.persistent.TestSentryStore.testRetrieveFullPathsImageUpdatePerformance(TestSentryStore.java:3769)] Total time for retrieveFullPathsImageUpdate is 833457 (no fix) versus 2097712 (with fix) nanoseconds
2018-03-20 18:28:03,380 (main) [INFO - org.apache.sentry.provider.db.service.persistent.TestSentryStore.testRetrieveFullPathsImageUpdatePerformance(TestSentryStore.java:3770)] retrieveFullPathsImageUpdate normalized change is 1264255 nanoseconds
2018-03-20 18:28:03,380 (main) [INFO - org.apache.sentry.provider.db.service.persistent.TestSentryStore.testRetrieveFullPathsImageUpdatePerformance(TestSentryStore.java:3771)] With fix, retrieveFullPathsImageUpdate spends 39.731749393862906% of time without fix
{code}
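This means that in my local test, where there is no network latency between the
Sentry server and the DB, getting the full path snapshot with the fix takes only
about 40% of the time it takes without the fix, an improvement of about 60%.
(The first log line above labels its two values the wrong way round: 833457 ns
is the per-table time with the fix and 2097712 ns is the per-table time without
it, which is what the 39.7% ratio reflects.)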
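As a quick cross-check of the figures in the log, the sketch below recomputes
the normalized change and the percentage from the two per-table values. It
assumes 833457 ns is the with-fix value, which is the only assignment consistent
with the reported 39.7%. The class and method names are made up for
illustration; only the two constants come from the output above.

```java
public class LoggedFigures {
    // Per-table averages copied from the log output (nanoseconds).
    static final long WITH_FIX_NS = 833_457L;
    static final long NO_FIX_NS = 2_097_712L;

    /** Difference between the two per-table averages, in nanoseconds. */
    static long normalizedChange() {
        return NO_FIX_NS - WITH_FIX_NS;
    }

    /** Time with the fix as a percentage of the time without it. */
    static double withFixPercent() {
        return WITH_FIX_NS * 100.0 / NO_FIX_NS;
    }

    public static void main(String[] args) {
        // Matches the logged "normalized change" of 1264255 ns.
        System.out.println("change (ns): " + normalizedChange());
        // Close to the logged percentage; it differs slightly because the
        // test computes its percentage from the unscaled deltas.
        System.out.println("with fix / no fix (%): " + withFixPercent());
    }
}
```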
In a real setup, the network latency between the Sentry server and the DB is
usually more than 1 millisecond. The logged figures are per-table averages (the
test divides the elapsed time by tableCount), so the local totals for 1000
tables are about 2098 ms without the fix and about 833 ms with it. If the
network latency is 1 ms per query,
* Without fix, each of the 1000 MPath queries takes at least 1 ms longer, so the
total is at least (2098 + 1000 = 3098) ms for this example (1000 tables).
* With fix, the total is around (833 + 1 = 834) ms for this example.
* Getting the full path snapshot without the fix then takes about 3.7 times as
long as with the fix, and this ratio grows as the latency increases.
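One way to redo this estimate from the raw measurements is sketched below. It
is only a rough model under stated assumptions: the logged values are per-table
averages (the test divides by tableCount), each query pays one extra network
round trip, the unfixed code issues one MPath query per table, and the fixed
code is treated as a single batched query. The class and method names are
hypothetical.

```java
public class LatencyEstimate {
    /**
     * Estimated total retrieval time in milliseconds: the locally measured
     * total (per-table average * table count) plus one round trip per query.
     */
    static double estimateMs(double perTableMs, int tableCount,
                             long queries, double latencyMs) {
        return perTableMs * tableCount + queries * latencyMs;
    }

    public static void main(String[] args) {
        int tableCount = 1000;
        double latencyMs = 1.0; // assumed round-trip latency per query

        // Without fix: assumed one MPath query per table.
        double noFix = estimateMs(2.097712, tableCount, tableCount, latencyMs);
        // With fix: assumed a single batched query.
        double withFix = estimateMs(0.833457, tableCount, 1, latencyMs);

        System.out.println("no fix (ms):   " + noFix);
        System.out.println("with fix (ms): " + withFix);
        System.out.println("ratio:         " + noFix / withFix);
    }
}
```

With these inputs the model gives roughly 3098 ms without the fix and 834 ms
with it, a ratio of about 3.7. Raising latencyMs widens the gap, because only
the unfixed path pays the latency once per table.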
> Performance Issue: MPath is queried for each MAuthzPathsMapping in full
> snapshot
> --------------------------------------------------------------------------------
>
> Key: SENTRY-2184
> URL: https://issues.apache.org/jira/browse/SENTRY-2184
> Project: Sentry
> Issue Type: Bug
> Components: Sentry
> Affects Versions: 2.1.0
> Reporter: Na Li
> Assignee: Na Li
> Priority: Critical
> Attachments: SENTRY-2184.001.patch
>
>
> MAuthzPathsMapping contains a list of MPath instances. The log messages show
> that when getting the full path snapshot in
> SentryStore.retrieveFullPathsImageCore(), DataNucleus issues a separate query
> for the MPath instances associated with each MAuthzPathsMapping. Therefore,
> getting the full path image may take a very long time.
> The solution is to get the MPath instances in a batch when getting the full
> path image.
> Log messages from when DataNucleus issues a query for all MPath instances
> associated with each MAuthzPathsMapping:
> {code:java}
> 1) Initially, all MAuthzPathsMapping entries for the current snapshot are queried.
> 2018-03-14 11:51:23,999 (main) [DEBUG -
> org.datanucleus.util.Log4JLogger.debug(Log4JLogger.java:58)] SELECT
> 'org.apache.sentry.provider.db.service.model.MAuthzPathsMapping' AS
> NUCLEUS_TYPE,A0.AUTHZ_OBJ_NAME,A0.AUTHZ_SNAPSHOT_ID,A0.CREATE_TIME_MS,A0.AUTHZ_OBJ_ID
> FROM AUTHZ_PATHS_MAPPING A0 WHERE A0.AUTHZ_SNAPSHOT_ID = <1>
> 2) Calling authzToPaths.getPathStrings() causes MPath to be queried for each
> AUTHZ_OBJ_ID
> 2018-03-14 11:52:27,700 (main) [DEBUG -
> org.datanucleus.util.Log4JLogger.debug(Log4JLogger.java:58)] SELECT
> 'org.apache.sentry.provider.db.service.model.MPath' AS
> NUCLEUS_TYPE,A0.PATH_NAME,A0.PATH_ID FROM AUTHZ_PATH A0 WHERE A0.AUTHZ_OBJ_ID
> = <1>{code}
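The access pattern described in the issue (one MPath query per
MAuthzPathsMapping) and the proposed batch retrieval can be illustrated with a
small in-memory sketch. This is not Sentry or DataNucleus code: FakeDb, PathRow
and the load methods are hypothetical stand-ins that only count how many
"queries" each approach issues.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchFetchSketch {
    /** One row of a hypothetical path table: owning object id and path name. */
    static final class PathRow {
        final long authzObjId;
        final String pathName;
        PathRow(long authzObjId, String pathName) {
            this.authzObjId = authzObjId;
            this.pathName = pathName;
        }
    }

    /** In-memory stand-in for the database that counts simulated queries. */
    static final class FakeDb {
        private final List<PathRow> rows = new ArrayList<>();
        int queryCount = 0;

        void insert(long authzObjId, String pathName) {
            rows.add(new PathRow(authzObjId, pathName));
        }

        /** Simulates one query filtered by object id (one round trip). */
        List<String> pathsFor(long authzObjId) {
            queryCount++;
            List<String> result = new ArrayList<>();
            for (PathRow r : rows) {
                if (r.authzObjId == authzObjId) {
                    result.add(r.pathName);
                }
            }
            return result;
        }

        /** Simulates one query returning every path row (one round trip). */
        List<PathRow> allPaths() {
            queryCount++;
            return new ArrayList<>(rows);
        }
    }

    /** Per-object loading: one simulated query for each object id. */
    static Map<Long, List<String>> loadPerObject(FakeDb db, List<Long> ids) {
        Map<Long, List<String>> out = new HashMap<>();
        for (long id : ids) {
            out.put(id, db.pathsFor(id));
        }
        return out;
    }

    /** Batched loading: one simulated query, grouped in memory afterwards. */
    static Map<Long, List<String>> loadBatched(FakeDb db) {
        Map<Long, List<String>> out = new HashMap<>();
        for (PathRow r : db.allPaths()) {
            out.computeIfAbsent(r.authzObjId, k -> new ArrayList<>()).add(r.pathName);
        }
        return out;
    }

    /** Builds a table with the given number of objects and paths per object. */
    static FakeDb sampleDb(int tables, int pathsPerTable) {
        FakeDb db = new FakeDb();
        for (long t = 0; t < tables; t++) {
            for (int p = 0; p < pathsPerTable; p++) {
                db.insert(t, "/user/hive/warehouse/db1/table_" + t + "." + p);
            }
        }
        return db;
    }

    public static void main(String[] args) {
        FakeDb db = sampleDb(1000, 33);
        List<Long> ids = new ArrayList<>();
        for (long t = 0; t < 1000; t++) {
            ids.add(t);
        }
        loadPerObject(db, ids);
        System.out.println("per-object queries: " + db.queryCount);
        db.queryCount = 0;
        loadBatched(db);
        System.out.println("batched queries: " + db.queryCount);
    }
}
```

Both methods build the same map; the difference is that the per-object version
issues one query per object id, so every extra table adds a round trip, while
the batched version issues a single query regardless of table count.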