[GitHub] [hive] pvary commented on a change in pull request #3053: HIVE-25980: Support HiveMetaStoreChecker.checkTable operation with multi-threaded

GitBox Fri, 25 Feb 2022 00:54:58 -0800


pvary commented on a change in pull request #3053:
URL: https://github.com/apache/hive/pull/3053#discussion_r814583976




##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java
##########
@@ -303,56 +304,132 @@ void checkTable(Table table, PartitionIterable parts, byte[] filterExp, CheckRes
     if (tablePath == null) {
       return;
     }
-    FileSystem fs = tablePath.getFileSystem(conf);
-    if (!fs.exists(tablePath)) {
+    final FileSystem[] fs = {tablePath.getFileSystem(conf)};
+    if (!fs[0].exists(tablePath)) {
       result.getTablesNotOnFs().add(table.getTableName());
       return;
     }
 
     Set<Path> partPaths = new HashSet<>();
 
-    // check that the partition folders exist on disk
-    for (Partition partition : parts) {
-      if (partition == null) {
-        // most likely the user specified an invalid partition
-        continue;
-      }
-      Path partPath = getDataLocation(table, partition);
-      if (partPath == null) {
-        continue;
-      }
-      fs = partPath.getFileSystem(conf);
+    int threadCount = MetastoreConf.getIntVar(conf, MetastoreConf.ConfVars.METASTORE_MSCK_FS_HANDLER_THREADS_COUNT);
+
+    final ExecutorService pool = (threadCount > 1) ?
+        Executors.newFixedThreadPool(threadCount,
+            new ThreadFactoryBuilder()
+                .setDaemon(true)
+                .setNameFormat("CheckTable-PartitionOptimizer-%d").build()) : null;
 
-      CheckResult.PartitionResult prFromMetastore = new CheckResult.PartitionResult();
-      prFromMetastore.setPartitionName(getPartitionName(table, partition));
-      prFromMetastore.setTableName(partition.getTableName());
-      if (!fs.exists(partPath)) {
-        result.getPartitionsNotOnFs().add(prFromMetastore);
+    try {
+      Queue<Future<String>> futures = new LinkedList<>();
+      if (pool != null) {
+        // check that the partition folders exist on disk using multi-thread
+        for (Partition partition : parts) {

Review comment:
   I think this will fetch all of the partitions from the partition iterator immediately and keep them in memory.
   
   The goal of the partition iterator was to prevent OOM on big tables with a huge number of partitions. We do not want every partition in memory at once, so the iterator fetches them in batches, and once a batch is no longer used we let the GC take care of it.
   
   With this change I expect that we create a `Future` immediately for every partition, and that we keep all of the partitions in memory until all of the checks are finished.
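
   One possible way to address the concern above is to cap the number of outstanding `Future`s, so the loop only pulls the next partition from the iterator once an older check has completed. The following is a minimal sketch under stated assumptions, not the code from this PR: `BoundedPartitionCheck`, `checkBounded`, and `maxInFlight` are hypothetical names, and plain integers stand in for `Partition` objects. It uses only `java.util.concurrent`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of Hive.
class BoundedPartitionCheck {

  /**
   * Runs check on every item from the iterator using the pool, while keeping
   * at most maxInFlight submitted-but-unfinished tasks (and therefore at most
   * that many items referenced by pending tasks) at any time.
   */
  public static <T, R> List<R> checkBounded(Iterator<T> items, Function<T, R> check,
      ExecutorService pool, int maxInFlight)
      throws InterruptedException, ExecutionException {
    Queue<Future<R>> inFlight = new ArrayDeque<>();
    List<R> results = new ArrayList<>();
    while (items.hasNext()) {
      // Before pulling another item, wait for the oldest task if the cap is reached.
      if (inFlight.size() >= maxInFlight) {
        results.add(inFlight.remove().get());
      }
      T item = items.next();
      Callable<R> task = () -> check.apply(item);
      inFlight.add(pool.submit(task));
    }
    // Drain the tasks that are still pending after the iterator is exhausted.
    while (!inFlight.isEmpty()) {
      results.add(inFlight.remove().get());
    }
    return results;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      // Integers stand in for partitions; the check just builds a name.
      List<Integer> ids = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {
        ids.add(i);
      }
      List<String> out = checkBounded(ids.iterator(), i -> "partition-" + i, pool, 8);
      System.out.println(out.size() + " partitions checked");
    } finally {
      pool.shutdown();
    }
  }
}
```

   With a cap like this, the batching behaviour of the iterator is preserved as long as `maxInFlight` is not much larger than the batch size, since a batch becomes unreachable once its tasks finish. Waiting on the oldest future first is the simplest policy; an `ExecutorCompletionService` would instead let the loop continue as soon as any task completes.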




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


