paul-rogers commented on a change in pull request #2026: DRILL-7330: Implement metadata usage for all format plugins
URL: https://github.com/apache/drill/pull/2026#discussion_r392607688
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##########
 @@ -634,6 +642,62 @@ public NonInterestingColumnsMetadata getNonInterestingColumnsMetadata() {
     return nonInterestingColumnsMetadata;
   }
 
+  /**
+   * Returns {@link TableMetadataProviderBuilder} instance based on specified
+   * {@link MetadataProviderManager} source.
+   *
+   * @param source metadata provider manager
+   * @return {@link TableMetadataProviderBuilder} instance
+   */
+  protected abstract TableMetadataProviderBuilder tableMetadataProviderBuilder(MetadataProviderManager source);
+
+  /**
+   * Returns {@link TableMetadataProviderBuilder} instance which may provide metadata
+   * without using Drill Metastore.
+   *
+   * @param source metadata provider manager
+   * @return {@link TableMetadataProviderBuilder} instance
+   */
+  protected abstract TableMetadataProviderBuilder defaultTableMetadataProviderBuilder(MetadataProviderManager source);
+
+  /**
+   * Compares the last modified time of files obtained from specified selection with
+   * the Metastore last modified time to determine whether Metastore metadata
+   * is not outdated. If metadata is outdated, {@link MetadataException} will be thrown.
+   *
+   * @param selection the source of files to check
+   * @throws MetadataException if metadata is outdated
+   */
+  protected void checkMetadataConsistency(FileSelection selection, Configuration fsConf) throws IOException {
+    if (metadataProvider.checkMetadataVersion()) {
+      DrillFileSystem fileSystem =
+          ImpersonationUtil.createFileSystem(ImpersonationUtil.resolveUserName(getUserName()), fsConf);
+
+      List<FileStatus> fileStatuses = FileMetadataInfoCollector.getFileStatuses(selection, fileSystem);
+
+      long lastModifiedTime = metadataProvider.getTableMetadata().getLastModifiedTime();
+
+      Set<Path> removedFiles = new HashSet<>(metadataProvider.getFilesMetadataMap().keySet());
+      Set<Path> newFiles = new HashSet<>();
+
+      boolean isChanged = false;
+
+      for (FileStatus fileStatus : fileStatuses) {
+        if (!removedFiles.remove(Path.getPathWithoutSchemeAndAuthority(fileStatus.getPath()))) {
+          newFiles.add(fileStatus.getPath());
+        }
+        if (fileStatus.getModificationTime() > lastModifiedTime) {
+          isChanged = true;
+          break;
+        }
+      }
 
 Review comment:
   The check above may be costly for tables with millions of files. Do we have
a way for an external system to notify us when a file is added? (HDFS has a
partially completed feature that sends an event when the file system changes.)
Or, can Drill work with the files it already knows about until an external
system (or the user) tells us to refresh? Impala had a `REFRESH METADATA`
command for this purpose. One of the big problems was that people would run it
before every query; as a result, Impala spent more time refreshing metadata
than doing actual work. It would be nice if Drill didn't have to learn that
same lesson.
   
   Can we run that refresh in the background, so that queries continue to run
against the existing metadata while we gather the new set?
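   
   To make the background-refresh idea concrete, here is a minimal sketch. All
names in it (`RefreshingMetadataCache`, `refreshAsync`, the `loader` supplier)
are hypothetical and are not existing Drill classes or APIs. It assumes that a
complete metadata snapshot can be built off to the side and then swapped in
atomically; readers keep seeing the old snapshot until the swap.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/**
 * Hypothetical sketch (not a Drill class): serves the last complete metadata
 * snapshot to queries while a replacement snapshot is gathered in the background.
 */
final class RefreshingMetadataCache<T> {
  private final AtomicReference<T> current;
  private final Supplier<T> loader;     // assumed: builds a full, fresh snapshot
  private final Executor executor;      // assumed: pool used for the slow scan
  private CompletableFuture<T> inFlight; // guarded by "this"

  RefreshingMetadataCache(T initial, Supplier<T> loader, Executor executor) {
    this.current = new AtomicReference<>(initial);
    this.loader = loader;
    this.executor = executor;
  }

  /** Never blocks: queries plan against whatever snapshot is current. */
  T get() {
    return current.get();
  }

  /**
   * Starts a refresh unless one is already running, in which case the running
   * one is returned. This way repeated refresh requests share a single scan.
   */
  synchronized CompletableFuture<T> refreshAsync() {
    if (inFlight != null && !inFlight.isDone()) {
      return inFlight;
    }
    inFlight = CompletableFuture.supplyAsync(loader, executor)
        .thenApply(fresh -> {
          // Swap only after the new snapshot is complete. If the loader
          // throws, this step is skipped and the old snapshot stays in place.
          current.set(fresh);
          return fresh;
        });
    return inFlight;
  }
}
```

   In this sketch an external notification or a user command would call
`refreshAsync()`, while the planner only ever calls `get()`. Sharing the
in-flight refresh is one possible way to blunt the "refresh before every query"
pattern; whether that fits Drill's metadata provider design is an open question.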
   
   Again, consider the case that @dobesv recently described: millions of files,
with data constantly arriving.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
