paul-rogers commented on a change in pull request #2026: DRILL-7330: Implement
metadata usage for all format plugins
URL: https://github.com/apache/drill/pull/2026#discussion_r392607688
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
##########
@@ -634,6 +642,62 @@ public NonInterestingColumnsMetadata getNonInterestingColumnsMetadata() {
return nonInterestingColumnsMetadata;
}
+ /**
+ * Returns {@link TableMetadataProviderBuilder} instance based on specified
+ * {@link MetadataProviderManager} source.
+ *
+ * @param source metadata provider manager
+ * @return {@link TableMetadataProviderBuilder} instance
+ */
+ protected abstract TableMetadataProviderBuilder tableMetadataProviderBuilder(MetadataProviderManager source);
+
+ /**
+ * Returns {@link TableMetadataProviderBuilder} instance which may provide metadata
+ * without using Drill Metastore.
+ *
+ * @param source metadata provider manager
+ * @return {@link TableMetadataProviderBuilder} instance
+ */
+ protected abstract TableMetadataProviderBuilder defaultTableMetadataProviderBuilder(MetadataProviderManager source);
+
+ /**
+ * Compares the last modified time of files obtained from specified selection with
+ * the Metastore last modified time to determine whether Metastore metadata
+ * is not outdated. If metadata is outdated, {@link MetadataException} will be thrown.
+ *
+ * @param selection the source of files to check
+ * @throws MetadataException if metadata is outdated
+ */
+ protected void checkMetadataConsistency(FileSelection selection, Configuration fsConf) throws IOException {
+ if (metadataProvider.checkMetadataVersion()) {
+ DrillFileSystem fileSystem =
+ ImpersonationUtil.createFileSystem(ImpersonationUtil.resolveUserName(getUserName()), fsConf);
+
+ List<FileStatus> fileStatuses = FileMetadataInfoCollector.getFileStatuses(selection, fileSystem);
+
+ long lastModifiedTime = metadataProvider.getTableMetadata().getLastModifiedTime();
+
+ Set<Path> removedFiles = new HashSet<>(metadataProvider.getFilesMetadataMap().keySet());
+ Set<Path> newFiles = new HashSet<>();
+
+ boolean isChanged = false;
+
+ for (FileStatus fileStatus : fileStatuses) {
+ if (!removedFiles.remove(Path.getPathWithoutSchemeAndAuthority(fileStatus.getPath()))) {
+ newFiles.add(fileStatus.getPath());
+ }
+ if (fileStatus.getModificationTime() > lastModifiedTime) {
+ isChanged = true;
+ break;
+ }
+ }
Review comment:
The above will be a very costly operation for a table with millions of files.
Is there a way for an external system to ping us when a file is added? Or
could Drill work with the files it already knows about until an external
system tells us to refresh? And could we run that refresh in the background,
so that queries continue to run against the existing metadata while we gather
the new set? Again, imagine the case that @dobesv recently explained: millions
of files, data constantly arriving.
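
To make the last suggestion concrete, here is a minimal sketch of the
"keep serving the old metadata while a refresh runs" idea. It is not Drill
code: RefreshableMetadata, Snapshot, snapshot() and onExternalChange() are
hypothetical names invented for illustration, and the file listing is
stood in for by a plain Supplier. The assumption is that queries only ever
read an immutable snapshot, and that an external notification triggers the
expensive scan off the query path.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/**
 * Hypothetical sketch, not part of Drill: holds an immutable metadata
 * snapshot for queries and swaps in a new one when a background scan ends.
 */
final class RefreshableMetadata {

  /** Immutable view of file path -> last modified time (illustrative only). */
  static final class Snapshot {
    final Map<String, Long> fileModTimes;

    Snapshot(Map<String, Long> fileModTimes) {
      // Defensive copy so later changes to the input cannot leak into readers.
      this.fileModTimes = Collections.unmodifiableMap(new HashMap<>(fileModTimes));
    }
  }

  private final AtomicReference<Snapshot> current;
  // Stand-in for the expensive file-system listing (an assumption).
  private final Supplier<Map<String, Long>> scanner;

  RefreshableMetadata(Map<String, Long> initial, Supplier<Map<String, Long>> scanner) {
    this.current = new AtomicReference<>(new Snapshot(initial));
    this.scanner = scanner;
  }

  /** Query path: never lists files, only returns the current snapshot. */
  Snapshot snapshot() {
    return current.get();
  }

  /**
   * Called when an external system reports a change. The scan runs on a
   * background thread; the new snapshot is published atomically when done.
   */
  CompletableFuture<Void> onExternalChange() {
    return CompletableFuture.supplyAsync(scanner)
        .thenAccept(files -> current.set(new Snapshot(files)));
  }
}
```

A query that has already called snapshot() keeps its view even if a refresh
completes mid-planning, so no per-query file listing is needed. If two
refreshes overlap, this sketch simply lets the last one to finish win; a
real implementation would probably need to order or coalesce them.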