vvysotskyi commented on a change in pull request #2026: DRILL-7330: Implement
metadata usage for all format plugins
URL: https://github.com/apache/drill/pull/2026#discussion_r392681623
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
##########
@@ -634,6 +642,62 @@ public NonInterestingColumnsMetadata getNonInterestingColumnsMetadata() {
return nonInterestingColumnsMetadata;
}
+ /**
+ * Returns {@link TableMetadataProviderBuilder} instance based on specified
+ * {@link MetadataProviderManager} source.
+ *
+ * @param source metadata provider manager
+ * @return {@link TableMetadataProviderBuilder} instance
+ */
+ protected abstract TableMetadataProviderBuilder tableMetadataProviderBuilder(MetadataProviderManager source);
+
+ /**
+ * Returns {@link TableMetadataProviderBuilder} instance which may provide metadata
+ * without using Drill Metastore.
+ *
+ * @param source metadata provider manager
+ * @return {@link TableMetadataProviderBuilder} instance
+ */
+ protected abstract TableMetadataProviderBuilder defaultTableMetadataProviderBuilder(MetadataProviderManager source);
+
+ /**
+ * Compares the last modified time of files obtained from specified selection with
+ * the Metastore last modified time to determine whether Metastore metadata
+ * is not outdated. If metadata is outdated, {@link MetadataException} will be thrown.
+ *
+ * @param selection the source of files to check
+ * @throws MetadataException if metadata is outdated
+ */
+ protected void checkMetadataConsistency(FileSelection selection, Configuration fsConf) throws IOException {
+ if (metadataProvider.checkMetadataVersion()) {
+ DrillFileSystem fileSystem =
+ ImpersonationUtil.createFileSystem(ImpersonationUtil.resolveUserName(getUserName()), fsConf);
+
+ List<FileStatus> fileStatuses = FileMetadataInfoCollector.getFileStatuses(selection, fileSystem);
+
+ long lastModifiedTime = metadataProvider.getTableMetadata().getLastModifiedTime();
+
+ Set<Path> removedFiles = new HashSet<>(metadataProvider.getFilesMetadataMap().keySet());
+ Set<Path> newFiles = new HashSet<>();
+
+ boolean isChanged = false;
+
+ for (FileStatus fileStatus : fileStatuses) {
+ if (!removedFiles.remove(Path.getPathWithoutSchemeAndAuthority(fileStatus.getPath()))) {
+ newFiles.add(fileStatus.getPath());
+ }
+ if (fileStatus.getModificationTime() > lastModifiedTime) {
+ isChanged = true;
+ break;
+ }
+ }
Review comment:
I agree that it may be costly. But if we skip this check, we may return
incorrect results. Regarding integration with external systems that could do
this for us, it is a good idea, but I don't know of any such systems.
Currently, we either use up-to-date metadata for queries or do not use it at
all.
Regarding auto-refresh, there is another Jira,
https://issues.apache.org/jira/browse/DRILL-7430, which is a placeholder for
further Metastore improvements, so we can discuss it there.
Regarding the case where data arrives continuously, I don't think it is a good
fit for the Metastore, since refreshing the metadata that often is too costly,
even with our incremental update.
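To make the trade-off concrete, the check under discussion can be sketched
without any Hadoop or Drill classes. This is only an illustration, not the
code from the PR: `MetadataStalenessSketch`, `isOutdated`, and the plain
`Map`/`Set` arguments are hypothetical stand-ins for `FileStatus`, `Path`, and
the metadata provider, and the final added/removed-files comparison is an
assumption, because the quoted hunk ends before that part of the method.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for the consistency check in the diff above.
final class MetadataStalenessSketch {

  /**
   * Returns true if the stored metadata no longer matches the listed files.
   *
   * @param currentFiles path -> modification time (ms) for files found by listing the selection
   * @param knownFiles paths recorded in the stored metadata
   * @param storedLastModified last modified time (ms) recorded in the stored metadata
   */
  static boolean isOutdated(Map<String, Long> currentFiles,
                            Set<String> knownFiles,
                            long storedLastModified) {
    // Assume every known file was removed, then cross off the ones still present.
    Set<String> removedFiles = new HashSet<>(knownFiles);
    Set<String> newFiles = new HashSet<>();

    for (Map.Entry<String, Long> file : currentFiles.entrySet()) {
      if (!removedFiles.remove(file.getKey())) {
        // The file exists now but is not recorded in the metadata.
        newFiles.add(file.getKey());
      }
      if (file.getValue() > storedLastModified) {
        // A file was modified after the metadata was collected.
        return true;
      }
    }
    // Assumption: any added or removed file also makes the metadata outdated.
    return !removedFiles.isEmpty() || !newFiles.isEmpty();
  }

  private MetadataStalenessSketch() {
  }
}
```

The loop is linear in the number of listed files; the expensive part is
presumably obtaining that listing for every query, which is the cost being
weighed against correctness here.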