paul-rogers commented on a change in pull request #2026: DRILL-7330: Implement
metadata usage for all format plugins
URL: https://github.com/apache/drill/pull/2026#discussion_r394695247
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasyGroupScan.java
##########
@@ -90,17 +95,14 @@ public EasyGroupScan(
@JsonProperty("selectionRoot") Path selectionRoot,
@JsonProperty("schema") TupleMetadata schema
) throws IOException {
- super(ImpersonationUtil.resolveUserName(userName));
+ super(ImpersonationUtil.resolveUserName(userName), columns, ValueExpressions.BooleanExpression.TRUE);
this.selection = FileSelection.create(null, files, selectionRoot);
this.formatPlugin = engineRegistry.resolveFormat(storageConfig, formatConfig, EasyFormatPlugin.class);
this.columns = columns == null ? ALL_COLUMNS : columns;
this.selectionRoot = selectionRoot;
- SimpleFileTableMetadataProviderBuilder builder =
- (SimpleFileTableMetadataProviderBuilder)
- new FileSystemMetadataProviderManager()
- .builder(MetadataProviderManager.MetadataProviderKind.SCHEMA_STATS_ONLY);
- this.metadataProvider = builder.withLocation(selection.getSelectionRoot())
+ this.metadataProvider = defaultTableMetadataProviderBuilder(new FileSystemMetadataProviderManager())
Review comment:
This will be a huge problem in an actual production system, but fixing it is
beyond the scope of this PR. I suggest the team think about how this can work
in the longer term. As noted, Impala struggled with this issue for years, so
it is not simple.
One answer is to know when directories change: use cached metadata for
unchanged directories (which will be most of the data history) and expand only
those that are "live" (i.e., partitions for the last day or two).
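To make the idea concrete, here is a minimal sketch of such a cache. It is not Drill code: the class `DirectoryMetadataCache`, its methods, and the use of a directory modification time as the change signal are all assumptions made for illustration. It also assumes the file system reports a modification time that reliably changes when a directory's contents change, which is not true everywhere.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch (not part of Drill): caches per-directory file listings
 * and re-expands a directory only when its modification time has changed.
 */
class DirectoryMetadataCache {

  /** A cached listing plus the directory modification time it was built from. */
  private static final class Entry {
    final long modificationTime;
    final List<String> files;

    Entry(long modificationTime, List<String> files) {
      this.modificationTime = modificationTime;
      this.files = files;
    }
  }

  private final Map<String, Entry> entries = new HashMap<>();
  // Stand-in for the expensive file system listing call.
  private final Function<String, List<String>> lister;
  private int expansions;

  DirectoryMetadataCache(Function<String, List<String>> lister) {
    this.lister = lister;
  }

  /**
   * Returns the files of a directory. The cached listing is reused while the
   * supplied modification time matches the one recorded at caching time;
   * otherwise the directory is listed again and the cache entry is replaced.
   */
  List<String> filesFor(String directory, long currentModificationTime) {
    Entry cached = entries.get(directory);
    if (cached != null && cached.modificationTime == currentModificationTime) {
      return cached.files; // unchanged directory: no file system listing
    }
    expansions++; // new or "live" directory: expand it
    List<String> files =
        Collections.unmodifiableList(new ArrayList<>(lister.apply(directory)));
    entries.put(directory, new Entry(currentModificationTime, files));
    return files;
  }

  /** Number of times the expensive listing was actually performed. */
  int expansionCount() {
    return expansions;
  }
}
```

With this shape, old partitions cost one cheap modification-time check per query, and only partitions that are still receiving files are listed again. A real implementation would also need eviction, persistence, and handling for nested directories, none of which is shown here.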
Caching is essential for good performance. In the past, Drill was s-l-o-w
when reading the cached Parquet metadata.
But, again, let's leave this issue for another project.