[
https://issues.apache.org/jira/browse/DRILL-7330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059505#comment-17059505
]
ASF GitHub Bot commented on DRILL-7330:
---------------------------------------
paul-rogers commented on pull request #2026: DRILL-7330: Implement metadata
usage for all format plugins
URL: https://github.com/apache/drill/pull/2026#discussion_r392623969
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasyGroupScan.java
##########
@@ -124,13 +127,16 @@ public EasyGroupScan(
// use file system metadata provider without specified schema and statistics
metadataProviderManager = new FileSystemMetadataProviderManager();
}
- SimpleFileTableMetadataProviderBuilder builder =
- (SimpleFileTableMetadataProviderBuilder) metadataProviderManager.builder(
- MetadataProviderManager.MetadataProviderKind.SCHEMA_STATS_ONLY);
+ DrillFileSystem fs =
+ ImpersonationUtil.createFileSystem(ImpersonationUtil.resolveUserName(userName), formatPlugin.getFsConf());
- this.metadataProvider = builder.withLocation(selection.getSelectionRoot())
+ this.metadataProvider = tableMetadataProviderBuilder(metadataProviderManager)
+ .withSelection(selection)
+ .withFileSystem(fs)
.build();
+ this.usedMetastore = metadataProviderManager.usesMetastore();
initFromSelection(selection, formatPlugin);
+ checkMetadataConsistency(selection, formatPlugin.getFsConf());
Review comment:
This has been nagging at me. For Parquet, metadata includes both partition
information and information about the insides of files (row groups, etc.). But
for files other than Parquet, the metadata holds no useful information about
file contents. As a result, the entire benefit of metadata is to assist with
partition pruning: metadata avoids the need to walk the directory tree.
However, in order to ensure that the metadata is consistent we... walk the
directory tree.
So, for files other than Parquet, are we gaining anything (other than more
complexity) by using metadata if we must check the tree on each query?
There *is* a gain if we can trust the metadata and avoid the walk of the
directory tree. (See comments elsewhere which no longer appear in this code
view since my comments overlapped with your next round of changes.)
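To make the trade-off concrete, here is a minimal sketch in plain Java. It is
not Drill code: the class and method names are hypothetical, `java.nio.file`
stands in for `DrillFileSystem`, and a set of paths stands in for the file list
held in metadata. It only illustrates that the checked path pays for a full
walk on every query, while the trusted path touches the file system not at all.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

// Hypothetical illustration only; none of these names exist in Drill.
final class MetadataConsistencySketch {

  /** Lists every regular file under root; cost grows with the size of the tree. */
  static Set<Path> walkTree(Path root) {
    try (Stream<Path> paths = Files.walk(root)) {
      Set<Path> files = new TreeSet<>();
      paths.filter(Files::isRegularFile).forEach(files::add);
      return files;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Consistency check: needs the same walk that metadata was meant to avoid. */
  static boolean isConsistent(Set<Path> filesFromMetadata, Path root) {
    return walkTree(root).equals(filesFromMetadata);
  }

  /** Assumed planner entry point: trusting the metadata skips the walk. */
  static Set<Path> filesToScan(Set<Path> filesFromMetadata, Path root, boolean trustMetadata) {
    if (trustMetadata) {
      return filesFromMetadata; // no file system access at all
    }
    // Checked path: one full walk per query, with or without metadata.
    Set<Path> actual = walkTree(root);
    return actual.equals(filesFromMetadata) ? filesFromMetadata : actual;
  }
}
```

In the checked branch the metadata file list never saves any work: the walk
already produces the answer, which is the concern raised above.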
> Implement metadata usage for text format plugin
> -----------------------------------------------
>
> Key: DRILL-7330
> URL: https://issues.apache.org/jira/browse/DRILL-7330
> Project: Apache Drill
> Issue Type: Sub-task
> Reporter: Arina Ielchiieva
> Assignee: Vova Vysotskyi
> Priority: Major
> Fix For: 1.18.0
>
>
> 1. Change the current group scan to leverage Schema from Metastore;
> 2. Use stats to enable additional logical planning rules for the text format
> plugin. This will enable optimizations such as limit push-down, filter push-down and so on.
> + add possibility to pass schema through schema file (using path or table
> root), inline.
> + check for other enhancements in analyze command