[GitHub] [drill] paul-rogers commented on a change in pull request #2026: DRILL-7330: Implement metadata usage for all format plugins

GitBox Wed, 18 Mar 2020 16:37:25 -0700

paul-rogers commented on a change in pull request #2026: DRILL-7330: Implement 
metadata usage for all format plugins
URL: https://github.com/apache/drill/pull/2026#discussion_r394698856


 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/BaseParquetMetadataProvider.java
 ##########
 @@ -103,34 +110,39 @@
   // whether metadata for row groups should be collected to create files, partitions and table metadata
   private final boolean collectMetadata = false;
 
-  public BaseParquetMetadataProvider(List<ReadEntryWithPath> entries,
-                                     ParquetReaderConfig readerConfig,
-                                     String tableName,
-                                     Path tableLocation,
-                                     TupleMetadata schema,
-                                     DrillStatsTable statsTable) {
-    this(readerConfig, entries, tableName, tableLocation, schema, statsTable);
-  }
+  protected BaseParquetMetadataProvider(Builder<?> builder) {
+    if (builder.entries != null) {
+      // reuse previously stored metadata
+      this.entries = builder.entries;
+      this.tableName = builder.selectionRoot != null ? builder.selectionRoot.toUri().getPath() : "";
+      this.tableLocation = builder.selectionRoot;
+    } else if (builder.selection != null) {
+      this.entries = new ArrayList<>();
+      this.tableName = builder.selection.getSelectionRoot() != null ? builder.selection.getSelectionRoot().toUri().getPath() : "";
+      this.tableLocation = builder.selection.getSelectionRoot();
+    } else {
+      // case of hive parquet table
+      this.entries = new ArrayList<>();
+      this.tableName = null;
+      this.tableLocation = null;
+    }
 
-  public BaseParquetMetadataProvider(ParquetReaderConfig readerConfig,
-                                     List<ReadEntryWithPath> entries,
-                                     String tableName,
-                                     Path tableLocation,
-                                     TupleMetadata schema,
-                                     DrillStatsTable statsTable) {
-    this.entries = entries == null ? new ArrayList<>() : entries;
-    this.readerConfig = readerConfig == null ? ParquetReaderConfig.getDefaultInstance() : readerConfig;
-    this.tableName = tableName;
-    this.tableLocation = tableLocation;
-    this.schema = schema;
-    this.statsTable = statsTable;
-  }
+    SchemaProvider schemaProvider = builder.metadataProviderManager.getSchemaProvider();
+    TupleMetadata schema = builder.schema;
+    // schema passed into the builder has greater priority
+    if (schema == null && schemaProvider != null) {
+      try {
+        schema = schemaProvider.read().getSchema();
 
 Review comment:
   Please explain a bit more. At present, the `EasySubScan` has a `schema` 
attribute that contains a provided schema. Does the code here choose which 
schema appears in that field? Or, does the code here change that value?
   
   What I'm trying to understand is this: we now have three ways to specify a 
schema: 1) Metastore, 2) Provided schema file, 3) Table function. (Just 
reviewed the documentation PR that describes this - very helpful.) What is the 
precedence? I'd think table functions come first. But does that schema have to 
agree with the other schemas? Does the planner check?
   
   Can I have both a provided and metadata schema? If I have a provided schema, 
will the metadata system (`ANALYZE TABLE`) use it?
   
   How do those precedence rules affect the builder precedence here?
   
   I'm working on cleaning up scan-time schema usage, so I need to understand 
how you want all this to work.
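   
   To make the question concrete, here is a minimal sketch of the precedence 
the diff above appears to implement: a schema passed into the builder wins, and 
the schema provider is consulted only when the builder schema is null. This is 
not Drill code. The class `SchemaPrecedenceSketch` and the method 
`resolveSchema` are hypothetical names, and schemas are modeled as plain 
strings instead of `TupleMetadata`. It also says nothing about where a 
Metastore schema or a table-function schema fits in, which is the open question.

```java
import java.util.function.Supplier;

class SchemaPrecedenceSketch {

  // Hypothetical stand-in for the constructor logic in the diff.
  // Schemas are plain strings here purely for illustration.
  static String resolveSchema(String builderSchema, Supplier<String> schemaProvider) {
    // A schema passed into the builder has greater priority.
    if (builderSchema != null) {
      return builderSchema;
    }
    // Otherwise fall back to the schema provider, when one is configured.
    if (schemaProvider != null) {
      return schemaProvider.get();
    }
    // Neither source is available.
    return null;
  }

  public static void main(String[] args) {
    System.out.println(resolveSchema("builder-schema", () -> "provider-schema"));
    System.out.println(resolveSchema(null, () -> "provider-schema"));
    System.out.println(resolveSchema(null, null));
  }
}
```

   If the intended rule differs (for example, if the two schemas should be 
merged or validated against each other instead of one replacing the other), 
the first branch is where that would change.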

