Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/976#discussion_r144074677
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
---
@@ -156,20 +161,46 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
return new ScanBatch(rowGroupScan, context, oContext, readers, implicitColumns);
}
- private static boolean isComplex(ParquetMetadata footer) {
- MessageType schema = footer.getFileMetaData().getSchema();
-
- for (Type type : schema.getFields()) {
- if (!type.isPrimitive()) {
- return true;
+ private static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
+ /*
+ ParquetRecordReader cannot read nested columns and cannot handle repeated columns;
+ it handles only flat and optional columns.
+ For a wildcard query, we check every column in the metadata.
+ Otherwise, we check only the projected columns.
+ */
--- End diff --
Very small request: this is a great Javadoc comment, so please use this form:
```
/**
* Your comment here.
*/
```
It may also be worth pointing out that the algorithm here works regardless of the form of the column:
* `a`: We must consider the type of column `a` in Parquet.
* `a.b`: The top-level column `a` must be a structure in Parquet. (If not, then presumably an error is thrown later on.) So there is no need to check `b`.
* `a[10]`: The column `a` must be an array (repeated), so there is no need to check anything beyond the root of the `SchemaPath`. Again, presumably, Drill will throw an error internally if it turns out that `a` is not an array.
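To illustrate, here is a minimal sketch of that root-segment idea in plain Java. It is not Drill code: the `Map<String, Boolean>` stands in for the Parquet `MessageType` (mapping each top-level field name to whether it is a group or repeated type), a `null` projection list stands in for a wildcard query, and the string split stands in for `SchemaPath.getRootSegment()`.

```java
import java.util.List;
import java.util.Map;

public class ComplexCheckSketch {

  /**
   * Returns true if any column the query touches is complex.
   *
   * @param topLevelIsComplex hypothetical stand-in for the Parquet schema:
   *                          top-level field name -> is it a group/repeated type?
   * @param projectedPaths    projected column paths such as "a", "a.b", "a[10]";
   *                          null means a wildcard query
   */
  static boolean isComplex(Map<String, Boolean> topLevelIsComplex,
                           List<String> projectedPaths) {
    if (projectedPaths == null) {
      // Wildcard query: every top-level column in the metadata must be checked.
      return topLevelIsComplex.containsValue(Boolean.TRUE);
    }
    for (String path : projectedPaths) {
      // Only the root segment matters: for "a.b" or "a[10]" the parent "a"
      // must already be a structure or array, so checking "a" suffices.
      String root = path.split("[.\\[]")[0];
      if (Boolean.TRUE.equals(topLevelIsComplex.get(root))) {
        return true;
      }
    }
    return false;
  }
}
```

The point of the sketch is only that no path segment past the root ever needs to be inspected; mismatches deeper in the path surface as errors later, as noted above.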
---