Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/976#discussion_r143264522
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
---
@@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context,
ParquetRowGroupScan rowGroupS
return new ScanBatch(rowGroupScan, context, oContext, readers,
implicitColumns);
}
- private static boolean isComplex(ParquetMetadata footer) {
- MessageType schema = footer.getFileMetaData().getSchema();
+ private static boolean isComplex(ParquetMetadata footer,
List<SchemaPath> columns) {
+ if (Utilities.isStarQuery(columns)) {
+ MessageType schema = footer.getFileMetaData().getSchema();
- for (Type type : schema.getFields()) {
- if (!type.isPrimitive()) {
- return true;
+ for (Type type : schema.getFields()) {
+ if (!type.isPrimitive()) {
+ return true;
+ }
}
- }
- for (ColumnDescriptor col : schema.getColumns()) {
- if (col.getMaxRepetitionLevel() > 0) {
- return true;
+ for (ColumnDescriptor col : schema.getColumns()) {
+ if (col.getMaxRepetitionLevel() > 0) {
+ return true;
+ }
+ }
+ return false;
+ } else {
+ for (SchemaPath column : columns) {
+ if (isColumnComplex(footer.getFileMetaData().getSchema(), column))
{
+ return true;
+ }
}
+ return false;
+ }
+ }
+
+ private static boolean isColumnComplex(GroupType grouptype, SchemaPath
column) {
+ PathSegment.NameSegment root = column.getRootSegment();
+ if (!grouptype.containsField(root.getPath().toLowerCase())) {
+ return false;
+ }
+ Type type = grouptype.getType(root.getPath().toLowerCase());
--- End diff --
What are the semantics of `getType()`? Does it return null if there is no
such type? If so, then we can retrieve the type and check if it is null.
Otherwise, if it throws an exception, perhaps we can catch that. Either way, we
avoid two searches for the same column and two conversions of the path to lower
case.
Note also that a recent change deprecated `getPath()`. The preferred form
is `getName()` so it is clear that we are getting a single name part. If the
Parquet column is nested (a.b.c, say), then we have to explicitly handle the
multiple name parts as Drill does support names with dots and one cannot simply
concatenate or split a path to get a string. That is, `["a.b", "c"]` is two
fields, `["a", "b", "c"]` is three, but both are represented as a full path as
"a.b.c", creating an ambiguity.
---