[
https://issues.apache.org/jira/browse/DRILL-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194996#comment-16194996
]
ASF GitHub Bot commented on DRILL-5797:
---------------------------------------
Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/976#discussion_r143264522
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
---
@@ -156,18 +160,39 @@ public ScanBatch getBatch(FragmentContext context,
ParquetRowGroupScan rowGroupS
return new ScanBatch(rowGroupScan, context, oContext, readers,
implicitColumns);
}
- private static boolean isComplex(ParquetMetadata footer) {
- MessageType schema = footer.getFileMetaData().getSchema();
+ private static boolean isComplex(ParquetMetadata footer,
List<SchemaPath> columns) {
+ if (Utilities.isStarQuery(columns)) {
+ MessageType schema = footer.getFileMetaData().getSchema();
- for (Type type : schema.getFields()) {
- if (!type.isPrimitive()) {
- return true;
+ for (Type type : schema.getFields()) {
+ if (!type.isPrimitive()) {
+ return true;
+ }
}
- }
- for (ColumnDescriptor col : schema.getColumns()) {
- if (col.getMaxRepetitionLevel() > 0) {
- return true;
+ for (ColumnDescriptor col : schema.getColumns()) {
+ if (col.getMaxRepetitionLevel() > 0) {
+ return true;
+ }
+ }
+ return false;
+ } else {
+ for (SchemaPath column : columns) {
+ if (isColumnComplex(footer.getFileMetaData().getSchema(), column))
{
+ return true;
+ }
}
+ return false;
+ }
+ }
+
+ private static boolean isColumnComplex(GroupType grouptype, SchemaPath
column) {
+ PathSegment.NameSegment root = column.getRootSegment();
+ if (!grouptype.containsField(root.getPath().toLowerCase())) {
+ return false;
+ }
+ Type type = grouptype.getType(root.getPath().toLowerCase());
--- End diff --
What are the semantics of `getType()`? Does it return null if there is no
such type? If so, then we can retrieve the type and check if it is null.
Otherwise, if it throws an exception, perhaps we can catch that. Either way, we
avoid two searches for the same column and two conversions of the path to lower
case.
Note also that a recent change deprecated `getPath()`. The preferred form
is `getName()` so it is clear that we are getting a single name part. If the
Parquet column is nested (a.b.c, say), then we have to explicitly handle the
multiple name parts as Drill does support names with dots and one cannot simply
concatenate or split a path to get a string. That is, `["a.b", "c"]` is two
fields, `["a", "b", "c"]` is three, but both are represented as a full path as
"a.b.c", creating an ambiguity.
> Use more often the new parquet reader
> -------------------------------------
>
> Key: DRILL-5797
> URL: https://issues.apache.org/jira/browse/DRILL-5797
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Reporter: Damien Profeta
> Assignee: Damien Profeta
> Fix For: 1.12.0
>
>
> The choice of using the regular parquet reader of the optimized one is based
> of what type of columns is in the file. But the columns that are read by the
> query doesn't matter. We can increase a little bit the cases where the
> optimized reader is used by checking is the projected column are simple or
> not.
> This is an optimization waiting for the fast parquet reader to handle complex
> structure.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)