sachouche opened a new pull request #1414: DRILL-6101: Optimized implicit columns handling within scanner
URL: https://github.com/apache/drill/pull/1414

Problem Description
- File-based implicit columns are projected only if explicitly requested within the query.
- Note that Partition Columns are not included in this discussion (it refers only to FILENAME, FILEPATH, FQN, and SUFFIX).
- The scanner operator is called with three sets of columns to handle: Table Columns, Partition Columns, and Implicit Columns.
- When a SELECT_STAR is used, the operator doesn't receive the original query selection (only '**' is received).
- This behavior forces the Scanner operator to project all file-based Implicit Columns, only for these to be filtered out later on by the Project Operator.
- Performance tests indicate that this behavior introduces a 30% degradation within the scanning phase for some TPCH queries (the degradation is larger for tables with long paths).

Fix
- Noticed that the code uses a utility to figure out whether a selection is a STAR_QUERY; this utility expects a list of columns and attempts to detect the presence of the STAR selection keyword.
- Modified the code to include all selection columns (including the ones in the WHERE clause).
- This allows the execution layer to invoke the Scan operator with the correct implicit columns (the ones explicitly listed within the query), thus addressing the performance issue.
- Note that readers are not impacted by the newly added metadata, as the reader code doesn't use the columns list when a STAR_QUERY is involved.
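
The idea behind the fix can be illustrated with a minimal sketch. It is not the code from this pull request: the class `ImplicitColumnSketch` and all of its methods are hypothetical names, and the "before" and "after" behaviors are assumptions drawn only from the description above. It shows a star-query check that still succeeds when extra columns travel alongside `**`, and a selection step that projects only the implicit columns named in the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are NOT Drill's classes or methods.
final class ImplicitColumnSketch {
    // Star token the scanner receives, as quoted in the description.
    static final String STAR = "**";

    // The file-based implicit columns named in the description.
    static final List<String> FILE_IMPLICIT_COLUMNS =
        Arrays.asList("FILENAME", "FILEPATH", "FQN", "SUFFIX");

    // Detects a star query by looking for the star token in the column list.
    static boolean isStarQuery(List<String> columns) {
        return columns.contains(STAR);
    }

    // Assumed old behavior: for a star query, every implicit column is projected.
    static List<String> implicitColumnsBeforeFix(List<String> columns) {
        if (isStarQuery(columns)) {
            return new ArrayList<>(FILE_IMPLICIT_COLUMNS);
        }
        return implicitColumnsAfterFix(columns);
    }

    // Assumed new behavior: project only the implicit columns named in the list.
    static List<String> implicitColumnsAfterFix(List<String> columns) {
        List<String> requested = new ArrayList<>();
        for (String implicit : FILE_IMPLICIT_COLUMNS) {
            for (String column : columns) {
                if (implicit.equalsIgnoreCase(column)) {
                    requested.add(implicit);
                    break;
                }
            }
        }
        return requested;
    }

    public static void main(String[] args) {
        // Column list as it might look once explicit columns accompany the star token.
        List<String> columns = Arrays.asList("**", "filename");
        System.out.println(implicitColumnsBeforeFix(columns));
        System.out.println(implicitColumnsAfterFix(columns));
    }
}
```

With the column list `["**", "filename"]`, the assumed old path returns all four implicit columns, while the assumed new path returns only `FILENAME`. The star check is unaffected by the extra entry, which matches the note that readers ignore the column list for a STAR_QUERY.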

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
