sachouche opened a new pull request #1414: DRILL-6101: Optimized implicit 
columns handling within scanner
URL: https://github.com/apache/drill/pull/1414
 
 
   Problem Description -
   
   File-based implicit columns are projected only if explicitly requested 
within the query
   Note that Partition Columns are not included in this discussion (it refers 
only to FILENAME, FILEPATH, FQN, and SUFFIX)
   The scanner operator is called with three sets of columns to handle: Table 
Columns, Partition Columns, and Implicit Columns
   When a SELECT_STAR is used, the operator doesn't receive the original query 
selection (only '**' is received)
   This behavior forces the Scanner operator to project all file-based 
Implicit Columns, only for them to be filtered out later by the Project 
Operator
   Performance tests indicate this behavior introduces a 30% degradation 
within the scanning phase for some TPCH queries (the degradation is larger for 
tables with long paths)
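
   To make the cost concrete, here is a minimal Python sketch of the behavior described above. It is an illustrative model, not Drill code: the function names, the row representation, and the example path are all hypothetical, and the way the four values are derived from the path is an assumption based on the column names.

```python
import posixpath

# The four file-based implicit columns named in the description.
IMPLICIT_COLUMNS = ("FQN", "FILEPATH", "FILENAME", "SUFFIX")


def implicit_values(file_path):
    """Derive the implicit column values for one file (assumed semantics)."""
    directory, name = posixpath.split(file_path)
    suffix = posixpath.splitext(name)[1].lstrip(".")
    return {
        "FQN": file_path,       # full path of the file
        "FILEPATH": directory,  # directory that contains the file
        "FILENAME": name,       # file name including the extension
        "SUFFIX": suffix,       # extension without the leading dot
    }


def scan_with_all_implicit_columns(file_path, rows):
    """Model a scan that only saw '**': every row carries all four values."""
    extras = implicit_values(file_path)
    return [{**row, **extras} for row in rows]


def project(rows, wanted):
    """Model the Project operator dropping columns the query never asked for."""
    return [{name: row[name] for name in wanted} for row in rows]


# Hypothetical file and rows: the path strings are copied onto every row
# and then thrown away, which is why long paths make the overhead larger.
path = "/data/tpch/lineitem/part-0.parquet"
scanned = scan_with_all_implicit_columns(path, [{"l_orderkey": 1}, {"l_orderkey": 2}])
result = project(scanned, ["l_orderkey"])
```

   In this model each scanned row holds five columns, yet only one survives the projection; the four path-derived strings are pure overhead.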
   Fix -
   
   The code uses a utility to determine whether a selection is a 
STAR_QUERY; this utility expects a list of columns and checks for the 
presence of the STAR selection keyword
   Modified the code to include all selection columns (including the ones in 
the WHERE clause)
   This allows the execution layer to invoke the Scan operator with the 
correct implicit columns (the ones explicitly listed within the query), thus 
addressing this performance issue
   Note that readers are not impacted by the newly added metadata, as the 
reader code doesn't use the columns list when a STAR_QUERY is involved
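
   The change can be sketched as follows, again in Python as an illustrative model rather than the actual patch. The function names are hypothetical, and treating column names as case-insensitive is an assumption; only the '**' token and the four column names come from the description above.

```python
STAR = "**"
IMPLICIT_COLUMNS = ("FQN", "FILEPATH", "FILENAME", "SUFFIX")


def is_star_query(columns):
    """Detect the STAR selection keyword in a list of column names."""
    return any(column == STAR for column in columns)


def explicitly_requested(columns):
    """Return the implicit columns that appear by name in the list."""
    requested = {column.upper() for column in columns}
    return [name for name in IMPLICIT_COLUMNS if name in requested]


def implicit_columns_before_fix(columns):
    """Old behavior: the list is just ['**'], so all four must be projected."""
    if is_star_query(columns):
        return list(IMPLICIT_COLUMNS)
    return explicitly_requested(columns)


def implicit_columns_after_fix(columns):
    """New behavior: the list also carries the explicitly referenced columns,
    so only those implicit columns are projected."""
    return explicitly_requested(columns)


# A star query that also references one implicit column, for example in
# its WHERE clause (hypothetical query shape).
before = implicit_columns_before_fix(["**"])
after = implicit_columns_after_fix(["**", "filename"])
```

   With the fuller column list the selection is still recognized as a star query, but the scan projects one implicit column instead of four, and none at all when the query references no implicit column.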
