Jinfeng Ni created DRILL-684:
--------------------------------

             Summary: Use parquet row count in cost-based optimization. Use parquet row count, column value count to optimize count() aggregate function.
                 Key: DRILL-684
                 URL: https://issues.apache.org/jira/browse/DRILL-684
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Jinfeng Ni
            Assignee: Jinfeng Ni
         Attachments: DRILL-684.1.patch

Parquet group scan provides the exact row count and the exact value count for each individual column. This information could be leveraged in two ways:

1. Use the row count in cost estimation when a query refers to parquet files.
2. Use the row count or the column value count to optimize the count() aggregate function.

For instance,

select count(*) from parquet_file;
select count(column_a) from parquet_file;

The first query could be transformed to return the row count directly, and the second could return the column value count for 'column_a'. In both cases the query avoids scanning the whole parquet files, which improves query performance.
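
The count() rewrite described in this issue can be illustrated with a minimal sketch. It is not the code in DRILL-684.1.patch and does not use Drill's API: the classes ColumnStats and RowGroupMeta and the functions count_star and count_column are hypothetical names introduced here. The sketch also assumes that the stored value count includes nulls, so it subtracts a null count to match the SQL meaning of count(column), which counts only non-null values; if the column value count exposed by the group scan already excludes nulls, the subtraction would not be needed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ColumnStats:
    """Hypothetical per-column statistics for one row group."""
    num_values: int   # values stored for the column (assumed to include nulls)
    null_count: int   # how many of those values are null


@dataclass
class RowGroupMeta:
    """Hypothetical summary of the metadata kept for one Parquet row group."""
    num_rows: int
    columns: Dict[str, ColumnStats] = field(default_factory=dict)


def count_star(row_groups: List[RowGroupMeta]) -> int:
    """Answer count(*) by summing row counts; no column data is read."""
    return sum(rg.num_rows for rg in row_groups)


def count_column(row_groups: List[RowGroupMeta], column: str) -> int:
    """Answer count(column) as the number of non-null values in that column."""
    return sum(
        rg.columns[column].num_values - rg.columns[column].null_count
        for rg in row_groups
    )


# Two row groups of an imaginary parquet_file.
row_groups = [
    RowGroupMeta(1000, {"column_a": ColumnStats(1000, 25)}),
    RowGroupMeta(500, {"column_a": ColumnStats(500, 0)}),
]

print(count_star(row_groups))                 # 1500
print(count_column(row_groups, "column_a"))   # 1475
```

A planner rule built on this idea would presumably fall back to a regular scan whenever the needed statistic is missing for any row group, since a partial sum would give a wrong answer.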