Pooja Nilangekar has uploaded a new change for review. http://gerrit.cloudera.org:8080/7465
Change subject: Populate OffsetIndex and ColumnIndex of a row_group and Filter pages
......................................................................

Populate OffsetIndex and ColumnIndex of a row_group and Filter pages

The statistics for each page in a ColumnChunk of a RowGroup are added
to the ColumnIndex structure. When a page is flushed to the file, its
location and the offset of its first row are added to the PageLocation
structure of the OffsetIndex. If a file is found to have only one
row_group when it is finalized, the ColumnIndex for each column is
written to the file (just before the footer) and its length and offset
are recorded in the ColumnChunk. The OffsetIndexes of all the columns
in the row_group are collected in the RowGroupOffsetIndex structure and
written out to the file. The offset and length of that index are
recorded in the RowGroup.

This allows range scans and point lookups to skip pages based on these
statistics, while scans without selective predicates incur no overhead.
Space efficiency is ensured by not populating parquet::Statistics in
the ColumnMeta when the statistics are written to the ColumnIndex.
Additionally, for ordered columns, the ColumnIndex contains only the
min_values.

While scanning a RowGroup, the HdfsParquetScanner invokes the
ParquetIndexFilter for the RowGroups where the indexes are present. The
filter evaluates each conjunct against each page of the corresponding
column. It consolidates the RowRanges for the given RowGroup and
returns the final set of pages to be scanned for each column.

Testing: The populated index structures were deserialized from the
parquet file, and the validity of the offsets and statistics was
verified. The filtered index ranges were verified manually by ensuring
that the filtered ranges always evaluate the min/max conjuncts to
true.

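To illustrate the page filtering described above, here is a minimal
sketch in Python. It is not the code in parquet-index-filter.cc. The
names PageStats, page_ranges, intersect and pages_to_scan are
hypothetical, values are plain integers, and each conjunct is assumed
to be a simple closed range lo <= x <= hi. Per column, pages whose
[min, max] cannot satisfy the conjunct are dropped; the surviving row
ranges are then intersected across columns and mapped back to pages.

```python
from dataclasses import dataclass
from typing import List, Tuple

RowRange = Tuple[int, int]  # half-open [start, end) in row-group row indexes


@dataclass
class PageStats:
    # Hypothetical stand-in for one page's index entries.
    first_row: int  # index of the page's first row within the row group
    min_value: int
    max_value: int


def page_end(pages: List[PageStats], i: int, num_rows: int) -> int:
    """A page ends where the next page begins, or at the end of the row group."""
    return pages[i + 1].first_row if i + 1 < len(pages) else num_rows


def page_ranges(pages: List[PageStats], num_rows: int, lo, hi) -> List[RowRange]:
    """Row ranges of pages whose [min, max] overlaps the conjunct lo <= x <= hi."""
    ranges: List[RowRange] = []
    for i, p in enumerate(pages):
        if p.max_value >= lo and p.min_value <= hi:
            end = page_end(pages, i, num_rows)
            if ranges and ranges[-1][1] == p.first_row:
                ranges[-1] = (ranges[-1][0], end)  # merge with adjacent range
            else:
                ranges.append((p.first_row, end))
    return ranges


def intersect(a: List[RowRange], b: List[RowRange]) -> List[RowRange]:
    """Intersect two sorted, non-overlapping lists of row ranges."""
    out: List[RowRange] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever range finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def pages_to_scan(pages: List[PageStats], num_rows: int,
                  ranges: List[RowRange]) -> List[int]:
    """Indexes of the pages that overlap at least one surviving row range."""
    keep = []
    for i, p in enumerate(pages):
        end = page_end(pages, i, num_rows)
        if any(p.first_row < r_end and r_start < end for r_start, r_end in ranges):
            keep.append(i)
    return keep


# Example: a row group of 100 rows with two columns, a and b.
col_a = [PageStats(0, 1, 10), PageStats(40, 11, 20), PageStats(80, 21, 30)]
col_b = [PageStats(0, 5, 50), PageStats(50, 51, 90)]
rows = intersect(page_ranges(col_a, 100, 12, 15),             # 12 <= a <= 15
                 page_ranges(col_b, 100, 60, float("inf")))   # b >= 60
print(rows, pages_to_scan(col_a, 100, rows), pages_to_scan(col_b, 100, rows))
```

In this example only the second page of each column overlaps the
surviving rows, so the first page of both columns and the last page of
column a can be skipped.
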
Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
A be/src/exec/parquet-index-filter.cc
A be/src/exec/parquet-index-filter.h
M bin/impala-config.sh
M common/thrift/parquet.thrift
M tests/query_test/test_insert_parquet.py
M tests/util/get_parquet_index.py
12 files changed, 589 insertions(+), 69 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/7465/2
--
To view, visit http://gerrit.cloudera.org:8080/7465
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Idace1e57067f95973cef3567eeb84f2ad87fd3f6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
