Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )
Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types ...................................................................... Patch Set 3: (7 comments) http://gerrit.cloudera.org:8080/#/c/17262/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17262/3//COMMIT_MSG@16 PS3, Line 16: TBL_PROPS TBL_PROPS will configure bloom filters per column, right? http://gerrit.cloudera.org:8080/#/c/17262/3//COMMIT_MSG@16 PS3, Line 16: TBL_PROPS - write Parquet Bloom filters as set in table properties : IF_NO_DICT - write Parquet Bloom filters if the row group is not : fully dictionary encoded What is the relation of TBL_PROPS and IF_NO_DICT? E.g. with TBL_PROPS will you write bloom filter even if the column dictionary encoded? http://gerrit.cloudera.org:8080/#/c/17262/1/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/17262/1/be/src/exec/parquet/hdfs-parquet-table-writer.cc@605 PS1, Line 605: // The ParquetBloomFilter object if one is being written. If : // 'ShouldInitParquetBloomFilter()' is false, the combination of the impala type and the : // parquet type is not supported or some error occurs during the initialisation of the : // ParquetBloomFilter object, it is set to NULL. : unique_ptr<ParquetB > I find using dst_ptr a bit risky: I like the solution, but I disagree here: "in case of int8 and int16 we need to pad them to int32 because parquet doesn't support smaller ints, so in these cases we cannot use the buffer, adding even more special cases." Why can't we use the buffer? dst_ptr points to the plain encoded page buffer, where the i16/i8 are already converted to i32 http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@493 PS3, Line 493: if (parent_ I would prefer to move this to a separate function like FlushDictionaryToBloomFilterIfNeeded() ProcessValue() is a critical function when trying to understand how Impala writes Parquet files, so I think that we should try to minimize noise. http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@542 PS3, Line 542: if (ShouldUpdateParquetBloomFilter()) { I would prefer to move this to a separate function like UpdateBloomFilterIfNeeded() http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@597 PS3, Line 597: whwn typo http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/parquet-bloom-filter-util.cc File be/src/exec/parquet/parquet-bloom-filter-util.cc: http://gerrit.cloudera.org:8080/#/c/17262/3/be/src/exec/parquet/parquet-bloom-filter-util.cc@220 PS3, Line 220: { nit: brace placement -- To view, visit http://gerrit.cloudera.org:8080/17262 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792 Gerrit-Change-Number: 17262 Gerrit-PatchSet: 3 Gerrit-Owner: Daniel Becker <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Comment-Date: Wed, 14 Apr 2021 14:03:25 +0000 Gerrit-HasComments: Yes
