Lars Volker has posted comments on this change. Change subject: IMPALA-4815, IMPALA-4817, IMPALA-4819: Populate Parquet Statistics for remaining types ......................................................................
Patch Set 6: (22 comments) Thank you for your review. Please see PS6 and my inline comments. http://gerrit.cloudera.org:8080/#/c/6563/4/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 375: scoped_ptr<ColumnStats<T>> page_stats_; > why this change? The stats now need to have their memory pool reset in Reset() (L297), so I followed the surrounding code (DictEncoder). Should we instead keep the objects and pass the memory pool to page_stats_.Reset()? http://gerrit.cloudera.org:8080/#/c/6563/4/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 34: /// decode parquet::Statistics into slots. > update Done Line 60: /// Enum to select whether to read minimum or maximum statistics. Values do not > "do not" Done Line 73: const ColumnType& col_type, const parquet::ColumnOrder* col_order, > const& isn't necessary here Done Line 82: /// batch, since the batch memory will be released. Overwrite this method in derived > is this useful for anything other than strings/var-len data? if not, make n Done Line 83: /// classes to provide the functionality. > ) {} git-clang-format does this. If you feel strongly about it, I'll add a TODO for myself to add the space before submitting the change. Line 105: /// 'col_order'. Otherwise, returns false. > CanUse-? or should as in 'would be preferable'/'there's a moral imperative' Done Line 108: > i had to remind myself again what T was. would be useful to point out in th Done Line 127: || std::is_same<bool, T>::value > describe params Done Line 135: public: > you should point out that it may keep a reference to 'v' until Materialize. Done Line 144: min_buffer_(mem_pool), > this doesn't need to be part of the public interface, columstatsbase can be Done Line 152: virtual void Merge(const ColumnStatsBase& other) override; > static Done. Making it static required passing an additional parameter. Line 154: > you don't mention here that this is plain-encoded. Done Line 158: protected: > why -internal? Done Line 161: static void EncodePlainValue(const T& v, int64_t bytes_needed, std::string* out); > why -internal? Done http://gerrit.cloudera.org:8080/#/c/6563/4/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: Line 28: inline void ColumnStats<T>::Update(const T& v) { > again, you changed the meaning of a variable but not the name. Done Line 30: has_values_ = true; > this whole function does almost nothing. is it worth having it (extra level Done Line 189: } > why do you need T here? It seemed cleaner to me to define it for all types. Removing T required moving the declaration before its first usage. http://gerrit.cloudera.org:8080/#/c/6563/4/common/thrift/parquet.thrift File common/thrift/parquet.thrift: Line 563: 2: required i64 total_byte_size > this has now changed, please take a look at Done http://gerrit.cloudera.org:8080/#/c/6563/4/testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test File testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test: Line 138: select count(*) from deprecated_stats where int_col < 0 and timestamp_col != "2009-02-01 00:00:00" > stray line? Done http://gerrit.cloudera.org:8080/#/c/6563/4/testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test File testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test: Line 290: select count(*) from functional_parquet.alltypessmall where string_col < "1" > use a predicate that will result in filter? (so we can see it also works wi I'm not sure I understand your comment. The test in L274 filters all rows using "<". Can you post the query you have in mind? http://gerrit.cloudera.org:8080/#/c/6563/4/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 60: # 'timetuple' will the datetime __eq__ function forward an unknown equality check to > 'timetuple' will the datetime __eq__ function forward... Done -- To view, visit http://gerrit.cloudera.org:8080/6563 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
