Lars Volker has posted comments on this change. Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer ......................................................................
Patch Set 6: (13 comments) Thanks for the review. Please see PS7. http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: PS6, Line 136: < > <=? Done Line 145: virtual void Reset() { > Do we need to reset the row group stats at this point? Yes, good catch. In PS5 this was done in ColumnWriter::Reset() and I added it there again. PS6, Line 178: ProcessValue > Maybe AppendValue() or something like that? Process is a little vague. Marcel had suggested that name, but I'm good with either. Marcel, do you have a strong preference? Line 389: bool ret = bool_values_->PutValue(v, 1); > Consider something like: Done, though it has the same number of lines, but now uses two return statements. Is there a strong reason to prefer one over the other? PS6, Line 644: < > <=? Done PS6, Line 648: column > row group Done http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/hdfs-parquet-table-writer.h File be/src/exec/hdfs-parquet-table-writer.h: Line 105: static const int MAX_COLUMN_STATS_SIZE = 4 * 1024; > Mention that this value came from parquet-mr? Done http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: PS6, Line 125: min > Others might have different opinions but I think I'd prefer making this eve For this change I would think it is ok to handle min/max implicitly, since we only add support for numerical types. When adding more complex types we could add a comparator method to ColumnStats<T> that does the comparison in a type specific way for all types where we want to have a more explicit implementation. I'm also good with doing that right now, but I added a TODO first to get additional feedback from other reviewers first. Line 139: out->__set_min(min_str); > It probably doesn't really matter, but I think using move() here would a co Done PS6, Line 150: ((uint8_t*) > reinterpret_cast Done Line 183: /// parquet-mr and subsequently Hive currently do not handle the following types > We should file an Impala JIRA to keep track of this too I created separate JIRAs and added them to the comment. http://gerrit.cloudera.org:8080/#/c/5611/6/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 208: class TestHdfsParquetTableStatsWriter(ImpalaTestSuite): > It would be good to have some kind of test for multiple row groups so that Good idea! As you suspect, it is hard to deterministically write row groups across multiple files. I added a test using the new sortby() hint. Line 309: (0, 7299), > It would be nice to test a table with a negative number to make sure that t Done. We don't have tables with negative numbers in them in 'functional' (TIL), so I used a view. -- To view, visit http://gerrit.cloudera.org:8080/5611 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Michael Brown <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Ivanfi <[email protected]> Gerrit-HasComments: Yes
