Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................

Patch Set 10:

(9 comments)

Overall this is looking good. I had some specific concerns about some of the nitty-gritty details.

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301
PS10, Line 301: std::vector<std::string> min_values_;
I'm still concerned about the amount of untracked memory from min_values_ and max_values_. Even if we truncate the string values to 1KB or similar, it seems like we could end up with multiple MB of untracked memory. We could probably live with it since it's smaller than the actual data, but it's a step in the wrong direction.

Maybe we could store min_values_ and max_values_ as StringValues backed by memory from per_file_mem_pool_ and then only convert to strings when writing out each column to the page index?

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@735
PS10, Line 735: min_values_.push_back(std::string(""));
I don't think we need the call to std::string() here. It should work if we just call emplace_back() with no arguments to instantiate an empty string.

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@1227
PS10, Line 1227: for (auto& column : columns_) {
nit: can fit loop on one line.

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h@159
PS10, Line 159: // If true, min/max values are ascending.
Maybe briefly mention why they both start off true, and that both can be true at the same time? It's slightly subtle.
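To make the subtlety around the two ordering flags concrete, here is a minimal sketch. It assumes a hypothetical OrderingTracker struct over int min/max values; it is not the actual code in parquet-column-stats.h, and the names are made up for illustration. The idea is that each flag is an "as far as we have seen" invariant: it starts true and can only ever flip to false.

```cpp
#include <cassert>

// Hypothetical illustration only; not Impala's ColumnStats implementation.
struct OrderingTracker {
  // Both flags start true: with zero or one page there is no evidence against
  // either ordering, so the sequence is trivially ascending and descending.
  bool ascending = true;
  bool descending = true;

  bool has_prev = false;
  int prev_min = 0;
  int prev_max = 0;

  // Record the min/max of the next page and update the flags.
  void AddPage(int page_min, int page_max) {
    if (has_prev) {
      // Any decrease in min or max rules out ascending order.
      if (page_min < prev_min || page_max < prev_max) ascending = false;
      // Any increase in min or max rules out descending order.
      if (page_min > prev_min || page_max > prev_max) descending = false;
    }
    prev_min = page_min;
    prev_max = page_max;
    has_prev = true;
  }
};
```

With this sketch, pages that all share the same min/max leave both flags true, a single increase clears only descending, and a later decrease clears ascending as well. A comment along those lines next to the members would probably be enough.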
http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@37
PS10, Line 37: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite):
We've got a lot of good coverage in this test. I'm wondering if we're missing some basic tests that confirm that the values in the page match the min/max values in the page index. It seems like these validations might not catch some kinds of bugs. E.g. min/max values in the index are somehow out-of-sync with the pages.

Most bugs that I can imagine would get caught by one validation or another but it would be nice to have a sanity test where we confirm that the values in each page match the values in the page index.

http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@177
PS10, Line 177: previouse_value
typo in variable name

http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205
PS10, Line 205: falied
nit: failed

http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205
PS10, Line 205: column_info_schema
this variable isn't defined - did you mean column_info?

http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@244
PS10, Line 244: chars_formats
chars_formats is weird in that it's created by a different test - TestCharsFormats. I.e. it's not present unless that test ran before this one. Maybe we should change it so that the table is loaded during normal data loading?
--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 12 Apr 2018 00:23:08 +0000
Gerrit-HasComments: Yes
