Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 10: (9 comments) http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@301 PS10, Line 301: std::vector<std::string> min_values_; > Options 1 or 2 seem fine to me. Chose option 1, but also added a memtracker. http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@735 PS10, Line 735: min_values_.push_back(std::string("")); > I don't know if we need the call to std::string() here, I think it should w Right. Now the code looks a bit different, and I call push_back after the 'if'. http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/hdfs-parquet-table-writer.cc@1227 PS10, Line 1227: for (auto& column : columns_) { > nit: can fit loop on one line. Done http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: http://gerrit.cloudera.org:8080/#/c/9693/10/be/src/exec/parquet-column-stats.h@159 PS10, Line 159: // If true, min/max values are ascending. > Maybe briefly mention why they both start off true? And both can be true at Added some explanation here and to GetBoundaryOrder(). http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@37 PS10, Line 37: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite): > We've got a lot of good coverage in this test. I agree. It would be best to have a fully-fledged parquet reader implementation in Python that we could easily use. I only found https://pypi.org/project/parquet/ but it seems uncomplete. Or, maybe we can hard-code the expected values for a concrete parquet file. http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@177 PS10, Line 177: previouse_value > typo in variable name Done http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205 PS10, Line 205: falied > nit: failed Done http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@205 PS10, Line 205: column_info_schema > this variable isn't defined - did you mean column_info? Done http://gerrit.cloudera.org:8080/#/c/9693/10/tests/query_test/test_parquet_page_index.py@244 PS10, Line 244: chars_formats > chars_formats is weird in that it's created by a different test - TestChars I moved table creation to 'testdata/bin/load-dependent-tables.sql'. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Mon, 23 Apr 2018 15:50:20 +0000 Gerrit-HasComments: Yes