Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/11575 )
Change subject: IMPALA-6964: Track stats about column and page sizes in Parquet reader ...................................................................... Patch Set 6: (18 comments) A few high level changes: * After I found the race condition in IMPALA-7816, I realized it would affect this patch as well. So I moved the updates of the bytes_read_per_col_ map from the HdfsParquetScanner::Close method into the ParquetColumnReader class. Now ParquetColumnReader updates the map directly each time it reads a data page. This avoids the race condition described in IMPALA-7816. * Created a new method in parse_util.py to help parse summary stat counters. Re-factored the Python tests a bit so they are easier to read. http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-parquet-scanner.h@661 PS4, Line 661: /// Update the counter parquet_compressed_page_size_counter_ with the given compressed > Mention that these are called by the column readers? Done http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-parquet-scanner.h@663 PS4, Line 663: void UpdateCompressedPageSizeCounter(int64_t compressed_page_size); > These should probably be int64_t for good measure Done http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@354 PS4, Line 354: /// to track the number of bytes read per column and is meant to be called by > prefix with /// Done http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@356 PS4, Line 356: void UpdateUncompressedBytesRead(SlotId slot_id, int64_t uncompressed_bytes_read); > Make the byte counters int64_t? Also use a const ref? Done http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@541 PS4, Line 541: with the size of each column. > _bytes_read_... Done http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@554 PS4, Line 554: Status status_; > nit: Map from SlotId to pair... Done http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@557 PS4, Line 557: uncompr > nit: typo Done http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@558 PS4, Line 558: ap is used to upda > misses trailing _ Done http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.cc@907 PS4, Line 907: > loop per const ref, you can use (const auto& bytes_read : ..) if you'd like Done http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.cc@953 PS4, Line 953: > In C++ you can access a map using [key] and if the entry doesn't exist it w Done http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@726 PS4, Line 726: > Shouldn't the [Un] be without [] here, i.e. we don't update the ParquetComp Done http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@737 PS4, Line 737: anner thread for the invalid f > assert that len(compressed_page_size_summaries) > 0 ? Done http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@738 PS4, Line 738: before we mark the whole scan complete.""" > for value in summary: ? Done http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@768 PS4, Line 768: n re > remove the []? Done http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@789 PS4, Line 789: summary['samples'] > 0 > With named matches and groupdicts, this could be summary['max'] Done http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@798 PS4, Line 798: > This seems to be the same in all tests. Move to the helper method? Done http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@811 PS4, Line 811: r > I don't understand what the "s" is doing Done http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@813 PS4, Line 813: summary['samples'] > 0 > If you convert this into a verbose regex with named fields it might become Done. Created a new method in parse_util.py that uses verbose regex and groupdicts. -- To view, visit http://gerrit.cloudera.org:8080/11575 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I322f9b324b6828df28e5caf79529085c43d7c817 Gerrit-Change-Number: 11575 Gerrit-PatchSet: 6 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Comment-Date: Wed, 07 Nov 2018 18:19:53 +0000 Gerrit-HasComments: Yes
