Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11575 )

Change subject: IMPALA-6964: Track stats about column and page sizes in Parquet 
reader
......................................................................


Patch Set 6:

(18 comments)

A few high level changes:

* After I found the race condition in IMPALA-7816, I realized it would affect 
this patch as well. So I moved the updates of the bytes_read_per_col_ map from 
the HdfsParquetScanner::Close method into the ParquetColumnReader class. Now 
ParquetColumnReader updates the map directly each time it reads a data page. 
This avoids the race condition described in IMPALA-7816.

* Created a new method in parse_util.py to help parse summary stat counters. 
Re-factored the Python tests a bit so they are easier to read.

http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-parquet-scanner.h@661
PS4, Line 661:   /// Update the counter parquet_compressed_page_size_counter_ 
with the given compressed
> Mention that these are called by the column readers?
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-parquet-scanner.h@663
PS4, Line 663:   void UpdateCompressedPageSizeCounter(int64_t 
compressed_page_size);
> These should probably be int64_t for good measure
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@354
PS4, Line 354:   /// to track the number of bytes read per column and is meant 
to be called by
> prefix with ///
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@356
PS4, Line 356:   void UpdateUncompressedBytesRead(SlotId slot_id, int64_t 
uncompressed_bytes_read);
> Make the byte counters int64_t? Also use a const ref?
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@541
PS4, Line 541:  with the size of each column.
> _bytes_read_...
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@554
PS4, Line 554:   Status status_;
> nit: Map from SlotId to pair...
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@557
PS4, Line 557: uncompr
> nit: typo
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.h@558
PS4, Line 558: ap is used to upda
> misses trailing _
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.cc@907
PS4, Line 907:
> loop per const ref, you can use (const auto& bytes_read : ..) if you'd like
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/be/src/exec/hdfs-scan-node-base.cc@953
PS4, Line 953:
> In C++ you can access a map using [key] and if the entry doesn't exist it w
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@726
PS4, Line 726:
> Shouldn't the [Un] be without [] here, i.e. we don't update the ParquetComp
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@737
PS4, Line 737: anner thread for the invalid f
> assert that len(compressed_page_size_summaries) > 0 ?
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@738
PS4, Line 738:     before we mark the whole scan complete."""
> for value in summary:  ?
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@768
PS4, Line 768: n re
> remove the []?
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@789
PS4, Line 789:           summary['samples'] > 0
> With named matches and groupdicts, this could be summary['max']
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@798
PS4, Line 798:
> This seems to be the same in all tests. Move to the helper method?
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@811
PS4, Line 811: r
> I don't understand what the "s" is doing
Done


http://gerrit.cloudera.org:8080/#/c/11575/4/tests/query_test/test_scanners.py@813
PS4, Line 813:         summary['samples'] > 0
> If you convert this into a verbose regex with named fields it might become
Done. Created a new method in parse_util.py that uses verbose regex and 
groupdicts.



--
To view, visit http://gerrit.cloudera.org:8080/11575
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I322f9b324b6828df28e5caf79529085c43d7c817
Gerrit-Change-Number: 11575
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Comment-Date: Wed, 07 Nov 2018 18:19:53 +0000
Gerrit-HasComments: Yes

Reply via email to