[Impala-ASF-CR] IMPALA-6964: Track stats about column and page sizes in Parquet reader

Sahil Takiar (Code Review) Mon, 19 Nov 2018 12:19:03 -0800

Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11575 )


Change subject: IMPALA-6964: Track stats about column and page sizes in Parquet 
reader
......................................................................


Patch Set 11:

(5 comments)

I rebased the patch so there are several unrelated changes that show up in the 
most recent diff.

The major change I made is to how locks are used in hdfs-scan-node-base.cc - 
I'm using a Read-Write lock to protect access to bytes_read_per_col_

http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.h@565
PS10, Line 565:   boost::scoped_ptr<MemPool> scan_node_pool_;
> Is it an option to collect the stats in the column readers and then pull th
I looked into it, and there doesn't seem a clean way to do it.

We can add an internal counter to each ParquetColumnReader that tracks the 
amount of [un]compressed bytes read, but ParquetColumnReaders are reset and 
re-used multiple times. So I'm not sure a pull model would work, each column 
reader would have to push its state before it gets reset.

file_type_counts_ is a similar data-structure to bytes_read_per_col_ and its 
using a single SpinLock to control access to a shared map.

file_type_counts_ is only updated once per scan-range though, so there is 
probably not much lock contention.

Since bytes_read_per_col_ is updated for each data-page, I'm using read-write 
locks which should lead to reduced lock-contention compared to a single 
SpinLock.


http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.cc@901
PS10, Line 901:   // column readers, and all column readers should have 
completed at this point; however,
> Do we need to acquire the lock here, too? It we don't, can you add a commen
I don't think its necessary to acquire a lock, but I updated the code to 
acquire it anyway. It only needs a read lock so it should be low overhead, and 
makes the code more defensive.


http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.cc@945
PS10, Line 945: Status 
HdfsScanNodeBase::ScanNodeDebugAction(TExecNodePhase::type phase) {
> nit: in this case I'd mention the type in the name to prevent confusion, i.
Done


http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.cc@947
PS10, Line 947: }
> It seems that unique_lock is the corresponding unique version of shared_loc
It's using a shared_lock because its using a shared_mutex. My understanding of 
the APIs is that you want to use shared_lock with shared_mutex (but I could be 
wrong).

Added the boost:: prefix


http://gerrit.cloudera.org:8080/#/c/11575/5/tests/util/parse_util.py
File tests/util/parse_util.py:

http://gerrit.cloudera.org:8080/#/c/11575/5/tests/util/parse_util.py@137
PS5, Line 137:      example of how this method should be used:
> I think it'd be nice to actually execute the example code. You could look f
Added to tests/infra/test_utils.py



--
To view, visit http://gerrit.cloudera.org:8080/11575
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I322f9b324b6828df28e5caf79529085c43d7c817
Gerrit-Change-Number: 11575
Gerrit-PatchSet: 11
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Comment-Date: Mon, 19 Nov 2018 20:18:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6964: Track stats about column and page sizes in Parquet reader

Reply via email to