Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/11575 )
Change subject: IMPALA-6964: Track stats about column and page sizes in Parquet reader ...................................................................... Patch Set 11: (5 comments) I rebased the patch so there are several unrelated changes that show up in the most recent diff. The major change I made is to how locks are used in hdfs-scan-node-base.cc - I'm using a Read-Write lock to protect access to bytes_read_per_col_ http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.h@565 PS10, Line 565: boost::scoped_ptr<MemPool> scan_node_pool_; > Is it an option to collect the stats in the column readers and then pull th I looked into it, and there doesn't seem a clean way to do it. We can add an internal counter to each ParquetColumnReader that tracks the amount of [un]compressed bytes read, but ParquetColumnReaders are reset and re-used multiple times. So I'm not sure a pull model would work, each column reader would have to push its state before it gets reset. file_type_counts_ is a similar data-structure to bytes_read_per_col_ and its using a single SpinLock to control access to a shared map. file_type_counts_ is only updated once per scan-range though, so there is probably not much lock contention. Since bytes_read_per_col_ is updated for each data-page, I'm using read-write locks which should lead to reduced lock-contention compared to a single SpinLock. http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.cc@901 PS10, Line 901: // column readers, and all column readers should have completed at this point; however, > Do we need to acquire the lock here, too? It we don't, can you add a commen I don't think its necessary to acquire a lock, but I updated the code to acquire it anyway. It only needs a read lock so it should be low overhead, and makes the code more defensive. http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.cc@945 PS10, Line 945: Status HdfsScanNodeBase::ScanNodeDebugAction(TExecNodePhase::type phase) { > nit: in this case I'd mention the type in the name to prevent confusion, i. Done http://gerrit.cloudera.org:8080/#/c/11575/10/be/src/exec/hdfs-scan-node-base.cc@947 PS10, Line 947: } > It seems that unique_lock is the corresponding unique version of shared_loc It's using a shared_lock because its using a shared_mutex. My understanding of the APIs is that you want to use shared_lock with shared_mutex (but I could be wrong). Added the boost:: prefix http://gerrit.cloudera.org:8080/#/c/11575/5/tests/util/parse_util.py File tests/util/parse_util.py: http://gerrit.cloudera.org:8080/#/c/11575/5/tests/util/parse_util.py@137 PS5, Line 137: example of how this method should be used: > I think it'd be nice to actually execute the example code. You could look f Added to tests/infra/test_utils.py -- To view, visit http://gerrit.cloudera.org:8080/11575 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I322f9b324b6828df28e5caf79529085c43d7c817 Gerrit-Change-Number: 11575 Gerrit-PatchSet: 11 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Comment-Date: Mon, 19 Nov 2018 20:18:30 +0000 Gerrit-HasComments: Yes
