Kurt Deschler has posted comments on this change. ( http://gerrit.cloudera.org:8080/20850 )
Change subject: IMPALA-12533: Support row materialization outside of fetch lock ...................................................................... Patch Set 11: (7 comments) http://gerrit.cloudera.org:8080/#/c/20850/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20850/11//COMMIT_MSG@21 PS11, Line 21: (default 0.2). > As there is only minimal testing in the patch, it may better to disable thi This is effectively inactive without setting options in the Hive JDBC driver or setting the threshold to 0, and that potential performance regressions are minimal (see comments below). I would rather leave this with the threshold set as-is so that it is not required to set additional options from JDBC to benefit from this enhancement. http://gerrit.cloudera.org:8080/#/c/20850/11//COMMIT_MSG@23 PS11, Line 23: Once enabled, delayed materialization will be used for the remainder of : the rows fetched for the current query > Just brain dump, I don't think that the patch needs to be changed: The Hive JDBC driver ramps up threads slowly as row batches are fetched. That should avoid most of these cases. http://gerrit.cloudera.org:8080/#/c/20850/11//COMMIT_MSG@24 PS11, Line 24: It is expected that the default : threshold will never be reached with a single fetch stream. > It seems safer to guarantee this by detecting parallel fetches and only ena There are no parallel fetches as they are serialized by the fetch lock. Using the fetch lock acquisition latency is superior to counting how many streams are referencing the statement since we can avoid delayed materialization when the benefits would be negligible. The current threshold (20%) will typically only delay when there are more than 2 streams. I also measured performance with narrow and wide table and the performance difference was very negligible, <5% for degenerate wide table cases and <1% for narrow tables. http://gerrit.cloudera.org:8080/#/c/20850/11//COMMIT_MSG@31 PS11, Line 31: -Manual testing with Multi-stream Hive JDBC driver. This showed a 20% > A follow up Jira would be nice about creating some simple multi thread clie Possibly once the Hive driver changes are available in a release. I did test manually with asan and tsan and the multi-stream driver. http://gerrit.cloudera.org:8080/#/c/20850/11/be/src/runtime/row-batch.h File be/src/runtime/row-batch.h: http://gerrit.cloudera.org:8080/#/c/20850/11/be/src/runtime/row-batch.h@329 PS11, Line 329: ror > typo Done http://gerrit.cloudera.org:8080/#/c/20850/11/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/20850/11/be/src/service/impala-hs2-server.cc@254 PS11, Line 254: RETURN_IF_ERROR(result_set->Flush()); : result_flush_timer.Stop(); : query_handle->AddResultFlushTime(result_flush_timer.ElapsedTime()); > the timer is not stopped in case line 254 returns an error This is not necessary since we would not want to report flush time in the case of an error. http://gerrit.cloudera.org:8080/#/c/20850/11/be/src/service/query-result-set.cc File be/src/service/query-result-set.cc: http://gerrit.cloudera.org:8080/#/c/20850/11/be/src/service/query-result-set.cc@56 PS11, Line 56: class RowBatchContext { > Can you add some comment about the role of this class? Done -- To view, visit http://gerrit.cloudera.org:8080/20850 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If9512aa6022dbcf81e7eb5f559853b89fe80bd23 Gerrit-Change-Number: 20850 Gerrit-PatchSet: 11 Gerrit-Owner: Kurt Deschler <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Kurt Deschler <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Comment-Date: Mon, 05 Feb 2024 19:23:25 +0000 Gerrit-HasComments: Yes
