Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8003 )
Change subject: IMPALA-3200: [DOCS] Document user-facing aspects of new buffer pool ...................................................................... Patch Set 5: (10 comments) http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_buffer_pool_limit.xml File docs/topics/impala_buffer_pool_limit.xml: http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_buffer_pool_limit.xml@53 PS5, Line 53: (This default is represented by an empty value, : sometimes incorrectly reported as 0 in the output of the <codeph>SET</codeph> : command.) This area of behaviour in the code is a bit of a mess and subject to change. The output will be fixed in 5.14 and you can't "set buffer_pool_limit=" until 5.14, so I think this creates potential for further confusion. Maybe best to leave it undocumented or say something like "(This default takes affect if the options is unset)". http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_default_spillable_buffer_size.xml File docs/topics/impala_default_spillable_buffer_size.xml: http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_default_spillable_buffer_size.xml@72 PS5, Line 72: spill-to-disk operations. If spill-to-disk operations result in memory pressure The last sentence I think creates a lot of potential for confusion. The conditions where this can make a difference do not necessarily involve spilling to disk and don't necessary involve a memory constraint on the node itself - it could prevent the query hitting a mem_limit even if there is plenty of free memory. It seems sufficient to me to say that reducing the value may reduce memory consumption - spelling out the exactly circumstances where it will helpful would require a lot of words and be subject to change. http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_default_spillable_buffer_size.xml@73 PS5, Line 73: DataNodes Impala daemons? http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_max_row_size.xml File docs/topics/impala_max_row_size.xml: http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_max_row_size.xml@40 PS5, Line 40: Applies when unpacking column values from : a row read from disk, or when constructing intermediate or final rows in the : result set. Setting an upper limit prevents out-of-control memory use when : accessing columns containing huge strings. This summary is wrong (the behaviour is subtle). The query option has zero effect on scans and the maximum row size isn't necessarily enforced. The point is to force Impala to reserve enough memory to process rows of this size. If the rows are larger It's a lower bound on the upper bound, rather than an upper bound. If it's set to a high value then Impala may reserve more memory than is necessary to execute the query. http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_max_row_size.xml@69 PS5, Line 69: so it is possible This is correct but I'm concerned some users might take it the wrong way. To me this sounds like it's "possible but unlikely", whereas I think in many case that queries will succeed. My concern is that some users might see this and start worrying about their queries. Maybe something like "in many cases" http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_min_spillable_buffer_size.xml File docs/topics/impala_min_spillable_buffer_size.xml: http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_min_spillable_buffer_size.xml@49 PS5, Line 49: Same comments as default_spillable_buffer_size. http://gerrit.cloudera.org:8080/#/c/8003/4/docs/topics/impala_scalability.xml File docs/topics/impala_scalability.xml: http://gerrit.cloudera.org:8080/#/c/8003/4/docs/topics/impala_scalability.xml@701 PS4, Line 701: If your queries experience substantial performance overhead due to spilling, enable the > OK, I'll skip over the output piece and see if users ask for more details t I'm ok with this - it seems like we need more of a systematic effort to pin down the semantics of the profile and fdocument it. http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_scalability.xml File docs/topics/impala_scalability.xml: http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_scalability.xml@570 PS5, Line 570: <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. I think this background is helpful for users to understand the implications of spill to disk. I had a go at conveying similar information with the new infrastructure: <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. On each host that participates in the query, each such operator in a query requires memory to store rows of data and other data structures. Impala reserves a certain amount of memory up-front for each operator that supports spill-to-disk that is sufficient to execute the operator. If an operator accumulates more data than can fit in the reserved memory, it can either reserve more memory to continue processing data in memory or start spilling data to temporary scratch files on disk. Thus operators with spill-to-disk support can adapt to different memory constraints by using however much memory is available to speed up execution yet tolerate low memory conditions by spilling data to disk. The amount data depends on the portion of the data being handled by that host, and thus the operator may end up consuming different amounts of memory on different hosts. </p> http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_scalability.xml@572 PS5, Line 572: while building the data structure to process the aggregation or join operation. The amount Could we keep this around but make clear that it describes the behaviour in 2.9 and earlier? It seems useful for people looking at the latest docs or for contrasting behaviour. http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_scalability.xml@627 PS5, Line 627: <codeph>WriteIoBytes</codeph> counter reports how much data was written to disk for each operator Should we mention the old names parenthetically (it was ScratchBytesWritten in 2.9 and BytesWritten earlier) so that the docs are still usable for older Impala versions. -- To view, visit http://gerrit.cloudera.org:8080/8003 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I49323f8ffbff3e195058e88762eedbb1fcb1bc0e Gerrit-Change-Number: 8003 Gerrit-PatchSet: 5 Gerrit-Owner: John Russell <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: John Russell <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Thu, 05 Oct 2017 21:22:45 +0000 Gerrit-HasComments: Yes
