[Impala-ASF-CR] IMPALA-3200: [DOCS] Document user-facing aspects of new buffer pool

Tim Armstrong (Code Review) Thu, 05 Oct 2017 14:24:05 -0700

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8003 )


Change subject: IMPALA-3200: [DOCS] Document user-facing aspects of new buffer 
pool
......................................................................


Patch Set 5:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_buffer_pool_limit.xml
File docs/topics/impala_buffer_pool_limit.xml:

http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_buffer_pool_limit.xml@53
PS5, Line 53: (This default is represented by an empty value,
            :       sometimes incorrectly reported as 0 in the output of the 
<codeph>SET</codeph>
            :       command.)
This area of behaviour in the code is a bit of a mess and subject to change. 
The output will be fixed in 5.14 and you can't "set buffer_pool_limit=" until 
5.14, so I think this creates potential for further confusion.

 Maybe best to leave it undocumented or say something like "(This default takes 
affect if the options is unset)".


http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_default_spillable_buffer_size.xml
File docs/topics/impala_default_spillable_buffer_size.xml:

http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_default_spillable_buffer_size.xml@72
PS5, Line 72:       spill-to-disk operations. If spill-to-disk operations 
result in memory pressure
The last sentence I think creates a lot of potential for confusion. The 
conditions where this can make a difference do not necessarily involve spilling 
to disk and don't necessary involve a memory constraint on the node itself - it 
could prevent the query hitting a mem_limit even if there is plenty of free 
memory.

It seems sufficient to me to say that reducing the value may reduce memory 
consumption - spelling out the exactly circumstances where it will helpful 
would require a lot of words and be subject to change.


http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_default_spillable_buffer_size.xml@73
PS5, Line 73: DataNodes
Impala daemons?


http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_max_row_size.xml
File docs/topics/impala_max_row_size.xml:

http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_max_row_size.xml@40
PS5, Line 40: Applies when unpacking column values from
            :       a row read from disk, or when constructing intermediate or 
final rows in the
            :       result set. Setting an upper limit prevents out-of-control 
memory use when
            :       accessing columns containing huge strings.
This summary is wrong (the behaviour is subtle). The query option has zero 
effect on scans and the maximum row size isn't necessarily enforced. The point 
is to force Impala to reserve enough memory to process rows of this size. If 
the rows are larger

It's a lower bound on the upper bound, rather than an upper bound.

If it's set to a high value then Impala may reserve more memory than is 
necessary to execute the query.


http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_max_row_size.xml@69
PS5, Line 69:  so it is possible
This is correct but I'm concerned some users might take it the wrong way. To me 
this sounds like it's "possible but unlikely", whereas I think in many case 
that queries will succeed. My concern is that some users might see this and 
start worrying about their queries. Maybe something like "in many cases"


http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_min_spillable_buffer_size.xml
File docs/topics/impala_min_spillable_buffer_size.xml:

http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_min_spillable_buffer_size.xml@49
PS5, Line 49:
Same comments as default_spillable_buffer_size.


http://gerrit.cloudera.org:8080/#/c/8003/4/docs/topics/impala_scalability.xml
File docs/topics/impala_scalability.xml:

http://gerrit.cloudera.org:8080/#/c/8003/4/docs/topics/impala_scalability.xml@701
PS4, Line 701:               If your queries experience substantial performance 
overhead due to spilling, enable the
> OK, I'll skip over the output piece and see if users ask for more details t
I'm ok with this - it seems like we need more of a systematic effort to pin 
down the semantics of the profile and fdocument it.


http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_scalability.xml
File docs/topics/impala_scalability.xml:

http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_scalability.xml@570
PS5, Line 570:         <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, 
and joins, use memory.
I think this background is helpful for users to understand the implications of 
spill to disk. I had a go at conveying similar information with the new 
infrastructure:

        <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use 
memory.
        On each host that participates in the query, each such operator in a 
query requires memory
        to store rows of data and other data structures. Impala reserves a 
certain amount of memory
        up-front for each operator that supports spill-to-disk that is 
sufficient to execute the
        operator. If an operator accumulates more data than can fit in the 
reserved memory, it
        can either reserve more memory to continue processing data in memory or 
start spilling
        data to temporary scratch files on disk. Thus operators with 
spill-to-disk support
        can adapt to different memory constraints by using however much memory 
is available
        to speed up execution yet tolerate low memory conditions by spilling 
data to disk.

        The amount data depends on the portion of the data being handled by 
that host, and thus
        the operator may end up consuming different amounts of memory on 
different hosts.
      </p>


http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_scalability.xml@572
PS5, Line 572:         while building the data structure to process the 
aggregation or join operation. The amount
Could we keep this around but make clear that it describes the behaviour in 2.9 
and earlier? It seems useful for people looking at the latest docs or for 
contrasting behaviour.


http://gerrit.cloudera.org:8080/#/c/8003/5/docs/topics/impala_scalability.xml@627
PS5, Line 627:               <codeph>WriteIoBytes</codeph> counter reports how 
much data was written to disk for each operator
Should we mention the old names parenthetically (it was ScratchBytesWritten in 
2.9 and BytesWritten earlier) so that the docs are still usable for older 
Impala versions.



--
To view, visit http://gerrit.cloudera.org:8080/8003
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I49323f8ffbff3e195058e88762eedbb1fcb1bc0e
Gerrit-Change-Number: 8003
Gerrit-PatchSet: 5
Gerrit-Owner: John Russell <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: John Russell <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Thu, 05 Oct 2017 21:22:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3200: [DOCS] Document user-facing aspects of new buffer pool

Reply via email to