Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4787: Optimize APPX_MEDIAN() memory usage
......................................................................


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6025/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

PS2, Line 1018: Returns a pointer to one p
> Make this one start with a capital letter. But other ones like begin() shou
I see that you're matching some casing with c++ style iterators, but we do 
follow the C++ style guide [1], which states you'd use camel case for functions 
unless it's a simple accessor/mutator, which can have the lower-cased names.

1: https://google.github.io/styleguide/cppguide.html#Function_Names


http://gerrit.cloudera.org:8080/#/c/6025/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

PS3, Line 1025: 
2 now


PS3, Line 1076: 
2


http://gerrit.cloudera.org:8080/#/c/6025/5/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

Line 1121:     int r = rand() % state->num_samples;
> I know you didn't author this line, but could you switch to a better PRNG t
There is a comment in the code about upgrading to mersenne twister, I think we 
can if it's now baked into c++11. However, what is the state/struct? Right now 
we keep the 'ranlux64_3 rng' state on the ResSampState, and we'd need to do the 
same for mt19937_64. We'd need to check that is reasonably sized. It might be 
worth doing that in a separate patch anyway.


PS5, Line 1122: ((double) state->source_size - r) / state->source_size
> I'm not yet convinced this is correct. I'm not convinced it is correct in H
The point of the big comment above is that this is not exact, and we're 
approximating the proper sampling in order to avoid maintaining a heap. The 
point is that this is picking some number proportional to the size of the 
source input. I'm sure you can find mathematical flaws in this approximation, 
and It's not clear to me how bad the error is, but the original use of this 
function was to generate coarse histograms so we really wanted to optimize for 
that use case, especially knowing that planner heuristics are only heuristics 
to begin with. Over time we've come to want more from this code, so it's 
probably worth reevaluating whether or not this approach still makes sense. 
There's a reason we even call the median function "APPX_MEDIAN". I don't think 
we give any guarantees beyond 'it is approximate': 
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_appx_median.html


-- 
To view, visit http://gerrit.cloudera.org:8080/6025
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I99adaad574d4fb0a3cf38c6cbad8b2a23df12968
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Taras Bobrovytsky <[email protected]>
Gerrit-HasComments: Yes

Reply via email to