Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3931: arbitrary fixed-size uda intermediate types
......................................................................


Patch Set 12:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/codegen/codegen-anyval.cc
File be/src/codegen/codegen-anyval.cc:

PS12, Line 555: the StringVal is already set up
> I'm not sure I follow this comment given that this function is setting the 
Done


http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/codegen/codegen-anyval.h
File be/src/codegen/codegen-anyval.h:

PS12, Line 52: TYPE_FIXED_UDA_INTERMEDIATE
> why isn't this a byte array if that's how it is represented? I guess we wan
It could be a *Val which has a ptr member only (various places assume that the 
intermediate value is a subclass of AnyVal). StringVal is convenient and 
consistent with how CHAR(n) is passed around.

I thought about changing that but it wasn't clear to me that it significantly 
improved the code so I thought I'd start with the smaller delta. It does 
require changing the interface of a lot of the aggregate functions and updating 
the corresponding symbols everywhere to match.


PS12, Line 193: matching
              :   /// native type
> isn't the native type for intermediate just char array?  By "native" do we 
Yes it is, that would be another example of a native type along with 
TimestampValue or StringValue.

I agree that the "Native" terminology seems to be specific to this class. 
Elsewhere it's "raw value" or "slot". Happy to rename to whatever seems clearer.


PS12, Line 195: raw_val_ptr
> is this a pointer to the slot itself? if so, why not just specify it like t
I thought about this more and I don't think it even makes sense to call 
StoreToNativePtr() with FIXED_UDA_INTERMEDIATE, since the caller has to 
understand the memory management and should know that the buffer was already 
written to.

I think this is a reason to separate FIXED_UDA_INTERMEDIATE from CHAR - the 
memory management scheme seems somewhat specific to this use case and it's not 
clear that the requirements will always be the same as CHAR.


http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/codegen/llvm-codegen.h
File be/src/codegen/llvm-codegen.h:

PS12, Line 240: the
> delete
Done


http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

Line 888
> it might be a bit confusing that now StringVal's can have different managem
Yeah I think it's potentially confusing. At a minimum we'd probably want to 
revisit before exposing it to UDAs.

If I was to do it I'd probably do it as a follow-on patch since this already 
touches many lines of code.


http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/udf/udf.h
File be/src/udf/udf.h:

PS12, Line 81: TYPE_FIXED_BUFFER
> should we rename that to TYPE_CHAR, or would that break compat?
It seems better to keep them separate. They're not the same data type 
conceptually. The semantics might be the same for the intermediate value case, 
but I find it easier to reason about if we separate the types then share the 
implementation in cases where the requirements are the same.


PS12, Line 346: For UDAs that need a complex data structure as the intermediate 
state, the
              : /// intermediate type should be string and the UDA can cast the 
ptr to the structure
              : /// it is using.
              : ///
              : /// Memory Management: For allocations that are not returned to 
Impala, the UDA should use
              : /// the FunctionContext::Allocate()/Free() methods. In general, 
Allocate() is called in
              : /// Init(), and then Free() must be called in both Serialize() 
and Finalize(), since
              : /// either of these functions may be called to clean up the 
state. For StringVal
              : /// allocations returned to Impala (e.g. returned by 
UdaSerialize()), the UDA should
              : /// allocate the result via StringVal(FunctionContext*, int) 
ctor or the function
              : /// StringVal::CopyFrom(FunctionContext*, const uint8_t*, 
size_t) and Impala will
              : /// automatically handle freeing it.
> does any of that need updating to account for the fact that these new Strin
It seems confusing to document something that actual UDAs can't use. The 
cleanest solution I can think of is to add a new intermediate type in 
udf-internal.h and document the semantics there.


http://gerrit.cloudera.org:8080/#/c/7526/12/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

Line 57:   private static final int RANK_INTERMEDIATE_SIZE = 16;
> are these constants verified for consistency in any way?
We have DCHECKs in the backend for dst->len, so we'd find out as soon as we ran 
the function.


-- 
To view, visit http://gerrit.cloudera.org:8080/7526
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ife90cf27989f98ffb5ef5c39f1e09ce92e8cb87c
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to