[Impala-ASF-CR] IMPALA-3931: arbitrary fixed-size uda intermediate types

Tim Armstrong (Code Review) Tue, 15 Aug 2017 17:11:58 -0700

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3931: arbitrary fixed-size uda intermediate types
......................................................................

Patch Set 12:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/codegen/codegen-anyval.cc
File be/src/codegen/codegen-anyval.cc:

PS12, Line 555: the StringVal is already set up
> I'm not sure I follow this comment given that this function is setting the
Done

http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/codegen/codegen-anyval.h
File be/src/codegen/codegen-anyval.h:

PS12, Line 52: TYPE_FIXED_UDA_INTERMEDIATE
> why isn't this a byte array if that's how it is represented? I guess we wan
It could be a *Val which has a ptr member only (various places assume that the
intermediate value is a subclass of AnyVal). StringVal is convenient and
consistent with how CHAR(n) is passed around.

I thought about changing that but it wasn't clear to me that it significantly
improved the code so I thought I'd start with the smaller delta. It does
require changing the interface of a lot of the aggregate functions and updating
the corresponding symbols everywhere to match.

PS12, Line 193: matching
: /// native type
> isn't the native type for intermediate just char array? By "native" do we
Yes it is, that would be another example of a native type along with
TimestampValue or StringValue.

I agree that the "Native" terminology seems to be specific to this class.
Elsewhere it's "raw value" or "slot". Happy to rename to whatever seems clearer.

PS12, Line 195: raw_val_ptr
> is this a pointer to the slot itself? if so, why not just specify it like t
I thought about this more and I don't think it even makes sense to call
StoreToNativePtr() with FIXED_UDA_INTERMEDIATE, since the caller has to
understand the memory management and should know that the buffer was already
written to.

I think this is a reason to separate FIXED_UDA_INTERMEDIATE from CHAR - the
memory management scheme seems somewhat specific to this use case and it's not
clear that the requirements will always be the same as CHAR.

http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/codegen/llvm-codegen.h
File be/src/codegen/llvm-codegen.h:

PS12, Line 240: the
> delete
Done

http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

Line 888
> it might be a bit confusing that now StringVal's can have different managem
Yeah I think it's potentially confusing. At a minimum we'd probably want to
revisit before exposing it to UDAs.

If I was to do it I'd probably do it as a follow-on patch since this already
touches many lines of code.

http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/udf/udf.h
File be/src/udf/udf.h:

PS12, Line 81: TYPE_FIXED_BUFFER
> should we rename that to TYPE_CHAR, or would that break compat?
It seems better to keep them separate. They're not the same data type
conceptually. The semantics might be the same for the intermediate value case,
but I find it easier to reason about if we separate the types then share the
implementation in cases where the requirements are the same.

PS12, Line 346: For UDAs that need a complex data structure as the intermediate
state, the
: /// intermediate type should be string and the UDA can cast the
ptr to the structure
: /// it is using.
: ///
: /// Memory Management: For allocations that are not returned to
Impala, the UDA should use
: /// the FunctionContext::Allocate()/Free() methods. In general,
Allocate() is called in
: /// Init(), and then Free() must be called in both Serialize()
and Finalize(), since
: /// either of these functions may be called to clean up the
state. For StringVal
: /// allocations returned to Impala (e.g. returned by
UdaSerialize()), the UDA should
: /// allocate the result via StringVal(FunctionContext*, int)
ctor or the function
: /// StringVal::CopyFrom(FunctionContext*, const uint8_t*,
size_t) and Impala will
: /// automatically handle freeing it.
> does any of that need updating to account for the fact that these new Strin
It seems confusing to document something that actual UDAs can't use. The
cleanest solution I can think of is to add a new intermediate type in
udf-internal.h and document the semantics there.

http://gerrit.cloudera.org:8080/#/c/7526/12/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

Line 57: private static final int RANK_INTERMEDIATE_SIZE = 16;
> are these constants verified for consistency in any way?
We have DCHECKs in the backend for dst->len, so we'd find out as soon as we ran
the function.

--
To view, visit http://gerrit.cloudera.org:8080/7526
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ife90cf27989f98ffb5ef5c39f1e09ce92e8cb87c
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3931: arbitrary fixed-size uda intermediate types

Reply via email to