Tim Armstrong has posted comments on this change. Change subject: IMPALA-3931: arbitrary fixed-size uda intermediate types ......................................................................
Patch Set 12: (9 comments) http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/codegen/codegen-anyval.cc File be/src/codegen/codegen-anyval.cc: PS12, Line 555: the StringVal is already set up > I'm not sure I follow this comment given that this function is setting the Done http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/codegen/codegen-anyval.h File be/src/codegen/codegen-anyval.h: PS12, Line 52: TYPE_FIXED_UDA_INTERMEDIATE > why isn't this a byte array if that's how it is represented? I guess we wan It could be a *Val which has a ptr member only (various places assume that the intermediate value is a subclass of AnyVal). StringVal is convenient and consistent with how CHAR(n) is passed around. I thought about changing that but it wasn't clear to me that it significantly improved the code so I thought I'd start with the smaller delta. It does require changing the interface of a lot of the aggregate functions and updating the corresponding symbols everywhere to match. PS12, Line 193: matching : /// native type > isn't the native type for intermediate just char array? By "native" do we Yes it is, that would be another example of a native type along with TimestampValue or StringValue. I agree that the "Native" terminology seems to be specific to this class. Elsewhere it's "raw value" or "slot". Happy to rename to whatever seems clearer. PS12, Line 195: raw_val_ptr > is this a pointer to the slot itself? if so, why not just specify it like t I thought about this more and I don't think it even makes sense to call StoreToNativePtr() with FIXED_UDA_INTERMEDIATE, since the caller has to understand the memory management and should know that the buffer was already written to. I think this is a reason to separate FIXED_UDA_INTERMEDIATE from CHAR - the memory management scheme seems somewhat specific to this use case and it's not clear that the requirements will always be the same as CHAR. http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/codegen/llvm-codegen.h File be/src/codegen/llvm-codegen.h: PS12, Line 240: the > delete Done http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: Line 888 > it might be a bit confusing that now StringVal's can have different managem Yeah I think it's potentially confusing. At a minimum we'd probably want to revisit before exposing it to UDAs. If I was to do it I'd probably do it as a follow-on patch since this already touches many lines of code. http://gerrit.cloudera.org:8080/#/c/7526/12/be/src/udf/udf.h File be/src/udf/udf.h: PS12, Line 81: TYPE_FIXED_BUFFER > should we rename that to TYPE_CHAR, or would that break compat? It seems better to keep them separate. They're not the same data type conceptually. The semantics might be the same for the intermediate value case, but I find it easier to reason about if we separate the types then share the implementation in cases where the requirements are the same. PS12, Line 346: For UDAs that need a complex data structure as the intermediate state, the : /// intermediate type should be string and the UDA can cast the ptr to the structure : /// it is using. : /// : /// Memory Management: For allocations that are not returned to Impala, the UDA should use : /// the FunctionContext::Allocate()/Free() methods. In general, Allocate() is called in : /// Init(), and then Free() must be called in both Serialize() and Finalize(), since : /// either of these functions may be called to clean up the state. For StringVal : /// allocations returned to Impala (e.g. returned by UdaSerialize()), the UDA should : /// allocate the result via StringVal(FunctionContext*, int) ctor or the function : /// StringVal::CopyFrom(FunctionContext*, const uint8_t*, size_t) and Impala will : /// automatically handle freeing it. > does any of that need updating to account for the fact that these new Strin It seems confusing to document something that actual UDAs can't use. The cleanest solution I can think of is to add a new intermediate type in udf-internal.h and document the semantics there. http://gerrit.cloudera.org:8080/#/c/7526/12/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: Line 57: private static final int RANK_INTERMEDIATE_SIZE = 16; > are these constants verified for consistency in any way? We have DCHECKs in the backend for dst->len, so we'd find out as soon as we ran the function. -- To view, visit http://gerrit.cloudera.org:8080/7526 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ife90cf27989f98ffb5ef5c39f1e09ce92e8cb87c Gerrit-PatchSet: 12 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
