Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17638 )
Change subject: WiP: IMPALA-9495: Support struct in select list for ORC tables ...................................................................... Patch Set 7: (13 comments) I have a bunch of comments about BE. Haven't digested FE parts yet. http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG@10 PS7, Line 10: JSON Can you add a separate section for "returning structs to cliens"? I think that it is important to mention that struct-> JSON string conversion happens on server side. The HS2 protocol supports returning structs (https://github.com/apache/impala/blob/master/common/thrift/hive-1-api/TCLIService.thrift#L220 ), so this conversion could be also done on client side. Hive also does it on server side AFAIK. In the future it may make sense to add the feature of returning it as Thrift, based on a query option, as some clients may use the structs directly instead of printing them as string. http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG@14 PS7, Line 14: '{"int_struct_member":12,"string_struct_member":"string value"}' Can you add an example that has a nested struct? http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG@20 PS7, Line 20: for each of the struct's children. The struct SlotDescriptor points to the nit: wrap at 72 http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG@34 PS7, Line 34: -- Internal representation of a struct: I would move this earlier, as this seems to be the most critical decision. It confused me a bit that tuples were mentioned earlier, as it suggested to me that members of a struct are held in a separate tuple for the struct, not the topmost tuple - I didn't realize that struct tuples are "just virtual tuples" and are part of the topmost tuple's layout. Flattening the whole struct and holding all members in the topmost slot is great for speed and simplicity, but it also means that members need memory even if their parent struct is NULL. http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG@35 PS7, Line 35: When scanning a struct the rowbatch will hold the values of the struct's : children as if they were queried one by one directly in the select list. Do structs affect the order of slots? Generally we order the slots by size and not by the order in the select list. Will the slots of the members of a given struct always follow each other in the same order as the struct's definition? http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/exprs/scalar-expr-evaluator.cc File be/src/exprs/scalar-expr-evaluator.cc: http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/exprs/scalar-expr-evaluator.cc@369 PS7, Line 369: StructVal v = expr.GetStructVal(this, row); : if (v.is_null) return nullptr; : result_.struct_val = v; optimization: this can potentially lead to several copies and heap calls The heap calls could be avoided by using FunctionContextImpl ::AllocateForResults() in StructVal, similarly to StringVal. This would mean that the memory will be deallocated only when the whole result pool is cleared, so we could share the pointer until that time. The assignment operator could simply overwrite the pointer without caring about deallocation. See udf.h for my other comments about StructVal http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/exprs/scalar-expr.h File be/src/exprs/scalar-expr.h: http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/exprs/scalar-expr.h@361 PS7, Line 361: mutable ScalarExprEvaluator* eval_ = nullptr; I am unsure about thread safety -AFAIK we have a single expr tree in a fragment that can be used by multiple fragment instances, while all fragment instances have their own ScalarExprEvaluator. The comments at line 316/327 have some info about thread handling. http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/runtime/raw-value.h File be/src/runtime/raw-value.h: http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/runtime/raw-value.h@129 PS7, Line 129: /// Writes the bytes of a given value into the slot of a tuple. : /// For string values, the string data is copied into memory allocated from 'pool' : /// only if pool is non-NULL. : : /// Wrapper function for Write() to handle struct slots and its children. Additionally, : /// gathers the string slots of the slot tree into 'string_values'. The two comments seem a bit redundant. http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/udf/udf.h File be/src/udf/udf.h: http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/udf/udf.h@740 PS7, Line 740: // A struct is represented by a vector of pointers where these pointers point to the I have some dilemmas about how to design StructVal. As it is in udf.h, it is expected to be used in UDFs and changing it later is considered a breaking change. For this reason I would prefer to move to udf-internal.h (CollectionVal also lives there). This would allow us to change it later and add it to udf.h only when it becomes mature enough to be used by external developers. For example to actually write a useful UDF, e.g. to_xml(), the names of the fields would be also needed, and I think that it is not possible to get that metadata through the UDF interface at the moment (FunctionContext has TypeDesc* GetArgType(int arg_idx), but TypeDesc does not contain struct related infos). Another issue is that while the ownership of the pointer arrays is clear, the ownership of the values they point to is not. As I understand in the current use cases they are owned by ScalarExprEvaluators which overwrites when it processes a new row. If a UDF would create and return a StructVal, then some mechanism would be needed to delete these values. Designing an "ideal" StructVal seems hard due to conflicting goals: - making it fast (e.g. by simply using a pointer to the existing tuples) - not exposing the actual layout to UDF developers to allow us to change it later - making it easy to use inside Impala / for UDF developers I don't want to block this change with this decision, so I would prefer to make it internal. http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/udf/udf.h@745 PS7, Line 745: children Can you add more info? What are the children exactly? Other AnyVals, or members of tuples? Does nullptr mean NULL? http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/udf/udf.h@770 PS7, Line 770: free StringVal uses the alloc methods of FunctionContext - is there a reason for using malloc/free directly? StringVal has no destructor, and AnyVal also doesn't have a virtual destructor, so I am not sure that this will be called in all cases. See my comment in scalar-expr-evaluator.cc about a possible way to change memory management. http://gerrit.cloudera.org:8080/#/c/17638/7/fe/src/main/java/org/apache/impala/analysis/SlotRef.java File fe/src/main/java/org/apache/impala/analysis/SlotRef.java: http://gerrit.cloudera.org:8080/#/c/17638/7/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@156 PS7, Line 156: rootTable instanceof FeFsTable Can you move this to a precondition? We shouldn't see structs if it is not FeFsTable http://gerrit.cloudera.org:8080/#/c/17638/7/testdata/ComplexTypesTbl/structs.orc File testdata/ComplexTypesTbl/structs.orc: http://gerrit.cloudera.org:8080/#/c/17638/7/testdata/ComplexTypesTbl/structs.orc@1 PS7, Line 1: ORC Do we need to have these data files, can't we create them with Hive? I think that in the long term table generation will be better, as it would allow us to have large tables without adding huge files to the Impala repository. The current tables are too small IMO to test things like parallelism and spilling. If we want to keep them as files, I would prefer to move them to testdata/data (/ComplexTypesTbl ? ) and to be mention them in the README. -- To view, visit http://gerrit.cloudera.org:8080/17638 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0fbe56bdcd372b72e99c0195d87a818e7fa4bc3a Gerrit-Change-Number: 17638 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Comment-Date: Fri, 06 Aug 2021 14:40:19 +0000 Gerrit-HasComments: Yes
