Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17638 )

Change subject: WiP: IMPALA-9495: Support struct in select list for ORC tables
......................................................................


Patch Set 7:

(13 comments)

I have a bunch of comments about BE. Haven't digested FE parts yet.

http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG@10
PS7, Line 10: JSON
Can you add a separate section for "returning structs to cliens"? I think that 
it is important to mention that struct-> JSON string conversion happens on 
server side. The HS2 protocol supports returning structs 
(https://github.com/apache/impala/blob/master/common/thrift/hive-1-api/TCLIService.thrift#L220
 ), so this conversion could be also done on client side. Hive also does it on 
server side AFAIK. In the future it may make sense to add the feature of 
returning it as Thrift, based on a query option, as some clients may use the 
structs directly instead of printing them as string.


http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG@14
PS7, Line 14: '{"int_struct_member":12,"string_struct_member":"string value"}'
Can you add an example that has a nested struct?


http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG@20
PS7, Line 20: for each of the struct's children. The struct SlotDescriptor 
points to the
nit: wrap at 72


http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG@34
PS7, Line 34: -- Internal representation of a struct:
I would move this earlier, as this seems to be the most critical decision. It 
confused me a bit that tuples were mentioned earlier, as it suggested to me 
that members of a struct are held in a separate tuple for the struct, not the 
topmost tuple - I didn't realize that struct tuples are "just virtual tuples" 
and are part of the topmost tuple's layout.

Flattening the whole struct and holding all members in the topmost slot is 
great for speed and simplicity, but it also means that members need memory even 
if their parent struct is NULL.


http://gerrit.cloudera.org:8080/#/c/17638/7//COMMIT_MSG@35
PS7, Line 35: When scanning a struct the rowbatch will hold the values of the 
struct's
            : children as if they were queried one by one directly in the 
select list.
Do structs affect the order of slots? Generally we order the slots by size and 
not by the order in the select list. Will the slots of the members of a given 
struct always follow each other in the same order as the struct's definition?


http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/exprs/scalar-expr-evaluator.cc
File be/src/exprs/scalar-expr-evaluator.cc:

http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/exprs/scalar-expr-evaluator.cc@369
PS7, Line 369:       StructVal v = expr.GetStructVal(this, row);
             :       if (v.is_null) return nullptr;
             :       result_.struct_val = v;
optimization: this can potentially lead to several copies and heap calls

The heap calls could be avoided by using FunctionContextImpl 
::AllocateForResults() in StructVal, similarly to StringVal. This would mean 
that the memory will be deallocated only when the whole result pool is cleared, 
so we could share the pointer until that time. The assignment operator could 
simply overwrite the pointer without caring about deallocation.

See udf.h for my other comments about StructVal


http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/exprs/scalar-expr.h
File be/src/exprs/scalar-expr.h:

http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/exprs/scalar-expr.h@361
PS7, Line 361:   mutable ScalarExprEvaluator* eval_ = nullptr;
I am unsure about thread safety -AFAIK we have a single expr tree in a fragment 
that can be used by multiple fragment instances, while all fragment instances 
have their own ScalarExprEvaluator. The comments at line 316/327 have some info 
about thread handling.


http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/runtime/raw-value.h
File be/src/runtime/raw-value.h:

http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/runtime/raw-value.h@129
PS7, Line 129:   /// Writes the bytes of a given value into the slot of a tuple.
             :   /// For string values, the string data is copied into memory 
allocated from 'pool'
             :   /// only if pool is non-NULL.
             :
             :   /// Wrapper function for Write() to handle struct slots and 
its children. Additionally,
             :   /// gathers the string slots of the slot tree into 
'string_values'.
The two comments seem a bit redundant.


http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/udf/udf.h
File be/src/udf/udf.h:

http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/udf/udf.h@740
PS7, Line 740: // A struct is represented by a vector of pointers where these 
pointers point to the
I have some dilemmas about how to design StructVal.

As it is in udf.h, it is expected to be used in UDFs and changing it later is 
considered a breaking change. For this reason I would prefer to move to 
udf-internal.h (CollectionVal  also lives there). This would allow us to change 
it later and add it to udf.h only when it becomes mature enough to be used by 
external developers.

For example to actually write a useful UDF, e.g. to_xml(), the names of the 
fields would be also needed, and I think that it is not possible to get that 
metadata through the UDF interface at the moment (FunctionContext has TypeDesc* 
GetArgType(int arg_idx), but TypeDesc does not contain struct related infos).

Another issue is that while the ownership of the pointer arrays is clear, the 
ownership of the values they point to is not. As I understand in the current 
use cases they are owned by ScalarExprEvaluators which overwrites when it 
processes a new row. If a UDF would create and return a StructVal, then some 
mechanism would be needed to delete these values.

Designing an "ideal" StructVal seems hard due to conflicting goals:
- making it fast (e.g. by simply using a pointer to the existing tuples)
- not exposing the actual layout to UDF developers to allow us to change it 
later
- making it easy to use inside Impala / for UDF developers

I don't want to block this change with this decision, so I would prefer to make 
it internal.


http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/udf/udf.h@745
PS7, Line 745: children
Can you add more info? What are the children exactly? Other AnyVals, or members 
of tuples? Does nullptr mean NULL?


http://gerrit.cloudera.org:8080/#/c/17638/7/be/src/udf/udf.h@770
PS7, Line 770: free
StringVal uses the alloc methods of FunctionContext - is there a reason for 
using malloc/free directly?

StringVal has no destructor, and AnyVal also doesn't have a virtual destructor, 
so I am not sure that this will be called in all cases.

See my comment in scalar-expr-evaluator.cc about a possible way to change 
memory management.


http://gerrit.cloudera.org:8080/#/c/17638/7/fe/src/main/java/org/apache/impala/analysis/SlotRef.java
File fe/src/main/java/org/apache/impala/analysis/SlotRef.java:

http://gerrit.cloudera.org:8080/#/c/17638/7/fe/src/main/java/org/apache/impala/analysis/SlotRef.java@156
PS7, Line 156: rootTable instanceof FeFsTable
Can you move this to a precondition? We shouldn't see structs if it is not 
FeFsTable


http://gerrit.cloudera.org:8080/#/c/17638/7/testdata/ComplexTypesTbl/structs.orc
File testdata/ComplexTypesTbl/structs.orc:

http://gerrit.cloudera.org:8080/#/c/17638/7/testdata/ComplexTypesTbl/structs.orc@1
PS7, Line 1: ORC
Do we need to have these data files, can't we create them with Hive? I think 
that in the long term table generation will be better, as it would allow us to 
have large tables without adding huge files to the Impala repository. The 
current tables are too small IMO to test things like parallelism and spilling.

If we want to keep them as files, I would prefer to move them to testdata/data 
(/ComplexTypesTbl ? ) and to be mention them in the README.



--
To view, visit http://gerrit.cloudera.org:8080/17638
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fbe56bdcd372b72e99c0195d87a818e7fa4bc3a
Gerrit-Change-Number: 17638
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Comment-Date: Fri, 06 Aug 2021 14:40:19 +0000
Gerrit-HasComments: Yes

Reply via email to