Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@18
PS2, Line 18:   - UDF/UDAFs that expect STRING argument accept BINARY too, 
while in
Are we planning to change this? I think implicit casts from BINARY to STRING 
are a can of worms that we should really avoid, cause it'll cause major 
headaches when we start treating STRING as UTF-8.

With CHAR, a similar thing of automatically promoting to string was attempted 
to get the builtin functions "for free" but it didn't really work out because 
the semantics aren't actually the same.


http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@52
PS2, Line 52: - Added a basic coverage in FE/EE tests, will continue adding new
I think we want a lot more coverage in analysis tests, to make sure that we're 
doing all the implicit casts correctly (e.g. when invoking functions with 
BINARY and a different type).


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc
File be/src/exec/hdfs-rcfile-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc@593
PS2, Line 593:         const ColumnDescriptor& col_desc =
             :             scan_node_->hdfs_table()->GetColumnDesc(slot_desc);
> Wouldn't it be better to have a BINARY type in the BE so we could save pass
Yeah I wondered about this too. I think we should consider this in the context 
of UTF-8 and what we'd want to do there.

I think the situation for UTF-8 is actually similar, in that most of the string 
operations are not changed, and most of the changes would be in the frontend, 
to call a UTF-8-aware version of a function instead of the current ASCII 
function.

UTF-8 is also orthogonal to CHAR/VARCHAR/STRING, so having the encoding 
auxiliary to the type does have some advantages.

It's similar in that the writers and readers need to know about it to add file 
metadata and validate the data they're reading/writing.

So I think the infrastructure for auxiliary column info that this patch adds is 
actually useful for future work.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@185
PS2, Line 185:   PrimitiveType primitiveType = ThriftToType(col_type);
nit: here and elsewhere, I'm seeing a bunch of camelCase variable names in C++ 
where it should be either CamelCase or separated_by_underscores.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 16 Jun 2020 17:24:22 +0000
Gerrit-HasComments: Yes

Reply via email to