Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )
Change subject: WIP IMPALA-9482 Support for BINARY columns ...................................................................... Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@18 PS2, Line 18: - UDF/UDAFs that expect STRING argument accept BINARY too, while in Are we planning to change this? I think implicit casts from BINARY to STRING are a can of worms that we should really avoid, cause it'll cause major headaches when we start treating STRING as UTF-8. With CHAR, a similar thing of automatically promoting to string was attempted to get the builtin functions "for free" but it didn't really work out because the semantics aren't actually the same. http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@52 PS2, Line 52: - Added a basic coverage in FE/EE tests, will continue adding new I think we want a lot more coverage in analysis tests, to make sure that we're doing all the implicit casts correctly (e.g. when invoking functions with BINARY and a different type). http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc@593 PS2, Line 593: const ColumnDescriptor& col_desc = : scan_node_->hdfs_table()->GetColumnDesc(slot_desc); > Wouldn't it be better to have a BINARY type in the BE so we could save pass Yeah I wondered about this too. I think we should consider this in the context of UTF-8 and what we'd want to do there. I think the situation for UTF-8 is actually similar, in that most of the string operations are not changed, and most of the changes would be in the frontend, to call a UTF-8-aware version of a function instead of the current ASCII function. UTF-8 is also orthogonal to CHAR/VARCHAR/STRING, so having the encoding auxiliary to the type does have some advantages. It's similar in that the writers and readers need to know about it to add file metadata and validate the data they're reading/writing. So I think the infrastructure for auxiliary column info that this patch adds is actually useful for future work. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc File be/src/runtime/types.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@185 PS2, Line 185: PrimitiveType primitiveType = ThriftToType(col_type); nit: here and elsewhere, I'm seeing a bunch of camelCase variable names in C++ where it should be either CamelCase or separated_by_underscores. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 2 Gerrit-Owner: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 16 Jun 2020 17:24:22 +0000 Gerrit-HasComments: Yes
