Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 )
Change subject: IMPALA-9482 Support for BINARY columns ...................................................................... Patch Set 6: (23 comments) Thanks for the comments and sorry for the huge time before resolving them! Patch 3 is just a rebase with some conflict resolution. http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@18 PS2, Line 18: - No NDV is calculated during COMPUTE STATISTICS. > Are we planning to change this? I think implicit casts from BINARY to STRIN I have changed the casting rules to always need explicit casting between STRING and BINARY. Added more info in the commit message (line 42). http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@47 PS2, Line 47: INSERT > Do you plan to do interop tests between Hive and Impala? We generally don't seem to have such tests, so I didn't add it here. I think it would be great to add some simple tests that test all types and fileformats with writer as Hive and reader as Impala + vica versa, but I would do this in jira. Note that for some file formats the Hive writer -> Impala reader path is tested as Hive writes the tables during dataload. http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@52 PS2, Line 52: to test scanning. > I think we want a lot more coverage in analysis tests, to make sure that we I have added much more FE tests (similar ones as Attila added for the Date type) + a few to specifically check casts. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc File be/src/exec/hbase-scan-node.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc@49 PS2, Line 49: null > nit: nullptr? I see there's a lot of NULLs in the file, so maybe just updat Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc File be/src/exec/hbase-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@157 PS2, Line 157: if (col_desc.auxType().IsBinaryStringSubtype()) { > line too long (92 > 90) Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@159 PS2, Line 159: Base64Encode > Why do we encode it for HBase? This how Hive does it, BINARY columns are base64 encoded as in text files. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc@593 PS2, Line 593: const AuxColumnType& aux_type = : scan_node_->hdfs_table()->GetColumnDesc(slot_desc). > Yeah I wondered about this too. I think we should consider this in the cont Yes, I was thinking similarly as what Tim wrote - it would be much nicer if string encoding could be handled mainly in the frontend and the backend would just need to think about byte arrays. Probably VARCHAR could be also handled this way, as AFAIK most code do not enforce the length limit, only scanners/writers/client handlers need to deal with it. Another type that could probably profit from AuxType is timestamp, if we'll implement TIMESTAMP WITH (LOCAL) TIMEZONE. It should be probably the same data type underneath, but the frontend could pick different functions where the conversion matters + scanners would also need the info as different files may be written in different timezones. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc File be/src/exec/hdfs-text-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@103 PS2, Line 103: const ColumnDescriptor& col_desc = > line too long (96 > 90) Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@105 PS2, Line 105: const StringValue* string_value = reinterpret_cast<const StringValue*>(value); > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@107 PS2, Line 107: // TODO: try to find a more efficient imlementation > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h File be/src/runtime/types.h: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h@67 PS2, Line 67: Aux > nit: 'Aux' is not too descriptive. Probably EncodedColumnType/StoredColumnT I didn't change it in the last patch to avoid the additional noise, but I am not attached to the current name. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc File be/src/runtime/types.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@122 PS2, Line 122: ToThrift(PrimitiveType ptype) > Should this function work as the inverse of ThriftToType() ? It makes sense to create a version which also uses AuxType, but this was not needed so far. Where the backend needs to send the full type info somewhere as thrift, it seems to have the original thrift structures so no reverse conversion is needed. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@180 PS2, Line 180: DCHECK_EQ(1, type.types.size()); > nit: I'd move this comment right before the switch to L186. Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@185 PS2, Line 185: AuxColumnType aux_type(type); > nit: here and elsewhere, I'm seeing a bunch of camelCase variable names in done here and hopefully everywhere where it's needed http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@222 PS2, Line 222: ColumnType::ToThrift(TColumnType* thrift_type) > Same as above, maybe it needs now an additional AuxColumnType parameter? Same as above. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc File be/src/service/hs2-util.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc@289 PS2, Line 289: HS2TColumn(ScalarExprEvaluator* expr_eval, RowBatch* batch, : int start_id > nit: refactor to a common code that takes the Thrift field as parameter? Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc@729 PS2, Line 729: se TYPE_BINARY: : default: > Shouldn't you change this comment to reflect that now BINARY can also trigg Done http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java File fe/src/main/java/org/apache/impala/analysis/CastExpr.java: http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@126 PS2, Line 126: // Disable casting from string to boolean : if (fromType.isStringType() && toType.isBoolean()) continue; : // Casting from date is only allowed when to-type is timestamp or string. : if (fromType.isDate() && !toType.isTimestamp() && !toType.isStringType()) { : > BINARY<->STRING conversions are no-op conversins, so maybe this block shoul Done http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/catalog/Type.java File fe/src/main/java/org/apache/impala/catalog/Type.java: http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/catalog/Type.java@820 PS2, Line 820: // STRING <-> BINARY conversion is not lossy, but implicit cast is not allowed > I'm probably misunderstanding something but the commit msg suggests that BI I added more info about the casts to the commit message. The last patch set compatibility of BINARY <-> STRING to INVALID_TYPE to avoid casting implicitly from BINARY to STRING in function calls. http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java File fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java: http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java@154 PS2, Line 154: case BINARY: return Schema.create(Schema.Type.BYTES); > I'm not sure about this, maybe binary should be converted to avro bytes. Th Thanks, I changed it to bytes. I tried it that way first, but there was some error during decoding, which doesn't occur now, so I must have made a mistake somewhere. http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java: http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@263 PS2, Line 263: > This comment is no longer valid, right? Done http://gerrit.cloudera.org:8080/#/c/16066/2/testdata/bin/generate-schema-statements.py File testdata/bin/generate-schema-statements.py: http://gerrit.cloudera.org:8080/#/c/16066/2/testdata/bin/generate-schema-statements.py@213 PS2, Line 213: PERTIE > Again, avro has bytes type which might be a better fit for binary. Done http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py File tests/common/impala_connection.py: http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py@426 PS2, Line 426: > nit: typo Done -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 6 Gerrit-Owner: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 13 Aug 2020 18:59:58 +0000 Gerrit-HasComments: Yes
