[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Csaba Ringhofer (Code Review) Thu, 13 Aug 2020 12:00:59 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )


Change subject: IMPALA-9482 Support for BINARY columns
......................................................................


Patch Set 6:

(23 comments)

Thanks for the comments and sorry for the huge time before resolving them!

Patch 3 is just a rebase with some conflict resolution.

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@18
PS2, Line 18: - No NDV is calculated during COMPUTE STATISTICS.
> Are we planning to change this? I think implicit casts from BINARY to STRIN
I have changed the casting rules to always need explicit casting between STRING 
and BINARY. Added more info in the commit message (line 42).


http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@47
PS2, Line 47: INSERT
> Do you plan to do interop tests between Hive and Impala?
We generally don't seem to have such tests, so I didn't add it here. I think it 
would be great to add some simple tests that test all types and fileformats 
with writer as Hive and reader as Impala + vica versa, but I would do this in 
jira.

Note that for some file formats the Hive writer -> Impala reader path is tested 
as Hive writes the tables during dataload.


http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@52
PS2, Line 52:   to test scanning.
> I think we want a lot more coverage in analysis tests, to make sure that we
I have added much more FE tests (similar ones as Attila added for the Date 
type) + a few to specifically check casts.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc
File be/src/exec/hbase-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc@49
PS2, Line 49: null
> nit: nullptr? I see there's a lot of NULLs in the file, so maybe just updat
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc
File be/src/exec/hbase-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@157
PS2, Line 157:             if (col_desc.auxType().IsBinaryStringSubtype()) {
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@159
PS2, Line 159: Base64Encode
> Why do we encode it for HBase?
This how Hive does it, BINARY columns are base64 encoded as in text files.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc
File be/src/exec/hdfs-rcfile-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc@593
PS2, Line 593:         const AuxColumnType& aux_type =
             :             scan_node_->hdfs_table()->GetColumnDesc(slot_desc).
> Yeah I wondered about this too. I think we should consider this in the cont
Yes, I was thinking similarly as what Tim wrote - it would be much nicer if 
string encoding could be handled mainly in the frontend and the backend would 
just need to think about byte arrays. Probably VARCHAR could be also handled 
this way, as AFAIK most code do not enforce the length limit, only 
scanners/writers/client handlers need to deal with it.

Another type that could probably profit from AuxType is timestamp, if we'll 
implement TIMESTAMP WITH (LOCAL) TIMEZONE. It should be probably the same data 
type underneath, but the frontend could pick different functions where the 
conversion matters + scanners would also need the info as different files may 
be written in different timezones.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@103
PS2, Line 103:             const ColumnDescriptor& col_desc =
> line too long (96 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@105
PS2, Line 105:             const StringValue* string_value = 
reinterpret_cast<const StringValue*>(value);
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@107
PS2, Line 107:               // TODO: try to find a more efficient imlementation
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h
File be/src/runtime/types.h:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h@67
PS2, Line 67: Aux
> nit: 'Aux' is not too descriptive. Probably EncodedColumnType/StoredColumnT
I didn't change it in the last patch to avoid the additional noise, but I am 
not attached to the current name.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@122
PS2, Line 122: ToThrift(PrimitiveType ptype)
> Should this function work as the inverse of ThriftToType() ?
It makes sense to create a version which also uses AuxType, but this was not 
needed so far. Where the backend needs to send the full type info somewhere as 
thrift, it seems to have the original thrift structures so no reverse 
conversion is needed.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@180
PS2, Line 180:   DCHECK_EQ(1, type.types.size());
> nit: I'd move this comment right before the switch to L186.
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@185
PS2, Line 185:   AuxColumnType aux_type(type);
> nit: here and elsewhere, I'm seeing a bunch of camelCase variable names in
done here and hopefully everywhere where it's needed


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@222
PS2, Line 222: ColumnType::ToThrift(TColumnType* thrift_type)
> Same as above, maybe it needs now an additional AuxColumnType parameter?
Same as above.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc@289
PS2, Line 289: HS2TColumn(ScalarExprEvaluator* expr_eval, RowBatch* batch,
             :     int start_id
> nit: refactor to a common code that takes the Thrift field as parameter?
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc@729
PS2, Line 729: se TYPE_BINARY:
             :     default:
> Shouldn't you change this comment to reflect that now BINARY can also trigg
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@126
PS2, Line 126:        // Disable casting from string to boolean
             :         if (fromType.isStringType() && toType.isBoolean()) 
continue;
             :         // Casting from date is only allowed when to-type is 
timestamp or string.
             :         if (fromType.isDate() && !toType.isTimestamp() && 
!toType.isStringType()) {
             :
> BINARY<->STRING conversions are no-op conversins, so maybe this block shoul
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/catalog/Type.java
File fe/src/main/java/org/apache/impala/catalog/Type.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/catalog/Type.java@820
PS2, Line 820: // STRING <->  BINARY conversion is not lossy, but implicit cast 
is not allowed
> I'm probably misunderstanding something but the commit msg suggests that BI
I added more info about the casts to the commit message. The last patch set 
compatibility of BINARY <-> STRING to INVALID_TYPE to avoid casting implicitly 
from BINARY to STRING in function calls.


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
File fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java@154
PS2, Line 154: case BINARY: return Schema.create(Schema.Type.BYTES);
> I'm not sure about this, maybe binary should be converted to avro bytes. Th
Thanks, I changed it to bytes. I tried it that way first, but there was some 
error during decoding, which doesn't occur now, so I must have made a mistake 
somewhere.


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@263
PS2, Line 263:
> This comment is no longer valid, right?
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/16066/2/testdata/bin/generate-schema-statements.py@213
PS2, Line 213: PERTIE
> Again, avro has bytes type which might be a better fit for binary.
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py
File tests/common/impala_connection.py:

http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py@426
PS2, Line 426:
> nit: typo
Done



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 13 Aug 2020 18:59:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

Reply via email to