Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12481 )
Change subject: IMPALA-7368: Add initial support for DATE type ...................................................................... Patch Set 11: (32 comments) nice work, pretty impressive test coverage. Mostly small items here. http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@29 PS11, Line 29: - from STRING to DATE if the source string value is used in a : context where a DATE value is expected. it's interesting that this differs from the behavior of TIMESTAMP today, at least for timestamp arithmetic expressions: SELECT '2019-01-01 12;12:12' + interval 1 day does not do an implicit conversion. (interestingly it does in MySQL). Does the ANSI standard give rules for this? http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@38 PS11, Line 38: E.g: year('2019-02-15') must resolve to : year(TIMESTAMP) instead of year(DATE). Note, that year(DATE) is : not implemented yet, so this is not an issue at the moment but : it will be in the future. wouldn't the result be the same either way for this example? http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@42 PS11, Line 42: better fit how is this defined, specifically? http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@70 PS11, Line 70: complete DATE type implementation should we consider turning it off by default and only enabling it once we're sure it's stable and correct? http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@72 PS11, Line 72: - Add date support to the random query generator. should we transition TPC-DS test tables over to 'date' instead of 'string' for the appropriate columns? http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/codegen/codegen-anyval.cc File be/src/codegen/codegen-anyval.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/codegen/codegen-anyval.cc@68 PS11, Line 68: return cg->i64_type(); why do we lower to i64 instead of i32 given the slot size is only 4 bytes? seems like we're using int32 below http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exec/hdfs-table-sink.cc@500 PS11, Line 500: stringstream error_msg; : error_msg << "Cannot write DATE column to a PARQUET table."; : return Status(error_msg.str()); nit: why not just pass that string directly? Could we make this better by including the column name from the table desc? http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/aggregate-functions-ir.cc@381 PS11, Line 381: val_struct->sum / val_struct->count Is this correct behavior when the dates are negative? Dividing as signed integers means we'll "truncate towards zero" which means that average of (1969-01-01, 1969-01-02) would be 1969-01-02 whereas average of 1971-01-01 and 1971-01-02 would be 1971-01-01. I think that's surprising because it exposes the epoch reference point in the semantics of the operation. Does the ANSI standard have anything to say here? Checking briefly on sqlfiddle: - postgres 9.6 doesn't support AVG(date) at all. Neither does Oracle 11g R2 nor SQL Server 2017. - mysql 5.6 implicitly casts the dates to an DECIMIAL and return something weird like 19690101.5. Not sure what's going on under the covers. - hive 3.1 doesn't support AVG(date) either Maybe we should remove this functionality for now and consider adding it back based on user demand? http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/conditional-functions-ir.cc File be/src/exprs/conditional-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/conditional-functions-ir.cc@37 PS11, Line 37: ITERATE_OVER_SEQ per comment elsewhere, I found it clearer in the old version since I didn't need to understand boost magic http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc@3386 PS11, Line 3386: : TEST_F(ExprTest, CastDateExprs) { add some tests for invalid dates like feb 30 http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc@3475 PS11, Line 3475: TestDateValue("cast(cast('1400-01-01 00:00:00' as timestamp) as date)", can you test a before-the-epoch timestamp that includes a time portion? eg '1960-01-01 23:59' and make sure it truncates "down" and not truncate "towards the epoch"? http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/date-value.h File be/src/runtime/date-value.h: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/date-value.h@40 PS11, Line 40: /// - Proleptic Gregorian calendar is used to calculate the number of days since epoch, : /// which can lead to different representation of historical dates compared to Hive. : /// (https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar) I know this was from the original change a few months back, but should we be concerned about this? Should we have a compatibility mode even if there is a performance hit? Doing some research online it seems like hive might have had different behavior previously but then switched to proleptic gregorian as of 3.1? http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/timestamp-value.inline.h@36 PS11, Line 36: return TimestampValue(EPOCH + boost::gregorian::date_duration(days), t); is there any chance of exceptions in this boost API? I seem to recall there were some unexpected exceptions thrown by boost::date_time stuff http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/service/fe-support.cc File be/src/service/fe-support.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/service/fe-support.cc@638 PS11, Line 638: jclass caller_class, jstring date) { I see 'caller_class' was copied from above, but that's a somewhat inaccurate name. Really the class here is the FeSupport class itself, given this is a static method, right? http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/udf/udf.h File be/src/udf/udf.h: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/udf/udf.h@599 PS11, Line 599: DateVal(underlying_type_t val = 0) : val(val) { } should be marked explicit? http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc File be/src/util/string-parser-test.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc@534 PS11, Line 534: // Test bad formats. worth a check where there is a valid leading date followed by junk ("2010-01-01 foo") as well as one where it's a full valid timestamp ("2010-01-01 12:22:22") to make sure we get the expected results http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc@544 PS11, Line 544: // Test invalid month/day values. what about special days like February 29th? Worth testing that in a year when it doesn't exist and a year when it exists, including perhaps some "exception" leap years? ie 2000-02-29 does exist but 2100-02-29 doesn't http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/symbols-util.cc File be/src/util/symbols-util.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/symbols-util.cc@129 PS11, Line 129: ITERATE_OVER_SEQ2(CASE_TYPE_APPEND_MANGLED_TOKEN, I dunno that using the ITERATE_OVER is really any clearer than just repeating the macro N times and not even using BOOST_PP_STRINGIZE: CASE_TYPE_APPEND_MANGLED_TOKEN(TYPE_BOOLEAN, "BooleanVal"); CASE_TYPE_APPEND_MANGLED_TOKEN(TYPE_TINYINT, "TinyIntVal"); ... it's only slightly more typing and I think less cognitive overhead for someone to go figure out what this weird boost macro magic is doing http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/Exprs.thrift File common/thrift/Exprs.thrift: http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/Exprs.thrift@58 PS11, Line 58: // String representation : 2: required string date_string why is it necessary to include the string representation here too? I don't see a similar field in other literal expr types that have to be "parsed" and may not round-trip to exactly the same string the user input (timestamp, float, etc) http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/ImpalaInternalService.thrift@745 PS11, Line 745: sinxce typo http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java File fe/src/main/java/org/apache/impala/analysis/DateLiteral.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@61 PS11, Line 61: throw new AnalysisException("Invalid date literal: " + e.getMessage(), e); would this actually be due to an invalid date literal? or would this be truly a bug? maybe makes sense to have a different error message here due to an internal error? http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@65 PS11, Line 65: throw new AnalysisException("Invalid date literal: " + strDate); I think it's worth quoting the date literal they passed in so if they did something like DATE ' ' then they will get a readable error message http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@66 PS11, Line 66: else nit: no need for else here since you threw above http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@103 PS11, Line 103: return nit: missing space http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java File fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java@83 PS11, Line 83: This is necessary as a DATE literal can have different string : // representations, but all these representations refer to the same partition hrm... this part is spooking me a little bit. What happens if you have multiple partitions with different representations for the same date? Also, are you sure we aren't relying on immutability of the partitionSpec_ list anywhere? Is there any other way we can accomplish this? http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1148 PS11, Line 1148: // Avg(Date) is this capability part of the SQL standard? http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@293 PS11, Line 293: LongMath.pow(2, Integer.SIZE) should we instead cap at the number of unique dates in 0000-00-00 to 9999-12-31? I think this is smaller than 2^32 right? http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/Function.java File fe/src/main/java/org/apache/impala/catalog/Function.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/Function.java@191 PS11, Line 191: public boolean compare(Function other, CompareMode mode) { Is there an ANSI standard for function resolution with implicit casts? Have you checked what Hive does here? It would be no good if we were inconsistent http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1248 PS11, Line 1248: "Scanning DATE values is not supported for non-text fileformats "); Can we include the table name here? http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@732 PS11, Line 732: TestResolveFunction nit: here and elsewhere, the function names should start with a lower case letter in java http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java File fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java@67 PS11, Line 67: testLiteralExprNegative("ABC", Type.DATE); worth a negative test here for an invalid date like february 31? http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/ParserTest.java File fe/src/test/java/org/apache/impala/analysis/ParserTest.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/ParserTest.java@1215 PS11, Line 1215: ParserError("select date '2011--01'"); same, check for feb 31 -- To view, visit http://gerrit.cloudera.org:8080/12481 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f Gerrit-Change-Number: 12481 Gerrit-PatchSet: 11 Gerrit-Owner: Attila Jeges <atti...@cloudera.com> Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Thu, 14 Mar 2019 04:22:12 +0000 Gerrit-HasComments: Yes