Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/12481 )
Change subject: IMPALA-7368: Add initial support for DATE type ...................................................................... Patch Set 12: (32 comments) http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@29 PS11, Line 29: : - Implicit casting between DATE and other typ > it's interesting that this differs from the behavior of TIMESTAMP today, at Implicit casting from STRING to TIMESTAMP works in Impala in most cases (parameters to function calls/operators, values in insert/select statements, etc). - Hive does these kind of implicit conversions too, - PostgreSQL on the other hand doesn't. - Couldn't find anything about STRING->TIMESTAMP implicit conversions in the SQL ANSI standard. Timestamp arithmetic expressions in Impala are an exception: in TimestampArithmeticExpr.java implicit conversions are explicitly forbidden. - Neither Hive nor PostgreSQL allow implicit conversions in arithmetic expressions. http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@38 PS11, Line 38: ution logic is not adequate anymore. : For example, it resolves the : if(false, '2011-01-01', DATE '1499-02-02') function call to the : if(BOOLEAN, TIMESTAMP, TIMESTA > wouldn't the result be the same either way for this example? Not necessarily. The valid range for TIMESTAMP is from 1400-01-01 to 9999-12-31. The valid range for DATE is from 0000-01-01 to 9999-12-31. Therefore: year(cast('1399-12-31' as TIMESTAMP)) returns NULL, whereas year(cast('1399-12-31' as DATE)) will return 1399. In Hive the valid range for TIMESTAMP and DATE is the same, so this is not an issue. http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@42 PS11, Line 42: DATE, DATE > how is this defined, specifically? I've changed the wording to describe the function resolution algorithm and related issues in more detail. http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@70 PS11, Line 70: iders overloaded function > It seems easier to leave it enabled so long as the current support is corre I agree with Tim, introducing a feature flag for DATE support would take a considerable effort since this patch touches lots of files. Testing whether turning off the feature flag works as expected (i.e. we get the same behavior and error messages as before) can also be complicated. Also, the patch includes a sql-parser change. Moving the parser change behind a feature flag is probably not that straightforward either. http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/codegen/codegen-anyval.cc File be/src/codegen/codegen-anyval.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/codegen/codegen-anyval.cc@68 PS11, Line 68: return cg->i64_type(); > why do we lower to i64 instead of i32 given the slot size is only 4 bytes? The highest 4 bytes store the actual value, the lowest byte is used for the "is_null" flag. Similarly, TYPE_INT represents an i32 value but its lowered type is i64. http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exec/hdfs-table-sink.cc File be/src/exec/hdfs-table-sink.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exec/hdfs-table-sink.cc@500 PS11, Line 500: (type.type == TYPE_DATE) { : ColumnDescriptor col_desc = table_desc_->col_descs()[num_clustering_cols + i]; : stringstream error_msg; > nit: why not just pass that string directly? Done http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/aggregate-functions-ir.cc@381 PS11, Line 381: s should not have a negative perfor > Yea, my feeling was just that one fewer feature to test and bug-fix and sup Removed avg(DATE) to avoid semantic problems. http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/conditional-functions-ir.cc File be/src/exprs/conditional-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/conditional-functions-ir.cc@37 PS11, Line 37: IS_NULL_COMPUTE_ > per comment elsewhere, I found it clearer in the old version since I didn't Done http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc@3386 PS11, Line 3386: : TEST_F(ExprTest, CastDateExprs) { > add some tests for invalid dates like feb 30 Done http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc@3475 PS11, Line 3475: TestError("cast(cast(null as date) as double)"); > can you test a before-the-epoch timestamp that includes a time portion? eg Done http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/date-value.h File be/src/runtime/date-value.h: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/date-value.h@40 PS11, Line 40: /// - Proleptic Gregorian calendar is used to calculate the number of days since epoch, : /// which can lead to different representation of historical dates compared to Hive. : /// (https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar) > I know this was from the original change a few months back, but should we b I didn't get the chance to test Hive 3.1, but yes it seems that Hive 3.1 switched to using Proleptic Gregorian calendar too. We could introduce a feature flag in Impala to switch between Julian/Gregorian and Proleptic Gregorian calendars. Please note though that this is not just a DATE issue: TIMESTAMP type already has the same calendar-incompatibility problem. If we want to fix this, I suggest introducing a feature flag that affects both TIMESTAMP & DATE types in a separate patch. http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/timestamp-value.inline.h@36 PS11, Line 36: return TimestampValue(EPOCH + boost::gregorian::date_duration(days), t); > is there any chance of exceptions in this boost API? I seem to recall there I've tested EPOCH + boost::gregorian::date_duration(days) with days outside of the valid [1400-01-01..9999.12.31] range and it doesn't throw an exception. If 'days' is outside of the supported range, L36 returns a TimestampValue instance where TimestampValue.HasDate() is false. http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/service/fe-support.cc File be/src/service/fe-support.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/service/fe-support.cc@638 PS11, Line 638: jclass fe_support_class, jstring date) { > I see 'caller_class' was copied from above, but that's a somewhat inaccurat Renamed it to 'fe_support_class' here and elsewhere. http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/udf/udf.h File be/src/udf/udf.h: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/udf/udf.h@599 PS11, Line 599: explicit DateVal(underlying_type_t val = 0) : val(val) { } > should be marked explicit? Done http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc File be/src/util/string-parser-test.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc@534 PS11, Line 534: // Test bad formats. > worth a check where there is a valid leading date followed by junk ("2010-0 Done http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc@544 PS11, Line 544: TestDateValue("2010-01-01 12:22:22", invalid_date, StringParser::PARSE_FAILURE); > what about special days like February 29th? Worth testing that in a year wh Done http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/symbols-util.cc File be/src/util/symbols-util.cc: http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/symbols-util.cc@129 PS11, Line 129: CASE_TYPE_APPEND_MANGLED_TOKEN(TYPE_SMALLINT, SmallIntVal) > I dunno that using the ITERATE_OVER is really any clearer than just repeati Done http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/Exprs.thrift File common/thrift/Exprs.thrift: http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/Exprs.thrift@58 PS11, Line 58: // String representation : 2: required string date_string > why is it necessary to include the string representation here too? I don't Having the string representation here is not absolutely necessary, it is included for convenience. When we create a DateLiteral object from a TDateLiteral object, it is beneficial to pass both the string and the i32 representation to the DateLiteral constructor. Otherwise, we would have to convert the i32 representation to the string representation on the Java side. This would require calling a native BE function from Java, which is somewhat inconvenient and I'd like to avoid it if possible. FLOAT and other similar types don't have this problem as they can always be converted to string in Java easily. TIMESTAMPs are handled differently as there is no way to directly specify a literal timestamp from SQL. http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/ImpalaInternalService.thrift@745 PS11, Line 745: since > typo Done http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java File fe/src/main/java/org/apache/impala/analysis/DateLiteral.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@61 PS11, Line 61: throw new AnalysisException("Error parsing date literal: " + e.getMessage(), e); > would this actually be due to an invalid date literal? or would this be tru True, this is the wrong error message. Replaced with "Error parsing.." which is what we use for similar errors in other places.. http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@65 PS11, Line 65: throw new AnalysisException("Invalid date literal: '" + strDate + "'"); > I think it's worth quoting the date literal they passed in so if they did s Done http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@66 PS11, Line 66: > nit: no need for else here since you threw above Done http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@103 PS11, Line 103: retur > nit: missing space Done http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java File fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java@83 PS11, Line 83: possible, replace STRING_LITERAL value with the corresponding : // DateLiteral object. > hrm... this part is spooking me a little bit. What happens if you have mult I've changed the wording, I hope it makes more sense now. L125-131 was added to solve the problem they you described: different string representations of the same date will be disambiguated and replaced with the corresponding DateLiteral object. 'partitionSpec_' is made immutable in L135 in PartitionSpec.analyze(). I've checked the FE code: PartitionSpec.getPartitionSpecKeyValues() is called only after PartitionSpec.analyze() is called, so the returned list should already be immutable. http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1148 PS11, Line 1148: > is this capability part of the SQL standard? It isn't, and Hive doesn't support it either. Removed it for now. http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@293 PS11, Line 293: t dates in the 0000-01-01..99 > should we instead cap at the number of unique dates in 0000-00-00 to 9999-1 True, fixed it. http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/Function.java File fe/src/main/java/org/apache/impala/catalog/Function.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/Function.java@191 PS11, Line 191: public boolean compare(Function other, CompareMode mode) { > Is there an ANSI standard for function resolution with implicit casts? Have I took a quick look at the SQL ANSI standard, but couldn't find anything about function name resolution. Hive takes a very different approach, it doesn't have overloaded functions in the same sense as Impala. Built-in functions take generic parameters and decide on their own what implicit casts to apply on their arguments and in which order. For instance: - add_months() used to convert its first argument to date (even if it was a timestamp string) until recently, but since HIVE-19370 it tries to do a timestamp-conversion first and falls back on date-conversion only if that fails. - date_add() on the other hand still does date-conversion only. This is probably a bug, but this is how it works right now. http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1248 PS11, Line 1248: "Scanning DATE values in table '" + table.getFullName() + > Can we include the table name here? Done http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@732 PS11, Line 732: TestResolveFunction > nit: here and elsewhere, the function names should start with a lower case Changed the 1st letter of function names to lower case in all the java files that I've touched in this patch. I ended up changing another 5K+ lines in 37 files. I will upload these changes in a separate patch-set. http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java File fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java@67 PS11, Line 67: testLiteralExprNegative("ABC", Type.DATE); > worth a negative test here for an invalid date like february 31? Done http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/ParserTest.java File fe/src/test/java/org/apache/impala/analysis/ParserTest.java: http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/ParserTest.java@1215 PS11, Line 1215: ParserError("select date '2011--01'"); > same, check for feb 31 Done http://gerrit.cloudera.org:8080/#/c/12481/10/tests/query_test/test_date_queries.py File tests/query_test/test_date_queries.py: http://gerrit.cloudera.org:8080/#/c/12481/10/tests/query_test/test_date_queries.py@107 PS10, Line 107: def test_impala_shell(self): > We don't need to test this for multiple file formats, right? Correct. Note, that this test will be executed only once, since 'test_impala_shell' doesn't take a 'vector' parameter. -- To view, visit http://gerrit.cloudera.org:8080/12481 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f Gerrit-Change-Number: 12481 Gerrit-PatchSet: 12 Gerrit-Owner: Attila Jeges <atti...@cloudera.com> Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Thu, 21 Mar 2019 18:47:06 +0000 Gerrit-HasComments: Yes