Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/12481 )
Change subject: IMPALA-7368: Add initial support for DATE type ...................................................................... Patch Set 3: (19 comments) Thanks for the reviews! While I was going through the comments, I realized that partitioning by DATE wasn't properly implemented. I fixed those issues as well and added some extra tests to cover partitioning functionality. http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@11 PS3, Line 11: of values supported for the DATE type is 0000-01-01 to 9999-12-31. > Can you mention the literal syntax supported for date values? Done http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@22 PS3, Line 22: - from BIGINT, INT, SMALLINT, TINYINT to DATE. The source value is : interpreted as a number of days since the epoch. : - from DATE to BIGINT, INT, SMALLINT, TINYINT. The resulting : integer is the number of days since epoch. : - from DOUBLE, FLOAT, DECIMAL to DATE. The source value's : fractional part is ignored, the integer part is interpreted as a : number of days since epoch. : - from DATE to DOUBLE, FLOAT, DECIMAL. The resulting value is the : number of days since epoch. > Hive does not support these casts (I checked with Hive 2.1.1), are you sure I introduced them as all these conversions are supported for timestamps. On the other hand if Hive doesn't support them, probably we shouldn't either. Removed them. http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@35 PS3, Line 35: - Implicit casting between DATE and other types: > Do you plan to support the DATE 'yyyy-mm-dd' literal syntax? It looks like Good idea, added support for that. http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@40 PS3, Line 40: - Since both STRING -> DATE and STRING -> TIMESTAMP implicit > Do you have an example of where this is required? Added more information with some examples. http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@63 PS3, Line 63: tests/query_test/test_date_queries.py. > Can we add a simple test that runs a query returning date through impala-sh Added an extra test to test_date_queries.py. http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@64 PS3, Line 64: > I think we should plan on adding DATE support to the random query generator Added an extra section to the commit msg to describe future plans. http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exec/hdfs-scan-node-base.cc@644 PS3, Line 644: // TODO: Remove this block once DATE type is supported accross all file formats. : if (has_materialized_date_slot_ && partition->file_format() != THdfsFileFormat::TEXT) { : context->ClearStreams(); : return Status("DATE type is only supported with TEXT file format."); : } > I think that this logic should be implemented in frontend, and the backend I think it would be much more difficult (or impossible) to do this in the FE. Imagine joining a table to a partitioned table on a partition column, where each partition has a different file format. I don't think it is possible to figure out in the analyze/plan phase whether non-text partitions will be scanned or not. http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exprs/cast-functions-ir.cc File be/src/exprs/cast-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exprs/cast-functions-ir.cc@310 PS3, Line 310: int year, month, day; : if (UNLIKELY(!dv.ToYearMonthDay(&year, &month, &day))) return TimestampVal::null(); : if (year < 1400 || year > 9999) return TimestampVal::null(); : : const boost::gregorian::date d(year, month, day); : const boost::posix_time::time_duration t(0, 0, 0, 0); : TimestampValue tv(d, t); > performance: this could be done much faster, as both DateVal and TimestampV Done http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exprs/cast-functions-ir.cc@330 PS3, Line 330: if (val.is_null) return DateVal::null(); : TimestampValue tv = TimestampValue::FromTimestampVal(val); : if (UNLIKELY(!tv.HasDate())) return DateVal::null(); : : const boost::gregorian::date d = tv.date(); : const DateValue dv(d.year(), d.month(), d.day()); : return dv.ToDateVal(); > same as in line 310: this could be done with a simple addition. There is a Done http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exprs/slot-ref.cc File be/src/exprs/slot-ref.cc: http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exprs/slot-ref.cc@485 PS3, Line 485: const DateValue dv(*reinterpret_cast<int32_t*>(t->GetSlot(slot_offset_)) > Why don't we reinterpret_cast to DateValue* directly? This would avoid the This was intentional: DateValue uses a specific int32_t value (DateValue::INVALID_DAYS_SINCE_EPOCH) to represent invalid dates. *reinterpret_cast<int32_t*>(t->GetSlot(slot_offset_) may return an invalid value that differs from INVALID_DAYS_SINCE_EPOCH http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/udf/udf.h File be/src/udf/udf.h: http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/udf/udf.h@586 PS3, Line 586: /// Date value represented as days since epoch 1970-01-01. > The valid range and the fact that the value is interpreted as proleptic Gre Done http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/util/string-parser-test.cc File be/src/util/string-parser-test.cc: http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/util/string-parser-test.cc@526 PS3, Line 526: TestDateValue("0000-01-01", DateValue(0, 1, 1), StringParser::PARSE_SUCCESS); : TestDateValue("9999-12-31", DateValue(9999, 12, 31), StringParser::PARSE_SUCCESS); > Please add the "wrong side" of the edge values: Done http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/util/string-parser-test.cc@531 PS3, Line 531: TestDateValue("2-11-10", invalid_date, StringParser::PARSE_FAILURE); > Please add tests where month/day have the correct number of characters but Done http://gerrit.cloudera.org:8080/#/c/12481/3/common/thrift/Exprs.thrift File common/thrift/Exprs.thrift: http://gerrit.cloudera.org:8080/#/c/12481/3/common/thrift/Exprs.thrift@56 PS3, Line 56: // string representation of date formatted as yyyy-MM-dd. : 1: required string value; > Why do you use string instead of i32 to represent the date? TTimestampLiter I've kept the string member and added the i32 value to this struct. TDateLiteral is the thrift representation od DateLiteral Java class. DateLiteral is used to represent DATE 'yyyy-MM-dd' date literals. When we create a DateLiteral object from a TDateLiteral object, it is beneficial to pass both the string and the i32 representation to the DateLiteral constructor. Otherwise, we would have to convert the i32 representation to the string representation on the Java side. Such a conversion could potentially be inconsistent with what the BE would produce. TIMESTAMPs are handled differently as there is no way to directly specify a literal timestamp from SQL. http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java File fe/src/main/java/org/apache/impala/analysis/DateLiteral.java: http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@32 PS3, Line 32: yyy > Isn't it yyyy? Done http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/main/java/org/apache/impala/analysis/StringLiteral.java File fe/src/main/java/org/apache/impala/analysis/StringLiteral.java: http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/main/java/org/apache/impala/analysis/StringLiteral.java@155 PS3, Line 155: isDateType > By implementing the Date type the name of this function became very mislead Done http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java: http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java@194 PS3, Line 194: AnalyzesOk("select cast (cast ('1970-10-10 10:00:00.123' as timestamp) as date)"); > What will happen if some writes Yes, it returns NULL w/o a warning. This is how explicit casts from string work in impala in general. E.g.: select cast('123.0' as int); also returns NULL w/o a warning. http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java File fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java: http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java@368 PS3, Line 368: assertToSqlWithImplicitCasts(ctx, "select * from functional.alltypes, " : + "functional.date_tbl where timestamp_col = date_col", : "SELECT * FROM functional.alltypes, functional.date_tbl " : + "WHERE timestamp_col = CAST(date_col AS TIMESTAMP)"); > I saw some weirdness around this in Hive, it seems to convert everything to Interesting, in postgresql and mysql: select cast('1970-01-01' as timestamp) > cast('1970-01-01' as date); returns false. This could be a Hive bug. http://gerrit.cloudera.org:8080/#/c/12481/3/tests/query_test/test_date_queries.py File tests/query_test/test_date_queries.py: http://gerrit.cloudera.org:8080/#/c/12481/3/tests/query_test/test_date_queries.py@18 PS3, Line 18: decimal > date Done -- To view, visit http://gerrit.cloudera.org:8080/12481 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f Gerrit-Change-Number: 12481 Gerrit-PatchSet: 3 Gerrit-Owner: Attila Jeges <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Fri, 22 Feb 2019 18:09:30 +0000 Gerrit-HasComments: Yes
