Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12481 )

Change subject: IMPALA-7368: Add initial support for DATE type
......................................................................


Patch Set 3:

(19 comments)

Thanks for the reviews!

While I was going through the comments, I realized that partitioning by DATE 
wasn't properly implemented. I fixed those issues as well and added some extra 
tests to cover partitioning functionality.

http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@11
PS3, Line 11: of values supported for the DATE type is 0000-01-01 to 9999-12-31.
> Can you mention the literal syntax supported for date values?
Done


http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@22
PS3, Line 22:     - from BIGINT, INT, SMALLINT, TINYINT to DATE. The source 
value is
            :       interpreted as a number of days since the epoch.
            :     - from DATE to BIGINT, INT, SMALLINT, TINYINT. The resulting
            :       integer is the number of days since epoch.
            :     - from DOUBLE, FLOAT, DECIMAL to DATE. The source value's
            :       fractional part is ignored, the integer part is interpreted 
as a
            :       number of days since epoch.
            :     - from DATE to DOUBLE, FLOAT, DECIMAL. The resulting value is 
the
            :       number of days since epoch.
> Hive does not support these casts (I checked with Hive 2.1.1), are you sure
I introduced them as all these conversions are supported for timestamps. On the 
other hand if Hive doesn't support them, probably we shouldn't either.
Removed them.


http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@35
PS3, Line 35: - Implicit casting between DATE and other types:
> Do you plan to support the DATE 'yyyy-mm-dd' literal syntax? It looks like
Good idea, added support for that.


http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@40
PS3, Line 40: - Since both STRING -> DATE and STRING -> TIMESTAMP implicit
> Do you have an example of where this is required?
Added more information with some examples.


http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@63
PS3, Line 63:       tests/query_test/test_date_queries.py.
> Can we add a simple test that runs a query returning date through impala-sh
Added an extra test to test_date_queries.py.


http://gerrit.cloudera.org:8080/#/c/12481/3//COMMIT_MSG@64
PS3, Line 64:
> I think we should plan on adding DATE support to the random query generator
Added an extra section to the commit msg to describe future plans.


http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exec/hdfs-scan-node-base.cc@644
PS3, Line 644:   // TODO: Remove this block once DATE type is supported accross 
all file formats.
             :   if (has_materialized_date_slot_ && partition->file_format() != 
THdfsFileFormat::TEXT) {
             :     context->ClearStreams();
             :     return Status("DATE type is only supported with TEXT file 
format.");
             :   }
> I think that this logic should be implemented in frontend, and the backend
I think it would be much more difficult (or impossible) to do this in the FE.

Imagine joining a table to a partitioned table on a partition column, where 
each partition has a different file format. I don't think it is possible to 
figure out in the analyze/plan phase whether non-text partitions will be 
scanned or not.


http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exprs/cast-functions-ir.cc
File be/src/exprs/cast-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exprs/cast-functions-ir.cc@310
PS3, Line 310:   int year, month, day;
             :   if (UNLIKELY(!dv.ToYearMonthDay(&year, &month, &day))) return 
TimestampVal::null();
             :   if (year < 1400 || year > 9999) return TimestampVal::null();
             :
             :   const boost::gregorian::date d(year, month, day);
             :   const boost::posix_time::time_duration t(0, 0, 0, 0);
             :   TimestampValue tv(d, t);
> performance: this could be done much faster, as both DateVal and TimestampV
Done


http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exprs/cast-functions-ir.cc@330
PS3, Line 330:   if (val.is_null) return DateVal::null();
             :   TimestampValue tv = TimestampValue::FromTimestampVal(val);
             :   if (UNLIKELY(!tv.HasDate())) return DateVal::null();
             :
             :   const boost::gregorian::date d = tv.date();
             :   const DateValue dv(d.year(), d.month(), d.day());
             :   return dv.ToDateVal();
> same as in line 310: this could be done with a simple addition. There is a
Done


http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exprs/slot-ref.cc
File be/src/exprs/slot-ref.cc:

http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/exprs/slot-ref.cc@485
PS3, Line 485:   const DateValue 
dv(*reinterpret_cast<int32_t*>(t->GetSlot(slot_offset_))
> Why don't we reinterpret_cast to DateValue* directly? This would avoid the
This was intentional: DateValue uses a specific int32_t value 
(DateValue::INVALID_DAYS_SINCE_EPOCH) to represent invalid dates.

*reinterpret_cast<int32_t*>(t->GetSlot(slot_offset_) may return an invalid 
value that differs from INVALID_DAYS_SINCE_EPOCH


http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/udf/udf.h
File be/src/udf/udf.h:

http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/udf/udf.h@586
PS3, Line 586: /// Date value represented as days since epoch 1970-01-01.
> The valid range and the fact that the value is interpreted as proleptic Gre
Done


http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/util/string-parser-test.cc
File be/src/util/string-parser-test.cc:

http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/util/string-parser-test.cc@526
PS3, Line 526:   TestDateValue("0000-01-01", DateValue(0, 1, 1), 
StringParser::PARSE_SUCCESS);
             :   TestDateValue("9999-12-31", DateValue(9999, 12, 31), 
StringParser::PARSE_SUCCESS);
> Please add the "wrong side" of the edge values:
Done


http://gerrit.cloudera.org:8080/#/c/12481/3/be/src/util/string-parser-test.cc@531
PS3, Line 531:   TestDateValue("2-11-10", invalid_date, 
StringParser::PARSE_FAILURE);
> Please add tests where month/day have the correct number of characters but
Done


http://gerrit.cloudera.org:8080/#/c/12481/3/common/thrift/Exprs.thrift
File common/thrift/Exprs.thrift:

http://gerrit.cloudera.org:8080/#/c/12481/3/common/thrift/Exprs.thrift@56
PS3, Line 56:   // string representation of date formatted as yyyy-MM-dd.
            :   1: required string value;
> Why do you use string instead of i32 to represent the date? TTimestampLiter
I've kept the string member and added the i32 value to this struct.

TDateLiteral is the thrift representation od DateLiteral Java class. 
DateLiteral is used to represent DATE 'yyyy-MM-dd' date literals.

When we create a DateLiteral object from a TDateLiteral object, it is 
beneficial to pass both the string and the i32 representation to the 
DateLiteral constructor. Otherwise, we would have to convert the i32 
representation to the string representation on the Java side. Such a conversion 
could potentially be inconsistent with what the BE would produce.

TIMESTAMPs are handled differently as there is no way to directly specify a 
literal timestamp from SQL.


http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java
File fe/src/main/java/org/apache/impala/analysis/DateLiteral.java:

http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@32
PS3, Line 32: yyy
> Isn't it yyyy?
Done


http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/main/java/org/apache/impala/analysis/StringLiteral.java
File fe/src/main/java/org/apache/impala/analysis/StringLiteral.java:

http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/main/java/org/apache/impala/analysis/StringLiteral.java@155
PS3, Line 155: isDateType
> By implementing the Date type the name of this function became very mislead
Done


http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java:

http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java@194
PS3, Line 194: AnalyzesOk("select cast (cast ('1970-10-10 10:00:00.123' as 
timestamp) as date)");
> What will happen if some writes
Yes, it returns NULL w/o a warning. This is how explicit casts from string work 
in impala in general. E.g.:

select cast('123.0' as int);

also returns NULL w/o a warning.


http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java:

http://gerrit.cloudera.org:8080/#/c/12481/3/fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java@368
PS3, Line 368:     assertToSqlWithImplicitCasts(ctx, "select * from 
functional.alltypes, "
             :         + "functional.date_tbl where timestamp_col = date_col",
             :         "SELECT * FROM functional.alltypes, functional.date_tbl "
             :         + "WHERE timestamp_col = CAST(date_col AS TIMESTAMP)");
> I saw some weirdness around this in Hive, it seems to convert everything to
Interesting, in postgresql and mysql:

select cast('1970-01-01' as timestamp) > cast('1970-01-01' as date);

returns false. This could be a Hive bug.


http://gerrit.cloudera.org:8080/#/c/12481/3/tests/query_test/test_date_queries.py
File tests/query_test/test_date_queries.py:

http://gerrit.cloudera.org:8080/#/c/12481/3/tests/query_test/test_date_queries.py@18
PS3, Line 18: decimal
> date
Done



--
To view, visit http://gerrit.cloudera.org:8080/12481
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Gerrit-Change-Number: 12481
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Jeges <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Fri, 22 Feb 2019 18:09:30 +0000
Gerrit-HasComments: Yes

Reply via email to