Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12481 )

Change subject: IMPALA-7368: Add initial support for DATE type
......................................................................


Patch Set 12:

(32 comments)

http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@29
PS11, Line 29:
             : - Implicit casting between DATE and other typ
> it's interesting that this differs from the behavior of TIMESTAMP today, at
Implicit casting from STRING to TIMESTAMP works in Impala in most cases 
(parameters to function calls/operators, values in insert/select statements, 
etc).
- Hive does these kind of implicit conversions too,
- PostgreSQL on the other hand doesn't.
- Couldn't find anything about STRING->TIMESTAMP implicit conversions in the 
SQL ANSI standard.

Timestamp arithmetic expressions in Impala are an exception: in 
TimestampArithmeticExpr.java implicit conversions are explicitly forbidden.
- Neither Hive nor PostgreSQL allow implicit conversions in arithmetic 
expressions.


http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@38
PS11, Line 38: ution logic is not adequate anymore.
             :   For example, it resolves the
             :   if(false, '2011-01-01', DATE '1499-02-02') function call to the
             :   if(BOOLEAN, TIMESTAMP, TIMESTA
> wouldn't the result be the same either way for this example?
Not necessarily. The valid range for TIMESTAMP is from 1400-01-01 to 
9999-12-31. The valid range for DATE is from 0000-01-01 to 9999-12-31. 
Therefore:
year(cast('1399-12-31' as TIMESTAMP)) returns NULL, whereas
year(cast('1399-12-31' as DATE)) will return 1399.

In Hive the valid range for TIMESTAMP and DATE is the same, so this is not an 
issue.


http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@42
PS11, Line 42: DATE, DATE
> how is this defined, specifically?
I've changed the wording to describe the function resolution algorithm and 
related issues in more detail.


http://gerrit.cloudera.org:8080/#/c/12481/11//COMMIT_MSG@70
PS11, Line 70: iders overloaded function
> It seems easier to leave it enabled so long as the current support is corre
I agree with Tim, introducing a feature flag for DATE support would take a 
considerable effort since this patch touches lots of files.

Testing whether turning off the feature flag works as expected (i.e. we get the 
same behavior and error messages as before) can also be complicated.

Also, the patch includes a sql-parser change. Moving the parser change behind a 
feature flag is probably not that straightforward either.


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/codegen/codegen-anyval.cc
File be/src/codegen/codegen-anyval.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/codegen/codegen-anyval.cc@68
PS11, Line 68:       return cg->i64_type();
> why do we lower to i64 instead of i32 given the slot size is only 4 bytes?
The highest 4 bytes store the actual value, the lowest byte is used for the 
"is_null" flag.

Similarly, TYPE_INT represents an i32 value but its lowered type is i64.


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exec/hdfs-table-sink.cc@500
PS11, Line 500:  (type.type == TYPE_DATE) {
              :             ColumnDescriptor col_desc = 
table_desc_->col_descs()[num_clustering_cols + i];
              :             stringstream error_msg;
> nit: why not just pass that string directly?
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/aggregate-functions-ir.cc@381
PS11, Line 381: s should not have a negative perfor
> Yea, my feeling was just that one fewer feature to test and bug-fix and sup
Removed avg(DATE) to avoid semantic problems.


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/conditional-functions-ir.cc
File be/src/exprs/conditional-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/conditional-functions-ir.cc@37
PS11, Line 37: IS_NULL_COMPUTE_
> per comment elsewhere, I found it clearer in the old version since I didn't
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc@3386
PS11, Line 3386:
               : TEST_F(ExprTest, CastDateExprs) {
> add some tests for invalid dates like feb 30
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/exprs/expr-test.cc@3475
PS11, Line 3475:   TestError("cast(cast(null as date) as double)");
> can you test a before-the-epoch timestamp that includes a time portion? eg
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/date-value.h
File be/src/runtime/date-value.h:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/date-value.h@40
PS11, Line 40: /// - Proleptic Gregorian calendar is used to calculate the 
number of days since epoch,
             : ///   which can lead to different representation of historical 
dates compared to Hive.
             : ///   
(https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar)
> I know this was from the original change a few months back, but should we b
I didn't get the chance to test Hive 3.1, but yes it seems that Hive 3.1 
switched to using Proleptic Gregorian calendar too.

We could introduce a feature flag in Impala to switch between Julian/Gregorian 
and Proleptic Gregorian calendars. Please note though that this is not just a 
DATE issue: TIMESTAMP type already has the same calendar-incompatibility 
problem.

If we want to fix this, I suggest introducing a feature flag that affects both 
TIMESTAMP & DATE types in a separate patch.


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/runtime/timestamp-value.inline.h@36
PS11, Line 36:   return TimestampValue(EPOCH + 
boost::gregorian::date_duration(days), t);
> is there any chance of exceptions in this boost API? I seem to recall there
I've tested EPOCH + boost::gregorian::date_duration(days) with days outside of 
the valid [1400-01-01..9999.12.31] range and it doesn't throw an exception.

If 'days' is outside of the supported range, L36 returns a TimestampValue 
instance where TimestampValue.HasDate() is false.


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/service/fe-support.cc@638
PS11, Line 638:     jclass fe_support_class, jstring date) {
> I see 'caller_class' was copied from above, but that's a somewhat inaccurat
Renamed it to 'fe_support_class' here and elsewhere.


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/udf/udf.h
File be/src/udf/udf.h:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/udf/udf.h@599
PS11, Line 599:   explicit DateVal(underlying_type_t val = 0) : val(val) { }
> should be marked explicit?
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc
File be/src/util/string-parser-test.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc@534
PS11, Line 534:   // Test bad formats.
> worth a check where there is a valid leading date followed by junk ("2010-0
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/string-parser-test.cc@544
PS11, Line 544:   TestDateValue("2010-01-01 12:22:22", invalid_date, 
StringParser::PARSE_FAILURE);
> what about special days like February 29th? Worth testing that in a year wh
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/symbols-util.cc
File be/src/util/symbols-util.cc:

http://gerrit.cloudera.org:8080/#/c/12481/11/be/src/util/symbols-util.cc@129
PS11, Line 129:     CASE_TYPE_APPEND_MANGLED_TOKEN(TYPE_SMALLINT, SmallIntVal)
> I dunno that using the ITERATE_OVER is really any clearer than just repeati
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/Exprs.thrift
File common/thrift/Exprs.thrift:

http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/Exprs.thrift@58
PS11, Line 58:   // String representation
             :   2: required string date_string
> why is it necessary to include the string representation here too? I don't
Having the string representation here is not absolutely necessary, it is 
included for convenience.

When we create a DateLiteral object from a TDateLiteral object, it is 
beneficial to pass both the string and the i32 representation to the 
DateLiteral constructor. Otherwise, we would have to convert the i32 
representation to the string representation on the Java side. This would 
require calling a native BE function from Java, which is somewhat inconvenient 
and I'd like to avoid it if possible.

FLOAT and other similar types don't have this problem as they can always be 
converted to string in Java easily.

TIMESTAMPs are handled differently as there is no way to directly specify a 
literal timestamp from SQL.


http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/12481/11/common/thrift/ImpalaInternalService.thrift@745
PS11, Line 745: since
> typo
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java
File fe/src/main/java/org/apache/impala/analysis/DateLiteral.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@61
PS11, Line 61:       throw new AnalysisException("Error parsing date literal: " 
+ e.getMessage(), e);
> would this actually be due to an invalid date literal? or would this be tru
True, this is the wrong error message. Replaced with "Error parsing.." which is 
what we use for similar errors in other places..


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@65
PS11, Line 65:       throw new AnalysisException("Invalid date literal: '" + 
strDate + "'");
> I think it's worth quoting the date literal they passed in so if they did s
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@66
PS11, Line 66:
> nit: no need for else here since you threw above
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/DateLiteral.java@103
PS11, Line 103:  retur
> nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java
File fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/analysis/PartitionSpec.java@83
PS11, Line 83:  possible, replace STRING_LITERAL value with the corresponding
             :     //       DateLiteral object.
> hrm... this part is spooking me a little bit. What happens if you have mult
I've changed the wording, I hope it makes more sense now.

L125-131 was added to solve the problem they you described: different string 
representations of the same date will be disambiguated and replaced with the 
corresponding DateLiteral object.

'partitionSpec_' is made immutable in L135 in PartitionSpec.analyze().

I've checked the FE code: PartitionSpec.getPartitionSpecKeyValues() is called 
only after PartitionSpec.analyze() is called, so the returned list should 
already be immutable.


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1148
PS11, Line 1148:
> is this capability part of the SQL standard?
It isn't, and Hive doesn't support it either. Removed it for now.


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@293
PS11, Line 293: t dates in the 0000-01-01..99
> should we instead cap at the number of unique dates in 0000-00-00 to 9999-1
True, fixed it.


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/Function.java
File fe/src/main/java/org/apache/impala/catalog/Function.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/catalog/Function.java@191
PS11, Line 191:   public boolean compare(Function other, CompareMode mode) {
> Is there an ANSI standard for function resolution with implicit casts? Have
I took a quick look at the SQL ANSI standard, but couldn't find anything about 
function name resolution.

Hive takes a very different approach, it doesn't have overloaded functions in 
the same sense as Impala. Built-in functions take generic parameters and decide 
on their own what implicit casts to apply on their arguments and in which 
order. For instance:
- add_months() used to convert its first argument to date (even if it was a 
timestamp string) until recently, but since HIVE-19370 it tries to do a 
timestamp-conversion first and falls back on date-conversion only if that fails.
- date_add() on the other hand still does date-conversion only. This is 
probably a bug, but this is how it works right now.


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1248
PS11, Line 1248:               "Scanning DATE values in table '" + 
table.getFullName() +
> Can we include the table name here?
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@732
PS11, Line 732: TestResolveFunction
> nit: here and elsewhere, the function names should start with a lower case
Changed the 1st letter of function names to lower case in all the java files 
that I've touched in this patch. I ended up changing another 5K+ lines in 37 
files. I will upload these changes in a separate patch-set.


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
File fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java@67
PS11, Line 67:     testLiteralExprNegative("ABC", Type.DATE);
> worth a negative test here for an invalid date like february 31?
Done


http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/ParserTest.java
File fe/src/test/java/org/apache/impala/analysis/ParserTest.java:

http://gerrit.cloudera.org:8080/#/c/12481/11/fe/src/test/java/org/apache/impala/analysis/ParserTest.java@1215
PS11, Line 1215:     ParserError("select date '2011--01'");
> same, check for feb 31
Done


http://gerrit.cloudera.org:8080/#/c/12481/10/tests/query_test/test_date_queries.py
File tests/query_test/test_date_queries.py:

http://gerrit.cloudera.org:8080/#/c/12481/10/tests/query_test/test_date_queries.py@107
PS10, Line 107:   def test_impala_shell(self):
> We don't need to test this for multiple file formats, right?
Correct. Note, that this test will be executed only once, since 
'test_impala_shell' doesn't take a 'vector' parameter.



--
To view, visit http://gerrit.cloudera.org:8080/12481
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Gerrit-Change-Number: 12481
Gerrit-PatchSet: 12
Gerrit-Owner: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Thu, 21 Mar 2019 18:47:06 +0000
Gerrit-HasComments: Yes

Reply via email to