[Impala-ASF-CR] IMPALA-8705: ISO:SQL:2016 datetime patterns - Milestone 3

Gabor Kaszab (Code Review) Tue, 19 Nov 2019 03:34:20 -0800

Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14714 )


Change subject: IMPALA-8705: ISO:SQL:2016 datetime patterns - Milestone 3
......................................................................


Patch Set 3:

(21 comments)

http://gerrit.cloudera.org:8080/#/c/14714/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14714/1//COMMIT_MSG@12
PS1, Line 12: The tokens introduced:
            : - Full month name (MONTH, Month, month): In a string to datetime
            :   conversion this token can parse textual month name into a 
datetime
            :   type. In a datetime to string conversion this token gives the
            :   textual representation of a month.
            : - Short month name (MON, Mon, mon): Similar to the full month name
            :   token but this works for 3-character month names like 'JAN'.
            : - Full day name (DAY, Day, day): In a datetime to string 
conversion
            :   this token gives the textual representation of a day like
            :   'Tuesday.' Not suppported in a string to datetime conversion.
            : - Short day name (DY, Dy, dy): Similar to full day name token but
            :   this works for 3-character day names like 'TUE'.
            : - Day of week (D): In a datetime to string conversion this gives a
            :   number in [1-7] where 1 represents Sunday. Not supported in a
            :   string to datetime conversion.
            : - Quarter of year (Q): In a datetime to string conversion this 
gives
            :   a number in [1-4] representing a quarter of the year. Not 
supported
            :   in a string to datetime conversion.
            : - Week of year (WW): In a datetime to string conversion this 
gives a
            :   number in [1-53] to represent the week of year where the first 
week
            :   starts from 1st of January. Not supported in a string to 
datetime
            :   conversion.
            : - Week of month (W): In a datetime to string conversion this 
gives a
            :   number in [1-5] to represent the week of month where the first 
week
            :   starts from the first day of the month. Not supported in a 
string
            :   to datetime conversion.
> Please add the actual format tokens to the commit message as well.
Done


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/exprs/timestamp-functions.h
File be/src/exprs/timestamp-functions.h:

http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/exprs/timestamp-functions.h@71
PS1, Line 71: CAPITALIZ
> CAMELCASE is a bit misleading as strings like "eBay", "iPhone" are also cal
Good suggestion. The only reason I named it CAMELCASE was that I didn't find 
anything better :)


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/exprs/timestamp-functions.cc@59
PS1, Line 59: const string TimestampFunctions::SHORT_DAY_NAMES[3][7] = {
            :     {"Sun", "Mon",
> Is this still necessary? We can just use SHORT_DAY_NAMES[CAMELCASE] instead
Indeed. I'll get rid of this one then.


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/date-parse-util.cc
File be/src/runtime/date-parse-util.cc:

http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/date-parse-util.cc@212
PS1, Line 212: int DateParser::GetDayOfWeek(const DateValue& date) {
> DCHECK(date.IsValid());
Done


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-iso-sql-format-tokenizer.cc
File be/src/runtime/datetime-iso-sql-format-tokenizer.cc:

http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@77
PS1, Line 77: {"HH12", 2
> Does "HH" have to be listed here? The length of the token is already 2.
Done


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@77
PS1, Line 77:  {"FF5", 5
> Does "FF3" have to be listed here? The length of the token is already 3.
Done


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h
File be/src/runtime/datetime-parser-common.h:

http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h@69
PS1, Line 69: ort month names like 'JAN',
> Probably this should be just called 'SHORT_MONTH_NAME_LENGTH' as all short 
Done


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h@74
PS1, Line 74:
> Comment should be about short day names and not about short month names.
Oops, copy-paste :)


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h@75
PS1, Line 75: ort day names like 'MON',
> Same as L74: should be called 'SHORT_DAY_NAME_LENGTH'.
Done


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h@296
PS1, Line 296: month, int* day)
             :     WARN_UNUSED_RESULT;
> nit: Based on the casing of the month format token in 'tok'.
Done


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h@302
PS1, Line 302:
> nit: Same as L297.
Done


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc
File be/src/runtime/datetime-parser-common.cc:

http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@197
PS1, Line 197:   DCHECK(token_start <= *token_end && *token_end <= input_end);
             :   int token_len = *token_end - token_start;
> Ok, I messed this up :)
Done


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@270
PS1, Line 270:       result.append(text_it, 1);
             :       continue;
             :     }
             :     if (tok.is_double_escaped && strncmp(text_it, "\\\\\\\"", 4) 
== 0) {
             :       result.append("\"");
             :
> This logic is a bit weird:
Indeed, it's weird a bit. However, I found it too strict to reject queries 
because of the spelling. I figured it makes more sense to be more permissive 
and pick one of the cases for these weirdly spelled inputs.

Let me make one adjustment here to check tok.type for strncmp(tok.val, "MONTH", 
5) and for strncmp(tok.val, "MON", 3) because with the current implementation 
the first one doesn't make sense. This way the output would be full uppercase 
only if all the chars are uppercase in the token.

Update: I took a look at how Oracle handles this. It is permissive as well but 
checks the first 2 chars only for getting the case. e.g. MOnth -> FEBRUARY, 
mONTH -> february, MoNTH -> February.
Opened a jira on Hive to follow the same approach: 
https://issues.apache.org/jira/browse/HIVE-22511


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@276
PS1, Line 276: else if (!tok.is_double_escaped && str
> Using (tok.type == MONTH_NAME_SHORT) condition would be more straightforwar
Done


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@282
PS1, Line 282: e if (strncmp(text_it, "\\b", 2) == 0) {
> Is this behavior documented in the ANSI standard?
I haven't found anything related in the standard, however, this padding logic 
follows what Oracle does with month/day names.


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@290
PS1, Line 290:       ++text_it;
             :     } else if (strncmp(text_it, "\\t", 2) == 0) {
             :       result.append("\t");
             :       ++text_it;
             :
> Same as L270
Same as above. However, here it's not necessary to check tok.type to make 
distinction between DAY and DY as DY is not a prefix for DAY. Added covering 
tests.


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@295
PS1, Line 295:
> Using (tok.type == DAY_NAME_SHORT) condition would be more straightforward.
Done


http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@299
PS1, Line 299: ring& FormatMonthName(int num_of_month, const DateTimeForm
> Is this behavior defined in the ANSI standard?
See my answer above.


http://gerrit.cloudera.org:8080/#/c/14714/1/tests/query_test/test_cast_with_format.py
File tests/query_test/test_cast_with_format.py:

http://gerrit.cloudera.org:8080/#/c/14714/1/tests/query_test/test_cast_with_format.py@301
PS1, Line 301: # Test different lowercase vs uppercase scenarios with the 
string to datetime path.
> In all the string -> datetime conversions, month-name is surrounded by sepa
Good point. Month names should work without separators (when FX is not given).

Additionally, I found anther issue. This should also work:
select cast('2019 october  18' as date format 'FXYYYY MonthDD');
ERROR: UDF ERROR: String to Date parse failed. Invalid string val: "2019 
october  18"

Fixed both.


http://gerrit.cloudera.org:8080/#/c/14714/1/tests/query_test/test_cast_with_format.py@399
PS1, Line 399:    # Test odd casing of month token.
             :     result = self.execute_query("select cast(date'2010-09-20' as 
string FORMAT "
             :         "'MOnth MONth MONTh')")
             :     assert result.data == ["SEPTEMBER SEPTEMBER SEPTEMBER"]
             :
> This is a bit weird that we have to add FM modifier to the MONTH token to a
Yes, Hive and Oracle works like this.

E.g. In Oracle:
TO_CHAR(TO_TIMESTAMP('2019-05-10','YYYY-MM-DD'),'MONTH-YYYY')
MAY      -2019

TO_CHAR(TO_TIMESTAMP('2019-05-10','YYYY-MM-DD'),'FMMONTH-YYYY')
MAY-2019


http://gerrit.cloudera.org:8080/#/c/14714/1/tests/query_test/test_cast_with_format.py@834
PS1, Line 834:         "'DAy dAY daY dY')")
             :     assert result.data == ["WEDNESDAY wednesday wednesday wed"]
             :
> Same as L403
See answer above.

Oracle:
TO_CHAR(TO_TIMESTAMP('2019-05-10','YYYY-MM-DD'),'DAY-YYYY')
FRIDAY   -2019
TO_CHAR(TO_TIMESTAMP('2019-05-10','YYYY-MM-DD'),'FMDAY-YYYY')
FRIDAY-2019



--
To view, visit http://gerrit.cloudera.org:8080/14714
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic797f19a1311b54e5d00d01d0a7afe1f0f21fb8f
Gerrit-Change-Number: 14714
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Comment-Date: Tue, 19 Nov 2019 11:33:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8705: ISO:SQL:2016 datetime patterns - Milestone 3

Reply via email to