Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/14714 )
Change subject: IMPALA-8705: ISO:SQL:2016 datetime patterns - Milestone 3 ...................................................................... Patch Set 3: (21 comments) http://gerrit.cloudera.org:8080/#/c/14714/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14714/1//COMMIT_MSG@12 PS1, Line 12: The tokens introduced: : - Full month name (MONTH, Month, month): In a string to datetime : conversion this token can parse textual month name into a datetime : type. In a datetime to string conversion this token gives the : textual representation of a month. : - Short month name (MON, Mon, mon): Similar to the full month name : token but this works for 3-character month names like 'JAN'. : - Full day name (DAY, Day, day): In a datetime to string conversion : this token gives the textual representation of a day like : 'Tuesday.' Not suppported in a string to datetime conversion. : - Short day name (DY, Dy, dy): Similar to full day name token but : this works for 3-character day names like 'TUE'. : - Day of week (D): In a datetime to string conversion this gives a : number in [1-7] where 1 represents Sunday. Not supported in a : string to datetime conversion. : - Quarter of year (Q): In a datetime to string conversion this gives : a number in [1-4] representing a quarter of the year. Not supported : in a string to datetime conversion. : - Week of year (WW): In a datetime to string conversion this gives a : number in [1-53] to represent the week of year where the first week : starts from 1st of January. Not supported in a string to datetime : conversion. : - Week of month (W): In a datetime to string conversion this gives a : number in [1-5] to represent the week of month where the first week : starts from the first day of the month. Not supported in a string : to datetime conversion. > Please add the actual format tokens to the commit message as well. Done http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/exprs/timestamp-functions.h File be/src/exprs/timestamp-functions.h: http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/exprs/timestamp-functions.h@71 PS1, Line 71: CAPITALIZ > CAMELCASE is a bit misleading as strings like "eBay", "iPhone" are also cal Good suggestion. The only reason I named it CAMELCASE was that I didn't find anything better :) http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/exprs/timestamp-functions.cc@59 PS1, Line 59: const string TimestampFunctions::SHORT_DAY_NAMES[3][7] = { : {"Sun", "Mon", > Is this still necessary? We can just use SHORT_DAY_NAMES[CAMELCASE] instead Indeed. I'll get rid of this one then. http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/date-parse-util.cc File be/src/runtime/date-parse-util.cc: http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/date-parse-util.cc@212 PS1, Line 212: int DateParser::GetDayOfWeek(const DateValue& date) { > DCHECK(date.IsValid()); Done http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-iso-sql-format-tokenizer.cc File be/src/runtime/datetime-iso-sql-format-tokenizer.cc: http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@77 PS1, Line 77: {"HH12", 2 > Does "HH" have to be listed here? The length of the token is already 2. Done http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@77 PS1, Line 77: {"FF5", 5 > Does "FF3" have to be listed here? The length of the token is already 3. Done http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h File be/src/runtime/datetime-parser-common.h: http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h@69 PS1, Line 69: ort month names like 'JAN', > Probably this should be just called 'SHORT_MONTH_NAME_LENGTH' as all short Done http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h@74 PS1, Line 74: > Comment should be about short day names and not about short month names. Oops, copy-paste :) http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h@75 PS1, Line 75: ort day names like 'MON', > Same as L74: should be called 'SHORT_DAY_NAME_LENGTH'. Done http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h@296 PS1, Line 296: month, int* day) : WARN_UNUSED_RESULT; > nit: Based on the casing of the month format token in 'tok'. Done http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.h@302 PS1, Line 302: > nit: Same as L297. Done http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc File be/src/runtime/datetime-parser-common.cc: http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@197 PS1, Line 197: DCHECK(token_start <= *token_end && *token_end <= input_end); : int token_len = *token_end - token_start; > Ok, I messed this up :) Done http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@270 PS1, Line 270: result.append(text_it, 1); : continue; : } : if (tok.is_double_escaped && strncmp(text_it, "\\\\\\\"", 4) == 0) { : result.append("\""); : > This logic is a bit weird: Indeed, it's weird a bit. However, I found it too strict to reject queries because of the spelling. I figured it makes more sense to be more permissive and pick one of the cases for these weirdly spelled inputs. Let me make one adjustment here to check tok.type for strncmp(tok.val, "MONTH", 5) and for strncmp(tok.val, "MON", 3) because with the current implementation the first one doesn't make sense. This way the output would be full uppercase only if all the chars are uppercase in the token. Update: I took a look at how Oracle handles this. It is permissive as well but checks the first 2 chars only for getting the case. e.g. MOnth -> FEBRUARY, mONTH -> february, MoNTH -> February. Opened a jira on Hive to follow the same approach: https://issues.apache.org/jira/browse/HIVE-22511 http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@276 PS1, Line 276: else if (!tok.is_double_escaped && str > Using (tok.type == MONTH_NAME_SHORT) condition would be more straightforwar Done http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@282 PS1, Line 282: e if (strncmp(text_it, "\\b", 2) == 0) { > Is this behavior documented in the ANSI standard? I haven't found anything related in the standard, however, this padding logic follows what Oracle does with month/day names. http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@290 PS1, Line 290: ++text_it; : } else if (strncmp(text_it, "\\t", 2) == 0) { : result.append("\t"); : ++text_it; : > Same as L270 Same as above. However, here it's not necessary to check tok.type to make distinction between DAY and DY as DY is not a prefix for DAY. Added covering tests. http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@295 PS1, Line 295: > Using (tok.type == DAY_NAME_SHORT) condition would be more straightforward. Done http://gerrit.cloudera.org:8080/#/c/14714/1/be/src/runtime/datetime-parser-common.cc@299 PS1, Line 299: ring& FormatMonthName(int num_of_month, const DateTimeForm > Is this behavior defined in the ANSI standard? See my answer above. http://gerrit.cloudera.org:8080/#/c/14714/1/tests/query_test/test_cast_with_format.py File tests/query_test/test_cast_with_format.py: http://gerrit.cloudera.org:8080/#/c/14714/1/tests/query_test/test_cast_with_format.py@301 PS1, Line 301: # Test different lowercase vs uppercase scenarios with the string to datetime path. > In all the string -> datetime conversions, month-name is surrounded by sepa Good point. Month names should work without separators (when FX is not given). Additionally, I found anther issue. This should also work: select cast('2019 october 18' as date format 'FXYYYY MonthDD'); ERROR: UDF ERROR: String to Date parse failed. Invalid string val: "2019 october 18" Fixed both. http://gerrit.cloudera.org:8080/#/c/14714/1/tests/query_test/test_cast_with_format.py@399 PS1, Line 399: # Test odd casing of month token. : result = self.execute_query("select cast(date'2010-09-20' as string FORMAT " : "'MOnth MONth MONTh')") : assert result.data == ["SEPTEMBER SEPTEMBER SEPTEMBER"] : > This is a bit weird that we have to add FM modifier to the MONTH token to a Yes, Hive and Oracle works like this. E.g. In Oracle: TO_CHAR(TO_TIMESTAMP('2019-05-10','YYYY-MM-DD'),'MONTH-YYYY') MAY -2019 TO_CHAR(TO_TIMESTAMP('2019-05-10','YYYY-MM-DD'),'FMMONTH-YYYY') MAY-2019 http://gerrit.cloudera.org:8080/#/c/14714/1/tests/query_test/test_cast_with_format.py@834 PS1, Line 834: "'DAy dAY daY dY')") : assert result.data == ["WEDNESDAY wednesday wednesday wed"] : > Same as L403 See answer above. Oracle: TO_CHAR(TO_TIMESTAMP('2019-05-10','YYYY-MM-DD'),'DAY-YYYY') FRIDAY -2019 TO_CHAR(TO_TIMESTAMP('2019-05-10','YYYY-MM-DD'),'FMDAY-YYYY') FRIDAY-2019 -- To view, visit http://gerrit.cloudera.org:8080/14714 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic797f19a1311b54e5d00d01d0a7afe1f0f21fb8f Gerrit-Change-Number: 14714 Gerrit-PatchSet: 3 Gerrit-Owner: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Comment-Date: Tue, 19 Nov 2019 11:33:13 +0000 Gerrit-HasComments: Yes
