Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/13722 )
Change subject: IMPALA-8703: ISO:SQL:2016 datetime patterns - Milestone 1 ...................................................................... Patch Set 2: (24 comments) Another bunch of comments. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc File be/src/runtime/datetime-iso-sql-format-parser.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@184 PS2, Line 184: DCHECK(result->year >= 1400 && result->year <= 9999); : int month = 1; : for (int i = 0; i < 12; ++i) { : int month_length = MONTH_LENGTHS[i]; : if (i == 1 && boost::gregorian::gregorian_calendar::is_leap_year(result->year)) { : month_length = 29; : } : if (month_length >= day_in_year) { : result->month = month; : result->day = day_in_year; : break; : } : ++month; : day_in_year -= month_length; : } : DCHECK(result->month <= 12); : DCHECK(result->day <= 31); > Move this to a separate function. This can be done more efficiently if you use hard-coded month ranges, like MONTH_RANGES and LEAP_YEAR_MONTH_RANGES in be/src/runtime/date-value.cc. See be/src/runtime/date-value.cc, L193-L208 for an example of how day_in_year -> (month, day) conversion is done. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h File be/src/runtime/datetime-iso-sql-format-tokenizer.h: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h@42 PS2, Line 42: int bool? http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h@45 PS2, Line 45: accept_time_toks_ = time_toks; Shouldn't 'used_tokens_' be cleared when tokenizer is reset? http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc File be/src/runtime/datetime-iso-sql-format-tokenizer.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@87 PS2, Line 87: used_tokens_.clear(); Any reason 'used_tokens_' is cleared here and not in Reset()? http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@93 PS2, Line 93: current_pos In some functions this is called 'current' while in others it is called 'current_pos'. Please use the same name everywhere. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@94 PS2, Line 94: current_pos != str_end current_pos < str_end http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@96 PS2, Line 96: return SUCCESS This should be break, otherwise the check in L101 won't be executed. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@104 PS2, Line 104: void DateTimeIsoSqlFormatTokenizer::ProcessSeparators(const char** current) { DCHECK(*current != nullptr); http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@107 PS2, Line 107: IsSeparator(*current) && *current <= str_end I think the condition should be: (*current < str_end && IsSeparator(*current)) to avoid buffer over-read. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@115 PS2, Line 115: const char** current_pos) { DCHECK(*current_pos != nullptr); http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@120 PS2, Line 120: string curr_token(*current_pos, curr_token_size); : boost::to_upper(curr_token); This should be done before the loop starts. No need to convert every prefix-string to uppercase. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@204 PS2, Line 204: I think, you should also check that no more than one fractional tokens were specified. It would be better to do this check more generally: loop through dt_ctx_->toks and check that there are no duplicate DateTimeFormatTokenType values in the TokenItem items. You can also do this check in ProcessNextToken() similarly to L124. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@218 PS2, Line 218: const char* c We can pass the char directly. No need for a pointer. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h File be/src/runtime/datetime-parser-common.h: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@128 PS2, Line 128: results longer output length than the format string. nit: produces output that is longer than the format string. http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@129 PS2, Line 129: then nit: from http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@157 PS2, Line 157: nit: 1 space http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc File be/src/runtime/datetime-parser-common.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@46 PS2, Line 46: } Add DCHECK(!century_break_time.is_special()) http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@61 PS2, Line 61: std::stringstream ss; DCHECK(context != nullptr); http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@120 PS2, Line 120: StringParser::ParseResult status; Here and in the functions below add: DCHECK(token_group != nullptr); DCHECK(token_len > 0); DCHECK(result != nullptr); http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@120 PS2, Line 120: StringParser::ParseResult status; : result->year = StringParser::StringToInt<int>(token_group, token_len, &status); : if (UNLIKELY(StringParser::PARSE_SUCCESS != status)) return false; : if (UNLIKELY(result->year < 0 || result->year > 9999)) return false; : return true; All the Construct.. functions are very similar, only the valid range differs. You could create a helper function to avoid code duplication: bool ParseAndValidate(const char* s, int len, int min, int max, int* res) {..} http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@181 PS2, Line 181: for (int i = token_len; i < 9; ++i) result->fraction *= 10; Can you use std::pow() to simplify this? http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@185 PS2, Line 185: int GetDayInYear(int year, int month, int day_in_month) { : int day_in_year = day_in_month; : for (int i = 0; i < month - 1; ++i) { : int month_length = MONTH_LENGTHS[i]; : if (boost::gregorian::gregorian_calendar::is_leap_year(year) && i == 1) { : month_length = 29; : } : day_in_year += month_length; : } : return day_in_year; : } This function can be simplified if you use hard-coded month ranges. (See MONTH_RANGES and LEAP_YEAR_MONTH_RANGES in be/src/runtime/date-value.cc) http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc File be/src/runtime/datetime-simple-date-format-parser.cc: http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc@372 PS2, Line 372: GetDefaultFormatContext Rename to GetDefaultSimpleDateFormatContext() ? http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc@488 PS2, Line 488: case MONTH_IN_YEAR_SLT: { : char raw_buff[tok.len]; : std::transform(tok_val, tok_val + tok.len, raw_buff, ::tolower); : StringValue buff(raw_buff, tok.len); : boost::unordered_map<StringValue, int>::const_iterator iter = : REV_MONTH_INDEX.find(buff); : if (UNLIKELY(iter == REV_MONTH_INDEX.end())) return false; : dt_result->month = iter->second; : break; This should go to a separate function as well. -- To view, visit http://gerrit.cloudera.org:8080/13722 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I19d8d097a45ae6f103b6cd1b2d81aad38dfd9e23 Gerrit-Change-Number: 13722 Gerrit-PatchSet: 2 Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Comment-Date: Mon, 01 Jul 2019 17:28:02 +0000 Gerrit-HasComments: Yes