[Impala-ASF-CR] IMPALA-8703: ISO:SQL:2016 datetime patterns - Milestone 1

Attila Jeges (Code Review) Mon, 01 Jul 2019 10:28:59 -0700

Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13722 )


Change subject: IMPALA-8703: ISO:SQL:2016 datetime patterns - Milestone 1
......................................................................


Patch Set 2:

(24 comments)

Another bunch of comments.

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc
File be/src/runtime/datetime-iso-sql-format-parser.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-parser.cc@184
PS2, Line 184:     DCHECK(result->year >= 1400 && result->year <= 9999);
             :     int month = 1;
             :     for (int i = 0; i < 12; ++i) {
             :       int month_length = MONTH_LENGTHS[i];
             :       if (i == 1 && 
boost::gregorian::gregorian_calendar::is_leap_year(result->year)) {
             :         month_length = 29;
             :       }
             :       if (month_length >= day_in_year) {
             :         result->month = month;
             :         result->day = day_in_year;
             :         break;
             :       }
             :       ++month;
             :       day_in_year -= month_length;
             :     }
             :     DCHECK(result->month <= 12);
             :     DCHECK(result->day <= 31);
> Move this to a separate function.
This can be done more efficiently if you use hard-coded month ranges, like 
MONTH_RANGES and LEAP_YEAR_MONTH_RANGES in be/src/runtime/date-value.cc.

See be/src/runtime/date-value.cc, L193-L208 for an example of how day_in_year 
-> (month, day) conversion is done.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h
File be/src/runtime/datetime-iso-sql-format-tokenizer.h:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h@42
PS2, Line 42: int
bool?


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.h@45
PS2, Line 45:     accept_time_toks_ = time_toks;
Shouldn't 'used_tokens_' be cleared when tokenizer is reset?


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc
File be/src/runtime/datetime-iso-sql-format-tokenizer.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@87
PS2, Line 87: used_tokens_.clear();
Any reason 'used_tokens_' is cleared here and not in Reset()?


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@93
PS2, Line 93: current_pos
In some functions this is called 'current' while in others it is called 
'current_pos'. Please use the same name everywhere.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@94
PS2, Line 94: current_pos != str_end
current_pos < str_end


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@96
PS2, Line 96: return SUCCESS
This should be break, otherwise the check in L101 won't be executed.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@104
PS2, Line 104: void DateTimeIsoSqlFormatTokenizer::ProcessSeparators(const 
char** current) {
DCHECK(*current != nullptr);


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@107
PS2, Line 107: IsSeparator(*current) && *current <= str_end
I think the condition should be:

(*current < str_end && IsSeparator(*current))

to avoid buffer over-read.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@115
PS2, Line 115:     const char** current_pos) {
DCHECK(*current_pos != nullptr);


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@120
PS2, Line 120:   string curr_token(*current_pos, curr_token_size);
             :     boost::to_upper(curr_token);
This should be done before the loop starts. No need to convert every 
prefix-string to uppercase.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@204
PS2, Line 204:
I think, you should also check that no more than one fractional tokens were 
specified.

It would be better to do this check more generally: loop through dt_ctx_->toks 
and check that there are no duplicate DateTimeFormatTokenType values in the 
TokenItem items. You can also do this check in ProcessNextToken() similarly to 
L124.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-iso-sql-format-tokenizer.cc@218
PS2, Line 218: const char* c
We can pass the char directly. No need for a pointer.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h
File be/src/runtime/datetime-parser-common.h:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@128
PS2, Line 128: results longer output length than the format string.
nit: produces output that is longer than the format string.


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@129
PS2, Line 129: then
nit: from


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.h@157
PS2, Line 157:
nit: 1 space


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc
File be/src/runtime/datetime-parser-common.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@46
PS2, Line 46: }
Add DCHECK(!century_break_time.is_special())


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@61
PS2, Line 61:   std::stringstream ss;
DCHECK(context != nullptr);


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@120
PS2, Line 120:   StringParser::ParseResult status;
Here and in the functions below add:
DCHECK(token_group != nullptr);
DCHECK(token_len > 0);
DCHECK(result != nullptr);


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@120
PS2, Line 120:   StringParser::ParseResult status;
             :   result->year = StringParser::StringToInt<int>(token_group, 
token_len, &status);
             :   if (UNLIKELY(StringParser::PARSE_SUCCESS != status)) return 
false;
             :   if (UNLIKELY(result->year < 0 || result->year > 9999)) return 
false;
             :   return true;
All the Construct.. functions are very similar, only the valid range differs.
You could create a helper function to avoid code duplication:

bool ParseAndValidate(const char* s, int len, int min, int max, int* res) {..}


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@181
PS2, Line 181: for (int i = token_len; i < 9; ++i) result->fraction *= 10;
Can you use std::pow() to simplify this?


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-parser-common.cc@185
PS2, Line 185: int GetDayInYear(int year, int month, int day_in_month) {
             :   int day_in_year = day_in_month;
             :   for (int i = 0; i < month - 1; ++i) {
             :     int month_length = MONTH_LENGTHS[i];
             :     if (boost::gregorian::gregorian_calendar::is_leap_year(year) 
&& i == 1) {
             :       month_length = 29;
             :     }
             :     day_in_year += month_length;
             :   }
             :   return day_in_year;
             : }
This function can be simplified if you use hard-coded month ranges. (See 
MONTH_RANGES and LEAP_YEAR_MONTH_RANGES in be/src/runtime/date-value.cc)


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc
File be/src/runtime/datetime-simple-date-format-parser.cc:

http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc@372
PS2, Line 372: GetDefaultFormatContext
Rename to GetDefaultSimpleDateFormatContext() ?


http://gerrit.cloudera.org:8080/#/c/13722/2/be/src/runtime/datetime-simple-date-format-parser.cc@488
PS2, Line 488:      case MONTH_IN_YEAR_SLT: {
             :         char raw_buff[tok.len];
             :         std::transform(tok_val, tok_val + tok.len, raw_buff, 
::tolower);
             :         StringValue buff(raw_buff, tok.len);
             :         boost::unordered_map<StringValue, int>::const_iterator 
iter =
             :             REV_MONTH_INDEX.find(buff);
             :         if (UNLIKELY(iter == REV_MONTH_INDEX.end())) return 
false;
             :         dt_result->month = iter->second;
             :         break;
This should go to a separate function as well.



--
To view, visit http://gerrit.cloudera.org:8080/13722
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I19d8d097a45ae6f103b6cd1b2d81aad38dfd9e23
Gerrit-Change-Number: 13722
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Jul 2019 17:28:02 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8703: ISO:SQL:2016 datetime patterns - Milestone 1

Reply via email to