Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
......................................................................


Patch Set 1:

(25 comments)

http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@34
PS1, Line 34: statup
typo: startup


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc
File be/src/benchmarks/convert-timestamp-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@134
PS1, Line 134: val
I don't know what RAND_MAX is here, but I think that it can be 32K, which would 
mean that the time part would move very slowly from the starting time. Is this 
by design?

In order to have "really random" time, I would use cpp11 random classes, or put 
together the time from several rand() calls, each with a modulo smaller than 
32k.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@166
PS1, Line 166:  d
I am a bit concerned about writing to the same buffer from every thread - maybe 
it does not hurt performance, but it is not how these functions are "normally" 
used.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@709
PS1, Line 709:     m1 = 
measure_multithreaded_elapsed_time(glibc_test_utc_to_unix, 
num_of_threads,BATCH_SIZE,
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@711
PS1, Line 711:     m2 = 
measure_multithreaded_elapsed_time(cctz_test_utc_to_unix, num_of_threads, 
BATCH_SIZE,
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@725
PS1, Line 725:     m1 = measure_multithreaded_elapsed_time(boost_test_from_utc, 
num_of_threads, BATCH_SIZE,
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@727
PS1, Line 727:     m2 = measure_multithreaded_elapsed_time(cctz_test_from_utc, 
num_of_threads, BATCH_SIZE,
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@743
PS1, Line 743:     m2 = 
measure_multithreaded_elapsed_time(cctz_test_utc_to_local, num_of_threads, 
BATCH_SIZE,
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc@523
PS1, Line 523: /
nit: missing spaces


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@63
PS1, Line 63:   if (timezone == nullptr) {
This could be UNLIKELY.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@90
PS1, Line 90: t.fractional_seconds()
You have explained in person that 't' is used instead of 'to_cs'for 
sub-seconds, because cctz't time type does not support nano seconds, and no 
timezone rule affects sub-seconds. Could you add a comment about this?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@105
PS1, Line 105:   if (timezone == nullptr) {
This could be UNLIKELY.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@142
PS1, Line 142: t.fractional_seconds()
Same as in line 90.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@28
PS1, Line 28: /// Functions to load and access the time-zone database.
Please add some comments about thread-safety (e.g. "Initialize() should be 
called at startup to load every timezone rule to memory. After it returned 
without error, other functions can be safely called from multiple threads.").


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@160
PS1, Line 160: bool IsSymbolicLink(const string& path, string* real_path) {
Maybe this could be moved to class FileSystemUtil.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@412
PS1, Line 412:   char buffer[64*1024];
I am a bit concerned about this - is it ok to keep buffers of this size on 
stack in Impala?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@438
PS1, Line 438: // Load custom time-zone abbreviations from 'is' and add them to 
'tz_name_map_'.
In most of Impala, the comments for private functions are in the .h file and 
start with "///". Can you move them to the header for the sake of consistency?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@439
PS1, Line 439: void TimezoneDatabase::LoadZoneAbbreviations(istream &is,
Are you sure that a corrupt abbreviation file is not an error? Maybe duplicates 
can be tolerated, but other issues are errors in my opinion.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@440
PS1, Line 440: const char *path /* = nullptr */
I did not find any caller that fills this argument. Please check if it can be 
removed.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@460
PS1, Line 460: Skippng
typo: Skippng is used consistently instead of Skipping - is this intentional, 
or type + copy paste?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@484
PS1, Line 484:       if (tz_name_map_.find(abbrev) != tz_name_map_.end()) {
This could be checked before processing value and merged with the abbreviation 
duplicate checking in the non fixed offset branch.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/timestamp-value.cc@172
PS1, Line 172:   auto from_tp = FromUnixSeconds(unix_time);
             :   auto to_cs = cctz::convert(from_tp, 
TimezoneDatabase::GetUtcTimezone());
             :   // boost::gregorian::date() throws boost::gregorian::bad_year 
if year is not in the
             :   // 1400..9999 range. Need to check validity before creating 
the date object.
             :   if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs)))) {
             :     return ptime(not_a_date_time);
             :   } else {
             :     return ptime(
             :         boost::gregorian::date(to_cs.year(), to_cs.month(), 
to_cs.day()),
             :         boost::posix_time::time_duration(to_cs.hour(), 
to_cs.minute(), to_cs.second()));
This could be replaced by calling 
TimestampValue::UnixTimeToLocalPtime(unix_time, 
&TimezoneDatabase::GetUtcTimezone()).


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/timestamp-value.inline.h@55
PS1, Line 55: inline bool TimestampValue::UtcToUnixTime(time_t* unix_time) 
const {
Two opposing ideas:
a: This and ToUnixTime could call a new function that would contain the common 
cctz parts.
b: I think that this could be done much faster by calculating the days since 
1970 and (date_ minus a constant), multiplying it with 24*60*60 and adding 
time_/10^9. Something similar happens in the fast path of 
TimestampValue::UnixTimeToUtcPtime().

boost::gregorian::date stores the days since specific date, which is practical 
mainly because it can be converted to seconds without dealing with "calendar 
stuff" like leap years, while date_.year()/month()/day() has to do some more 
complex calculations.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/util/time-test.cc
File be/src/util/time-test.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/util/time-test.cc@60
PS1, Line 60:   EXPECT_EQ("1677-09-21 00:12:43.146",
            :       ToUtcStringFromUnixMillis(INT64_MIN / NANOS_PER_MICRO / 
MICROS_PER_MILLI));
            :   EXPECT_EQ("1677-09-21 00:12:43.145225",
            :       ToUtcStringFromUnixMicros(INT64_MIN / NANOS_PER_MICRO));
Why have these times changed? Are these results "more correct" than the old 
ones?


http://gerrit.cloudera.org:8080/#/c/9986/1/fe/src/test/java/org/apache/impala/testutil/TestUtils.java
File fe/src/test/java/org/apache/impala/testutil/TestUtils.java:

http://gerrit.cloudera.org:8080/#/c/9986/1/fe/src/test/java/org/apache/impala/testutil/TestUtils.java@267
PS1, Line 267:     queryCtx.setLocal_time_zone("PST8PDT");
Can you explain to what was changed here? Some tests ran differently depending 
on the local time zone?



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Apr 2018 16:47:51 +0000
Gerrit-HasComments: Yes

Reply via email to