Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 )
Change subject: IMPALA-3307: Add support for IANA time-zone db ...................................................................... Patch Set 1: (25 comments) http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@34 PS1, Line 34: statup typo: startup http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc File be/src/benchmarks/convert-timestamp-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@134 PS1, Line 134: val I don't know what RAND_MAX is here, but I think that it can be 32K, which would mean that the time part would move very slowly from the starting time. Is this by design? In order to have "really random" time, I would use cpp11 random classes, or put together the time from several rand() calls, each with a modulo smaller than 32k. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@166 PS1, Line 166: d I am a bit concerned about writing to the same buffer from every thread - maybe it does not hurt performance, but it is not how these functions are "normally" used. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@709 PS1, Line 709: m1 = measure_multithreaded_elapsed_time(glibc_test_utc_to_unix, num_of_threads,BATCH_SIZE, nit: long line http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@711 PS1, Line 711: m2 = measure_multithreaded_elapsed_time(cctz_test_utc_to_unix, num_of_threads, BATCH_SIZE, nit: long line http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@725 PS1, Line 725: m1 = measure_multithreaded_elapsed_time(boost_test_from_utc, num_of_threads, BATCH_SIZE, nit: long line http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@727 PS1, Line 727: m2 = measure_multithreaded_elapsed_time(cctz_test_from_utc, num_of_threads, BATCH_SIZE, nit: long line http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@743 PS1, Line 743: m2 = measure_multithreaded_elapsed_time(cctz_test_utc_to_local, num_of_threads, BATCH_SIZE, nit: long line http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc@523 PS1, Line 523: / nit: missing spaces http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@63 PS1, Line 63: if (timezone == nullptr) { This could be UNLIKELY. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@90 PS1, Line 90: t.fractional_seconds() You have explained in person that 't' is used instead of 'to_cs'for sub-seconds, because cctz't time type does not support nano seconds, and no timezone rule affects sub-seconds. Could you add a comment about this? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@105 PS1, Line 105: if (timezone == nullptr) { This could be UNLIKELY. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@142 PS1, Line 142: t.fractional_seconds() Same as in line 90. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@28 PS1, Line 28: /// Functions to load and access the time-zone database. Please add some comments about thread-safety (e.g. "Initialize() should be called at startup to load every timezone rule to memory. After it returned without error, other functions can be safely called from multiple threads."). http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@160 PS1, Line 160: bool IsSymbolicLink(const string& path, string* real_path) { Maybe this could be moved to class FileSystemUtil. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@412 PS1, Line 412: char buffer[64*1024]; I am a bit concerned about this - is it ok to keep buffers of this size on stack in Impala? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@438 PS1, Line 438: // Load custom time-zone abbreviations from 'is' and add them to 'tz_name_map_'. In most of Impala, the comments for private functions are in the .h file and start with "///". Can you move them to the header for the sake of consistency? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@439 PS1, Line 439: void TimezoneDatabase::LoadZoneAbbreviations(istream &is, Are you sure that a corrupt abbreviation file is not an error? Maybe duplicates can be tolerated, but other issues are errors in my opinion. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@440 PS1, Line 440: const char *path /* = nullptr */ I did not find any caller that fills this argument. Please check if it can be removed. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@460 PS1, Line 460: Skippng typo: Skippng is used consistently instead of Skipping - is this intentional, or type + copy paste? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@484 PS1, Line 484: if (tz_name_map_.find(abbrev) != tz_name_map_.end()) { This could be checked before processing value and merged with the abbreviation duplicate checking in the non fixed offset branch. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/timestamp-value.cc@172 PS1, Line 172: auto from_tp = FromUnixSeconds(unix_time); : auto to_cs = cctz::convert(from_tp, TimezoneDatabase::GetUtcTimezone()); : // boost::gregorian::date() throws boost::gregorian::bad_year if year is not in the : // 1400..9999 range. Need to check validity before creating the date object. : if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs)))) { : return ptime(not_a_date_time); : } else { : return ptime( : boost::gregorian::date(to_cs.year(), to_cs.month(), to_cs.day()), : boost::posix_time::time_duration(to_cs.hour(), to_cs.minute(), to_cs.second())); This could be replaced by calling TimestampValue::UnixTimeToLocalPtime(unix_time, &TimezoneDatabase::GetUtcTimezone()). http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/timestamp-value.inline.h@55 PS1, Line 55: inline bool TimestampValue::UtcToUnixTime(time_t* unix_time) const { Two opposing ideas: a: This and ToUnixTime could call a new function that would contain the common cctz parts. b: I think that this could be done much faster by calculating the days since 1970 and (date_ minus a constant), multiplying it with 24*60*60 and adding time_/10^9. Something similar happens in the fast path of TimestampValue::UnixTimeToUtcPtime(). boost::gregorian::date stores the days since specific date, which is practical mainly because it can be converted to seconds without dealing with "calendar stuff" like leap years, while date_.year()/month()/day() has to do some more complex calculations. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/util/time-test.cc File be/src/util/time-test.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/util/time-test.cc@60 PS1, Line 60: EXPECT_EQ("1677-09-21 00:12:43.146", : ToUtcStringFromUnixMillis(INT64_MIN / NANOS_PER_MICRO / MICROS_PER_MILLI)); : EXPECT_EQ("1677-09-21 00:12:43.145225", : ToUtcStringFromUnixMicros(INT64_MIN / NANOS_PER_MICRO)); Why have these times changed? Are these results "more correct" than the old ones? http://gerrit.cloudera.org:8080/#/c/9986/1/fe/src/test/java/org/apache/impala/testutil/TestUtils.java File fe/src/test/java/org/apache/impala/testutil/TestUtils.java: http://gerrit.cloudera.org:8080/#/c/9986/1/fe/src/test/java/org/apache/impala/testutil/TestUtils.java@267 PS1, Line 267: queryCtx.setLocal_time_zone("PST8PDT"); Can you explain to what was changed here? Some tests ran differently depending on the local time zone? -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 1 Gerrit-Owner: Attila Jeges <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 12 Apr 2018 16:47:51 +0000 Gerrit-HasComments: Yes
