Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 )
Change subject: IMPALA-3307: Add support for IANA time-zone db ...................................................................... Patch Set 2: (55 comments) http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc File be/src/benchmarks/convert-timestamp-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@150 PS2, Line 150: void AddTestDataDateTimes(vector<TimestampValue>& data, int n, const string& startstr) { > Since it's in a benchmark it doesn't really matter, anyway, some nit commen Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@172 PS2, Line 172: TestData > This looks like a fairly general class to me that could move to util/benchm 'measure_multithreaded_elapsed_time' function is not that general, it is used here as a quick and dirty way to verify that glibc calls are executed in a serial fashion even in a multithreaded environment. Because of that I would like to keep this class here. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@174 PS2, Line 174: > > > Nit: since C++11 you don't need to put spaces between right angle brackets. Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@228 PS2, Line 228: const shared_ptr<vector<FROM> > data_ > Nit: I think using 'const vector<FROM>&' would be simpler. I don't really s True, I rewrote this portion of the code so many times that I lost track of where I was going with it :) http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@319 PS2, Line 319: boost_throw_if_date_out_of_range(local_time.date()); > Is this function call really needed here? Don't we trust boost that it vali This is how the function was implemented originally. Since the point of this benchmark program is to compare the old implementation with the new one, I figured I shouldn't change the old code. I think the call is necessary, because boost might not validate the date range until you call the gregorian::date accessors. It is confusing. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@444 PS2, Line 444: time_t utc = > We could replace this with boost_utc_to_unix_time. This conversion should b This is how UtcToLocal was implemented originally. The goal of this benchmark program is to compare the original implementation with the new one (including the glue code). http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@530 PS2, Line 530: // : // Test UnixTimeToUtcPtime (boost is expected to be faster than CCTZ) : // : : // boost : boost::pos > I think that this a bit misleading, as boost_unix_time_to_utc_ptime never u Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@608 PS2, Line 608: if (cctz_utc_to_unix_data.get_result() != glibc_utc_to_unix_data.get_result()) { : cerr << "cctz/glibc utc_to_unix results do not match!" << endl; : return 1; : } : if (boost_utc_to_unix_data.get_result() != glibc_utc_to_unix_data.get_result()) { : cerr << "boost/glibc utc_to_unix results do not match!" << endl; : return 1; : } > The other benchmarks don't need this validity check? Done (although passing a vector of TestData to the helper function was not feasible as the different TestData classes are instantiated with a different converter functions, therefore they are not "compatible") http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h File be/src/exec/data-source-scan-node.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h@100 PS1, Line 100: Status MaterializeNextRow(RuntimeState* state, MemPool* mem_pool, Tuple* tuple); > What do you think about passing cctz::time_zone* instead of RuntimeState*? Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h File be/src/exprs/decimal-operators.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h@168 PS2, Line 168: /// local time in 'local_tz' time-zone). Rounds instead of truncating if 'round' is true. > nit: long line Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc@6398 PS1, Line 6398: const char* local_tz_name = "PST8PDT"; : ScopedTimeZoneOverride time_zone(local_tz_name); : const cctz::time_zone* local_tz = TimezoneDatabase::FindTimezone(local_tz_name); : DCHECK(local_tz != nullptr); > Have you considered moving this to a function or macro or such? As I see yo Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@63 PS1, Line 63: if (UNLIKELY(timezone == nullptr)) { > This could be UNLIKELY. Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@73 PS1, Line 73: from_cs > Might be just me but I don't really find the names from_cs, to_cs and from_ Simplified the logic here a bit, so now we only have 'from_tp' and 'to_cs' . I also specified the concrete types instead of just using 'auto'. Hope it makes the code easier to understand. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@76 PS1, Line 76: auto from_tp = cctz::convert(from_cs, TimezoneDatabase::GetUtcTimezone()); : auto to_cs = cctz::convert(from_tp, *timezone); > I think it would worth writing a comment why the two cctz::convert() calls Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@79 PS1, Line 79: // Check if resulting timestamp is within range > In my opinion this comment doesn't add extra value as the name of the funct Removed it. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@113 PS1, Line 113: context->AddWarning(ss.str().c_str()); : return ts_val; : } : > this seems duplicate code (TimestampFunctions::FromUtc) Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@140 PS1, Line 140: context->AddWarning(msg.c_str()); : return TimestampVal::null(); : } : : // Create 'return_date' and 'return_time' from 'to_cs'. > I think this conversion could go to a function as it seems duplicate for me Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc@71 PS2, Line 71: const boost::gregorian::date& d = ts_value.date(); : const boost::posix_time::time_duration& t = ts_value.time(); : const cctz::civil_second from_cs(d.year(), d.month(), d.day(), t.hours(), t.minutes(), : t.seconds()); : : auto from_tp = cctz::convert(from_cs, TimezoneDatabase::GetUtcTimezone()); > Does cctz offer a function to create timepoint from unix time_t? If yes, th Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc@122 PS2, Line 122: or > Nit: of Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@21 PS2, Line 21: boost > I think we should use the unordered_map class from the C++ STL. Impala uses both implementations. I've switched to using std::unordered_map here. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@31 PS2, Line 31: class TimezoneDatabase { > My general feeling about this class is that a bit more transparency would a I've added more details. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@56 PS2, Line 56: TZ_MAP > TimezoneMap? I don't think a typename should be all capitals. Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@21 PS1, Line 21: #include <boost/unordered_map.hpp> > shouldn't we use the one from std? Both implementations are used in Impala. Switched to using std::unordered_map here. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@35 PS1, Line 35: > string& GetPath() ? Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@47 PS1, Line 47: } > FindTimezone() returns pointer while GetUtcTimezone() returns reference to GetUtcTimezone() is guaranteed to succeed, whereas FindTimezone() doesn't and it has to signal failure somehow. It could return bool instead, but I don't think that would help matters as the time_zone out parameter would still have to be a pointer. Maybe we can name it 'FindTimezonePtr' instead? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@54 PS1, Line 54: static const cctz::time_zone UTC_TIMEZONE_; > Could you add a comment what the string param is used for? Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@62 PS1, Line 62: /// location. > As I see you wrote comments for these function in the .cc file. Could you m Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@10 PS1, Line 10: // > Unrelated to your change but shouldn't we replace this to the Apache header What do you mean? Checked some random files under exprs directory and they all appear to use the same header. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@103 PS1, Line 103: // Returns 'true' if path 'a' starts with path 'b'. If 'relative' is not nullptr, it will > FileSystemUtil would be a better place for this, I think. Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@105 PS1, Line 105: PathStartsWith > I have the feeling that this function should only decide if path 'b' is the Implemented FileSystemUtil::IsPrefixPath() and FileSystemUtil::GetRelativePath() functions instead to make the functionality and usage straightforward. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@105 PS1, Line 105: string *relative > For me the name of this variable doesn't indicate it's purpose. Can you nam Moved the functionality to FileSystemUtil::GetRelativePath() and fixed the comment. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@108 PS1, Line 108: b.length() + 1 < a.length() > what if a==b. In theory then a starts with b, still this returns false. Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@118 PS1, Line 118: // with an uppercase letter. > Could you mention examples here that contain the mentioned allowed chars? Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@119 PS1, Line 119: bool IsTimezoneNameSegmentValid(const string& tz_seg) { > Shouldn't this be part of TimezoneDatabase or some other timezone related h Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@121 PS1, Line 121: find_if( > Tricky :) Changed the function to use regex. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@132 PS1, Line 132: // time-zone name segments delimited by '/'. > Could you mention one input example in the comment? Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@133 PS1, Line 133: bool IsTimezoneNameValid(const string& tz_name) { > Shouldn't this be part of TimezoneDatabase or some other timezone related h Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@136 PS1, Line 136: while (end != string::npos) { > Wouldn't it be easier to verify this with a regex and get rid of this while Changed the function to use regex. We still need IsTimezoneNameSegmentValid() though. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@147 PS1, Line 147: bool IsTimezoneOffsetValid(const string& tz_offset, int64_t* offset_sec) { > Shouldn't this be part of TimezoneDatabase or some other timezone related h Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@147 PS1, Line 147: bool > can you return int64_t* and return nullptr in case the offset is not valid? Returning a pointer just to be able to signal failure with nullptr seems contrived. I think this interface is cleaner. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@160 PS1, Line 160: // The implementation here was adapted from > I again feel here that this function serves 2 purposes instead of a clear o Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@251 PS1, Line 251: // mkdtemp operates in place, so we need a mutable array. > Can you move the comment to the header? Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@43 PS2, Line 43: DEFINE_string(hdfs_zoneinfo_dir, "", : "HDFS/S3A/ADLS path to load IANA time-zone database from."); : DEFINE_string(hdfs_zoneabbrev_config, "", : "HDFS/S3A/ADLS path to config file defining non-standard time-zone abbreviations."); > It will be a bit tricky, but these should be tested somehow. Playing with t Added an e2e test: tests/custom_cluster/custom_tzdb.py http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@419 PS2, Line 419: Status TimezoneDatabase::LoadZoneAbbreviations(istream &is, > At least basic testing should be added to check that fix and non-fix abbrev Done (see timezone_db-test.cc) http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@420 PS2, Line 420: /* = nullptr */ > drop this comment Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@495 PS2, Line 495: ZONEINFO_DIR > nit: ZONE_INFO_DIR? Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.h@101 PS2, Line 101: static TimestampValue FromUnixTime(time_t unix_time, const cctz::time_zone* local_tz) { > Do you think that mentioning the new param for these functions would add ex I've fixed the comments here and below. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.cc@135 PS2, Line 135: auto from_tp = FromUnixSeconds(unix_time); : auto to_cs = cctz::convert(from_tp, *local_tz > This would be a big change, but I would think about "auto" types - do they Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.inline.h@98 PS2, Line 98: if (UNLIKELY(!HasDateAndTime())) return false; : cctz::civil_second cs(date_.year(), date_.month(), date_.day(), time_.hours(), : time_.minutes(), time_.seconds()); : auto tp = cctz::convert(cs, : FLAGS_use_local_tz_for_unix_timestamp_conversions ? *local_tz : : TimezoneDatabase::GetUtcTimezone()); > If FLAGS_use_local_tz_for_unix_timestamp_conversions is false, then we coul Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h File be/src/util/filesystem-util.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h@57 PS2, Line 57: static Status GetRealPath( > Shouldn't you mentioned that this should be called on sym links? (or do I m No, it's not just for symlinks. Fixed the comment about 'real_path'. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h@60 PS2, Line 60: Is it is > nit: if it is Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/hdfs-util.h File be/src/util/hdfs-util.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/hdfs-util.h@59 PS2, Line 59: /// Returns basename of 'path'. > Could you add an example to the comment? Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.h File be/src/util/time.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.h@123 PS2, Line 123: /// Converts input microseconds-since-epoch to date-time string in 'tz' time zone. > Could you mention 'p' as well? Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/util/time.h File be/src/util/time.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/util/time.h@30 PS1, Line 30: > What do you think add about adding a typedef for cctz::timezone? I think th Done (see be/src/common/global-types.h) http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.cc File be/src/util/time.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.cc@167 PS2, Line 167: const char* fmt = (p == TimePrecision::Millisecond) ? fmt_millisec : > For me this 3 layers of embedded ternary operators isn't that readable. Wha Done -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 2 Gerrit-Owner: Attila Jeges <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 03 May 2018 17:57:37 +0000 Gerrit-HasComments: Yes
