Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
......................................................................


Patch Set 2:

(55 comments)

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc
File be/src/benchmarks/convert-timestamp-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@150
PS2, Line 150: void AddTestDataDateTimes(vector<TimestampValue>& data, int n, 
const string& startstr) {
> Since it's in a benchmark it doesn't really matter, anyway, some nit commen
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@172
PS2, Line 172: TestData
> This looks like a fairly general class to me that could move to util/benchm
'measure_multithreaded_elapsed_time' function is not that general, it is used 
here as a quick and dirty way to verify that glibc calls are executed in a 
serial fashion even in a multithreaded environment.  Because of that I would 
like to keep this class here.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@174
PS2, Line 174: > >
> Nit: since C++11 you don't need to put spaces between right angle brackets.
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@228
PS2, Line 228: const shared_ptr<vector<FROM> > data_
> Nit: I think using 'const vector<FROM>&' would be simpler. I don't really s
True, I rewrote this portion of the code so many times that I lost track of 
where I was going with it :)


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@319
PS2, Line 319:     boost_throw_if_date_out_of_range(local_time.date());
> Is this function call really needed here? Don't we trust boost that it vali
This is how the function was implemented originally. Since the point of this 
benchmark program is to compare the old implementation with the new one, I 
figured I shouldn't change the old code.

I think the call is necessary, because boost might not validate the date range 
until you call the gregorian::date accessors. It is confusing.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@444
PS2, Line 444:     time_t utc =
> We could replace this with boost_utc_to_unix_time. This conversion should b
This is how UtcToLocal was implemented originally. The goal of this benchmark 
program is to compare the original implementation with the new one (including 
the glue code).


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@530
PS2, Line 530: //
             : // Test UnixTimeToUtcPtime (boost is expected to be faster than 
CCTZ)
             : //
             :
             : // boost
             : boost::pos
> I think that this a bit misleading, as boost_unix_time_to_utc_ptime never u
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@608
PS2, Line 608:   if (cctz_utc_to_unix_data.get_result() != 
glibc_utc_to_unix_data.get_result()) {
             :     cerr << "cctz/glibc utc_to_unix results do not match!" << 
endl;
             :     return 1;
             :   }
             :   if (boost_utc_to_unix_data.get_result() != 
glibc_utc_to_unix_data.get_result()) {
             :     cerr << "boost/glibc utc_to_unix results do not match!" << 
endl;
             :     return 1;
             :   }
> The other benchmarks don't need this validity check?
Done (although passing a vector of TestData to the helper function was not 
feasible as the different TestData classes are instantiated with a different 
converter functions, therefore they are not "compatible")


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h
File be/src/exec/data-source-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h@100
PS1, Line 100:   Status MaterializeNextRow(RuntimeState* state, MemPool* 
mem_pool, Tuple* tuple);
> What do you think about passing cctz::time_zone* instead of RuntimeState*?
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h
File be/src/exprs/decimal-operators.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h@168
PS2, Line 168:   /// local time in 'local_tz' time-zone). Rounds instead of 
truncating if 'round' is true.
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc@6398
PS1, Line 6398: const char* local_tz_name = "PST8PDT";
              :     ScopedTimeZoneOverride time_zone(local_tz_name);
              :     const cctz::time_zone* local_tz = 
TimezoneDatabase::FindTimezone(local_tz_name);
              :     DCHECK(local_tz != nullptr);
> Have you considered moving this to a function or macro or such? As I see yo
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@63
PS1, Line 63:   if (UNLIKELY(timezone == nullptr)) {
> This could be UNLIKELY.
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@73
PS1, Line 73: from_cs
> Might be just me but I don't really find the names from_cs, to_cs and from_
Simplified the logic here a bit, so now we only have 'from_tp' and 'to_cs' . I 
also specified the concrete types instead of just using 'auto'. Hope it makes 
the code easier to understand.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@76
PS1, Line 76: auto from_tp = cctz::convert(from_cs, 
TimezoneDatabase::GetUtcTimezone());
            :   auto to_cs = cctz::convert(from_tp, *timezone);
> I think it would worth writing a comment why the two cctz::convert() calls
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@79
PS1, Line 79:   // Check if resulting timestamp is within range
> In my opinion this comment doesn't add extra value as the name of the funct
Removed it.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@113
PS1, Line 113:   context->AddWarning(ss.str().c_str());
             :     return ts_val;
             :   }
             :
> this seems duplicate code (TimestampFunctions::FromUtc)
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@140
PS1, Line 140:   context->AddWarning(msg.c_str());
             :     return TimestampVal::null();
             :   }
             :
             :   // Create 'return_date' and 'return_time' from 'to_cs'.
> I think this conversion could go to a function as it seems duplicate for me
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc@71
PS2, Line 71:   const boost::gregorian::date& d = ts_value.date();
            :   const boost::posix_time::time_duration& t = ts_value.time();
            :   const cctz::civil_second from_cs(d.year(), d.month(), d.day(), 
t.hours(), t.minutes(),
            :       t.seconds());
            :
            :   auto from_tp = cctz::convert(from_cs, 
TimezoneDatabase::GetUtcTimezone());
> Does cctz offer a function to create timepoint from unix time_t? If yes, th
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc@122
PS2, Line 122: or
> Nit: of
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@21
PS2, Line 21: boost
> I think we should use the unordered_map class from the C++ STL.
Impala uses both implementations. I've switched to using std::unordered_map 
here.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@31
PS2, Line 31: class TimezoneDatabase {
> My general feeling about this class is that a bit more transparency would a
I've added more details.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@56
PS2, Line 56: TZ_MAP
> TimezoneMap? I don't think a typename should be all capitals.
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@21
PS1, Line 21: #include <boost/unordered_map.hpp>
> shouldn't we use the one from std?
Both implementations are used in Impala. Switched to using std::unordered_map 
here.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@35
PS1, Line 35:
> string& GetPath() ?
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@47
PS1, Line 47:   }
> FindTimezone() returns pointer while GetUtcTimezone() returns reference to
GetUtcTimezone() is guaranteed to succeed, whereas FindTimezone() doesn't and 
it has to signal failure somehow. It could return bool instead, but I don't 
think that would help matters as the time_zone out parameter would still have 
to be a pointer. Maybe we can name it 'FindTimezonePtr' instead?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@54
PS1, Line 54:   static const cctz::time_zone UTC_TIMEZONE_;
> Could you add a comment what the string param is used for?
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@62
PS1, Line 62:   /// location.
> As I see you wrote comments for these function in the .cc file. Could you m
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@10
PS1, Line 10: //
> Unrelated to your change but shouldn't we replace this to the Apache header
What do you mean? Checked some random files under exprs directory and they all 
appear to use the same header.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@103
PS1, Line 103: // Returns 'true' if path 'a' starts with path 'b'. If 
'relative' is not nullptr, it will
> FileSystemUtil would be a better place for this, I think.
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@105
PS1, Line 105: PathStartsWith
> I have the feeling that this function should only decide if path 'b' is the
Implemented FileSystemUtil::IsPrefixPath() and 
FileSystemUtil::GetRelativePath() functions instead to make the functionality 
and usage straightforward.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@105
PS1, Line 105: string *relative
> For me the name of this variable doesn't indicate it's purpose. Can you nam
Moved the functionality to FileSystemUtil::GetRelativePath() and fixed the 
comment.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@108
PS1, Line 108: b.length() + 1 < a.length()
> what if a==b. In theory then a starts with b, still this returns false.
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@118
PS1, Line 118: // with an uppercase letter.
> Could you mention examples here that contain the mentioned allowed chars?
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@119
PS1, Line 119: bool IsTimezoneNameSegmentValid(const string& tz_seg) {
> Shouldn't this be part of TimezoneDatabase or some other timezone related h
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@121
PS1, Line 121:       find_if(
> Tricky :)
Changed the function to use regex.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@132
PS1, Line 132: // time-zone name segments delimited by '/'.
> Could you mention one input example in the comment?
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@133
PS1, Line 133: bool IsTimezoneNameValid(const string& tz_name) {
> Shouldn't this be part of TimezoneDatabase or some other timezone related h
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@136
PS1, Line 136:   while (end != string::npos) {
> Wouldn't it be easier to verify this with a regex and get rid of this while
Changed the function to use regex. We still need IsTimezoneNameSegmentValid() 
though.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@147
PS1, Line 147: bool IsTimezoneOffsetValid(const string& tz_offset, int64_t* 
offset_sec) {
> Shouldn't this be part of TimezoneDatabase or some other timezone related h
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@147
PS1, Line 147: bool
> can you return int64_t* and return nullptr in case the offset is not valid?
Returning a pointer just to be able to signal failure with nullptr seems 
contrived. I think this interface is cleaner.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@160
PS1, Line 160: // The implementation here was adapted from
> I again feel here that this function serves 2 purposes instead of a clear o
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@251
PS1, Line 251:   // mkdtemp operates in place, so we need a mutable array.
> Can you move the comment to the header?
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@43
PS2, Line 43: DEFINE_string(hdfs_zoneinfo_dir, "",
            :     "HDFS/S3A/ADLS path to load IANA time-zone database from.");
            : DEFINE_string(hdfs_zoneabbrev_config, "",
            :     "HDFS/S3A/ADLS path to config file defining non-standard 
time-zone abbreviations.");
> It will be a bit tricky, but these should be tested somehow. Playing with t
Added an e2e test: tests/custom_cluster/custom_tzdb.py


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@419
PS2, Line 419: Status TimezoneDatabase::LoadZoneAbbreviations(istream &is,
> At least basic testing should be added to check that fix and non-fix abbrev
Done (see timezone_db-test.cc)


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@420
PS2, Line 420: /* = nullptr */
> drop this comment
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@495
PS2, Line 495: ZONEINFO_DIR
> nit: ZONE_INFO_DIR?
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.h@101
PS2, Line 101:   static TimestampValue FromUnixTime(time_t unix_time, const 
cctz::time_zone* local_tz) {
> Do you think that mentioning the new param for these functions would add ex
I've fixed the comments here and below.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.cc@135
PS2, Line 135:   auto from_tp = FromUnixSeconds(unix_time);
             :   auto to_cs = cctz::convert(from_tp, *local_tz
> This would be a big change, but I would think about "auto" types - do they
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.inline.h@98
PS2, Line 98:   if (UNLIKELY(!HasDateAndTime())) return false;
            :   cctz::civil_second cs(date_.year(), date_.month(), date_.day(), 
time_.hours(),
            :       time_.minutes(), time_.seconds());
            :   auto tp = cctz::convert(cs,
            :       FLAGS_use_local_tz_for_unix_timestamp_conversions ? 
*local_tz :
            :       TimezoneDatabase::GetUtcTimezone());
> If FLAGS_use_local_tz_for_unix_timestamp_conversions is false, then we coul
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h
File be/src/util/filesystem-util.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h@57
PS2, Line 57:   static Status GetRealPath(
> Shouldn't you mentioned that this should be called on sym links? (or do I m
No, it's not just for symlinks. Fixed the comment about 'real_path'.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h@60
PS2, Line 60: Is it is
> nit: if it is
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/hdfs-util.h
File be/src/util/hdfs-util.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/hdfs-util.h@59
PS2, Line 59: /// Returns basename of 'path'.
> Could you add an example to the comment?
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.h
File be/src/util/time.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.h@123
PS2, Line 123: /// Converts input microseconds-since-epoch to date-time string 
in 'tz' time zone.
> Could you mention 'p' as well?
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/util/time.h
File be/src/util/time.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/util/time.h@30
PS1, Line 30:
> What do you think add about adding a typedef for cctz::timezone? I think th
Done (see be/src/common/global-types.h)


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.cc
File be/src/util/time.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.cc@167
PS2, Line 167:   const char* fmt = (p == TimePrecision::Millisecond) ? 
fmt_millisec :
> For me this 3 layers of embedded ternary operators isn't that readable. Wha
Done



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 03 May 2018 17:57:37 +0000
Gerrit-HasComments: Yes

Reply via email to