Zoltan Ivanfi has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet ......................................................................
Patch Set 2: (11 comments) Thanks for tackling this issue! I raised a few concerns but in general it looks good. http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 237: timezone_ = TimezoneDatabase::FindTimezone( We should first check for TimezoneDatabase::IsTimestampDependentTimezone, otherwise there is no point in stroing a timezone_ that we won't use. Line 240: timezone_ = NULL; This exception should not be swallowed. If we can not convert to the desired time zone properly, we should show an error message to the user. Line 543: /// If not NULL, TIMESTAMP column readers require conversion to this time zone. Although technically correct, this comment is a bit misleading. One may easily interpret the if as an iff ("if and only if"). I would suggest to document its purpose instead: to cache the return value of TimezoneDatabase::FindTimezone in order to avoid repeated calls to it. Line 593: if (LIKELY(dst->HasDateAndTime())) { DCHECK(parent_->scan_node_->parquet_mr_time_zone() != "UTC") Line 597: dst->FromUtc(parent_->scan_node_->parquet_mr_time_zone()); This is somewhat confusing, as it seems as if the first two cases did the same thing: convert from UTC to the parent_->scan_node_->parquet_mr_time_zone() with the only exception that in the second case we use the timezone that we saved into the timezone_ field earlier, but it's value came from the same parent_->scan_node_->parquet_mr_time_zone() value. In reality though, the parameter of the first call is a string, while the parameter of the second one is a timezone, which leads to different behaviors. Since its very hard to notice this difference, please add comments to describe what's happening. http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 189: time_zone_ptr TimezoneDatabase::FindTimezone(const string& tz) { This should be a private helper function with two callers: - FindTimezone(const string& tz, const TimestampValue& tv) as you already do it - FindTimezone(const string& tz) which should first ensure that !TimezoneDatabase::IsTimestampDependentTimezone(tz) http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 109: // If a table has an invalid timezone, it should not have been loaded. How does this comment apply here? Was it supposed to be before the catch block? Line 112: *this = ptime(not_a_date_time); Again, swallowing this exception can be a problem as end-users may not notice that they don't see all data correctly. I understand that there is a check higher in the hierarchy, but then the try-catch is unnecessary. If we have it regardless, we should handle it according to the desired output: an error presented to the user. http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 201: /// Converts from UTC to local time in the given timezone. Please document the difference that one can be timestamp-dependent while the other not. http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/service/fe-support.cc File be/src/service/fe-support.cc: Line 352: extern "C" Please document the purpose of this call (allowing the Java code to check whether a timezone is valid in the C code). http://gerrit.cloudera.org:8080/#/c/5939/2/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java File fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java: Line 182: "'parquet.mr.int96.write.zone' is only supported for HDFS tables.")); Why? Parquet files should be handled this way regardless of using HDFS or not. -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky <[email protected]> Gerrit-Reviewer: Zoltan Ivanfi <[email protected]> Gerrit-HasComments: Yes
