Zoltan Ivanfi has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
......................................................................


Patch Set 2:

(11 comments)

Thanks for tackling this issue! I raised a few concerns but in general it looks 
good.

http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 237:         timezone_ = TimezoneDatabase::FindTimezone(
We should first check for TimezoneDatabase::IsTimestampDependentTimezone, 
otherwise there is no point in stroing a timezone_ that we won't use.


Line 240:         timezone_ = NULL;
This exception should not be swallowed. If we can not convert to the desired 
time zone properly, we should show an error message to the user.


Line 543:   /// If not NULL, TIMESTAMP column readers require conversion to 
this time zone.
Although technically correct, this comment is a bit misleading. One may easily 
interpret the if as an iff ("if and only if"). I would suggest to document its 
purpose instead: to cache the return value of TimezoneDatabase::FindTimezone in 
order to avoid repeated calls to it.


Line 593:   if (LIKELY(dst->HasDateAndTime())) {
DCHECK(parent_->scan_node_->parquet_mr_time_zone() != "UTC")


Line 597:       dst->FromUtc(parent_->scan_node_->parquet_mr_time_zone());
This is somewhat confusing, as it seems as if the first two cases did the same 
thing: convert from UTC to the parent_->scan_node_->parquet_mr_time_zone() with 
the only exception that in the second case we use the timezone that we saved 
into the timezone_ field earlier, but it's value came from the same 
parent_->scan_node_->parquet_mr_time_zone() value.

In reality though, the parameter of the first call is a string, while the 
parameter of the second one is a timezone, which leads to different behaviors. 
Since its very hard to notice this difference, please add comments to describe 
what's happening.


http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 189: time_zone_ptr TimezoneDatabase::FindTimezone(const string& tz) {
This should be a private helper function with two callers:

- FindTimezone(const string& tz, const TimestampValue& tv) as you already do it
- FindTimezone(const string& tz) which should first ensure that 
!TimezoneDatabase::IsTimestampDependentTimezone(tz)


http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 109:     // If a table has an invalid timezone, it should not have been 
loaded.
How does this comment apply here? Was it supposed to be before the catch block?


Line 112:     *this = ptime(not_a_date_time);
Again, swallowing this exception can be a problem as end-users may not notice 
that they don't see all data correctly. I understand that there is a check 
higher in the hierarchy, but then the try-catch is unnecessary. If we have it 
regardless, we should handle it according to the desired output: an error 
presented to the user.


http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 201:   /// Converts from UTC to local time in the given timezone.
Please document the difference that one can be timestamp-dependent while the 
other not.


http://gerrit.cloudera.org:8080/#/c/5939/2/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

Line 352: extern "C"
Please document the purpose of this call (allowing the Java code to check 
whether a timezone is valid in the C code).


http://gerrit.cloudera.org:8080/#/c/5939/2/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
File 
fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java:

Line 182:             "'parquet.mr.int96.write.zone' is only supported for HDFS 
tables."));
Why? Parquet files should be handled this way regardless of using HDFS or not.


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes

Reply via email to