Dan Hecht has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
......................................................................


Patch Set 7:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/5939/7//COMMIT_MSG
Commit Message:

PS7, Line 22: New tables created by Impala will set the table property to UTC 
if the
            : global flag 
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
            : set to true.
I don't follow this given the next two paragraphs. This isn't true for the 
"LIKE <other table>" case, right?


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 305:   // parquet-mr. FE guarantees that this is a valid time zone.
from the code, it looks like this is set to the empty string if conversion 
should not occur? if that's true, please document that case.


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 234:             !parent_->scan_node_->parquet_mr_write_zone().empty()
given that parquet_mr_write_zone() will take precedence over the flag, I think 
it would be clearer to reverse the order of these two conditionals. i.e. we 
don't care what the value of the flag is if the timestamp is set, right? (I 
realize it makes no difference functionally, but seems more consistent with the 
intent).


PS7, Line 246: DCHECK
nit: DCHECK_EQ()


PS7, Line 609: parent_->scan_node_->parquet_mr_write_zone().empty()
since this code is very performance critical, it would be best to order this in 
the order we want to optimize for. i.e. first check timezone_, then check 
is_timestamp_dependent_timezone_ and then only if neither is set, check 
parquet_mr_write_zone().empty() (which is also probably the most expensive 
check). i.e. move this condition to the end of the if-else-if chain.  

And then both of parquet_mr_write_zone().empty() and 
FLAGS_convert_legacy_hive_parquet_utc_timestamps must be true, right? (or else 
we wouldn't have needs_conversion_ set). And so you can just DCHECK() them.


PS7, Line 611: !dst->UtcToLocal())
these could all be UNLIKELY since we want to optimize for the non-error case.


PS7, Line 625: !dst->FromUtc(timezone_)
i.e. here too and below.


PS7, Line 639:         ErrorMsg 
msg(TErrorCode::PARQUET_MR_TIMESTAMP_CONVERSION_FAILED,
             :             src->DebugString(), 
parent_->scan_node_->parquet_mr_write_zone(),
             :             
parent_->scan_node_->hdfs_table()->fully_qualified_name());
             :         Status status = parent_->state_->LogOrReturnError(msg);
             :         if (!status.ok()) {
             :           parent_->parse_status_ = status;
             :           return false;
             :         }
seems worth factoring this block (which occurs 3 times) into a non-inlined 
subroutine, which will help avoid I-cache pollution (and also make the code 
more readable).


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 193: 
could you add a comment explaining these two lookup techniques?


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

PS7, Line 45: for the Moscow timezone
how about drawing the connection with the next function directly:

(May not work correctly when IsTimestampDependentTimezone(tz), e.g. Moscow 
timezone).


Line 49:   static bool IsTimestampDependentTimezone(const std::string& tz);
comment this.


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 101:     LOG(ERROR) << from_boost.what();
this might produce a ton of logging.  instead, why not return the Status object 
from these, and include the what() in the details? then the error message 
deduping code in LogError() will take care of throttling it.


Line 109:   DCHECK(HasDateAndTime());
how do we guarantee that this is true? did the caller already check that?


PS7, Line 111: timezone == NULL
UNLIKELY (since perf critical)


Line 123:   local_date_time lt(temp, timezone);
see the comment at line 77. i think we were avoiding this for perf reasons. did 
you measure the perf of this compared to UtcToLocal()?


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

PS7, Line 197: local time
below you say "local time in the given timezone", which I guess is technically 
correct but then makes this statement confusing (becuase here the comment meant 
"local timezone").  So, how about either removing the words "local time in the" 
below, or change this comment to say "local time in the system timezone"


PS7, Line 203: local time in the given timezone
it's not clear that "convert" means that *this is modified in-place, so I think 
you should add that back in like the UtcToLocal() comment has.


http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

PS7, Line 122: new
is that always the case?


PS7, Line 122:  This changes "
             :     "the behavior of recent versions of Hive and Spark as well. 
The writing of "
             :     "timestamps is affected in Hive and Spark but not in Impala. 
The reading of "
             :     "timestamps that were written by Hive, Spark, or any other 
parquet-mr writer is "
             :     "affected in Hive, Spark and Impala. 
I'm not sure that this makes sense on it's own. I think you kind of have to 
read the documentation no matter what, so I don't think we should include this 
explanation here.


http://gerrit.cloudera.org:8080/#/c/5939/7/common/thrift/BackendGflags.thrift
File common/thrift/BackendGflags.thrift:

PS7, Line 58: all
is that true?  what about "LIKE table" case?


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes

Reply via email to