Dan Hecht has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet ......................................................................
Patch Set 7: (20 comments) http://gerrit.cloudera.org:8080/#/c/5939/7//COMMIT_MSG Commit Message: PS7, Line 22: New tables created by Impala will set the table property to UTC if the : global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is : set to true. I don't follow this given the next two paragraphs. This isn't true for the "LIKE <other table>" case, right? http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: Line 305: // parquet-mr. FE guarantees that this is a valid time zone. from the code, it looks like this is set to the empty string if conversion should not occur? if that's true, please document that case. http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 234: !parent_->scan_node_->parquet_mr_write_zone().empty() given that parquet_mr_write_zone() will take precedence over the flag, I think it would be clearer to reverse the order of these two conditionals. i.e. we don't care what the value of the flag is if the timestamp is set, right? (I realize it makes no difference functionally, but seems more consistent with the intent). PS7, Line 246: DCHECK nit: DCHECK_EQ() PS7, Line 609: parent_->scan_node_->parquet_mr_write_zone().empty() since this code is very performance critical, it would be best to order this in the order we want to optimize for. i.e. first check timezone_, then check is_timestamp_dependent_timezone_ and then only if neither is set, check parquet_mr_write_zone().empty() (which is also probably the most expensive check). i.e. move this condition to the end of the if-else-if chain. And then both of parquet_mr_write_zone().empty() and FLAGS_convert_legacy_hive_parquet_utc_timestamps must be true, right? (or else we wouldn't have needs_conversion_ set). And so you can just DCHECK() them. PS7, Line 611: !dst->UtcToLocal()) these could all be UNLIKELY since we want to optimize for the non-error case. PS7, Line 625: !dst->FromUtc(timezone_) i.e. here too and below. PS7, Line 639: ErrorMsg msg(TErrorCode::PARQUET_MR_TIMESTAMP_CONVERSION_FAILED, : src->DebugString(), parent_->scan_node_->parquet_mr_write_zone(), : parent_->scan_node_->hdfs_table()->fully_qualified_name()); : Status status = parent_->state_->LogOrReturnError(msg); : if (!status.ok()) { : parent_->parse_status_ = status; : return false; : } seems worth factoring this block (which occurs 3 times) into a non-inlined subroutine, which will help avoid I-cache pollution (and also make the code more readable). http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 193: could you add a comment explaining these two lookup techniques? http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: PS7, Line 45: for the Moscow timezone how about drawing the connection with the next function directly: (May not work correctly when IsTimestampDependentTimezone(tz), e.g. Moscow timezone). Line 49: static bool IsTimestampDependentTimezone(const std::string& tz); comment this. http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 101: LOG(ERROR) << from_boost.what(); this might produce a ton of logging. instead, why not return the Status object from these, and include the what() in the details? then the error message deduping code in LogError() will take care of throttling it. Line 109: DCHECK(HasDateAndTime()); how do we guarantee that this is true? did the caller already check that? PS7, Line 111: timezone == NULL UNLIKELY (since perf critical) Line 123: local_date_time lt(temp, timezone); see the comment at line 77. i think we were avoiding this for perf reasons. did you measure the perf of this compared to UtcToLocal()? http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: PS7, Line 197: local time below you say "local time in the given timezone", which I guess is technically correct but then makes this statement confusing (becuase here the comment meant "local timezone"). So, how about either removing the words "local time in the" below, or change this comment to say "local time in the system timezone" PS7, Line 203: local time in the given timezone it's not clear that "convert" means that *this is modified in-place, so I think you should add that back in like the UtcToLocal() comment has. http://gerrit.cloudera.org:8080/#/c/5939/7/be/src/service/impala-server.cc File be/src/service/impala-server.cc: PS7, Line 122: new is that always the case? PS7, Line 122: This changes " : "the behavior of recent versions of Hive and Spark as well. The writing of " : "timestamps is affected in Hive and Spark but not in Impala. The reading of " : "timestamps that were written by Hive, Spark, or any other parquet-mr writer is " : "affected in Hive, Spark and Impala. I'm not sure that this makes sense on it's own. I think you kind of have to read the documentation no matter what, so I don't think we should include this explanation here. http://gerrit.cloudera.org:8080/#/c/5939/7/common/thrift/BackendGflags.thrift File common/thrift/BackendGflags.thrift: PS7, Line 58: all is that true? what about "LIKE table" case? -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky <[email protected]> Gerrit-Reviewer: Zoltan Ivanfi <[email protected]> Gerrit-HasComments: Yes
