Alex Behm has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet ......................................................................
Patch Set 3: (39 comments) I've gone through the code. Still need to go through the tests. http://gerrit.cloudera.org:8080/#/c/5939/3//COMMIT_MSG Commit Message: Line 14: values from Parquet files written by Hive, and vice versa. I don't think the "vice versa" part is correct. My understanding is that Hive does not adjust data written by Impala, so it reads the data correctly. Line 17: Impala reads Parquet MR timestamp data and adjust values using a time typo: adjusts Line 20: applied to data written by Impala. Mention interaction with --convert_legacy_hive_parquet_utc_timestamps Line 23: global flag --prevent_parquet_mr_zone_adjustment is set to true. Not sure if the name of this flag was already decided upon, but imo, it's not very clear what it does from the description. As an alternative, I'd propose something like: --set_parquet_mr_int96_write_zone It's a complex issue, so I'd prefer to be more explicit about what the flag does as opposed to which behavior the flag is supposed to induce across components. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 242: timezone_ = TimezoneDatabase::FindTimezone( What if timezone_ is NULL here? Line 244: } catch (std::exception& from_boost) { Does FindTimezone() really throw? We're not handling that in other places. Line 249: } DCHECK that is_timestamp_dependent_timezone_ is true xor timezone_ is set Line 554: boost::local_time::time_zone_ptr timezone_; initialize these new members in the constructor Line 608: DCHECK(parent_->scan_node_->parquet_mr_time_zone() != "UTC"); DCHECK_NE Line 611: Status status; Avoid creating/destroying this status even if no error happens (a possible perf issue). I'd suggest you move the parent_->parse_status_ = status; directly into the error branches and return false from them directly. Line 628: } else if (LIKELY(timezone_ != NULL)) { swap this case with the else if above Line 636: } else { DCHECK(false) << add error msg here } http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 194: for (vector<string>::const_iterator iter = tz_region_list_.begin(); please convert to for-each while you are here http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: Line 37: /// Converts the name of a timezone to a boost timezone object. Not your change, but comment on what happens when the timezone was not found in the database. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 111: time_zone_ptr timezone = TimezoneDatabase::FindTimezone(timezone_str, *this); Can timezone be NULL? Line 113: } catch (std::exception& from_boost) { Which function can throw? Line 125: local_date_time lt(temp, timezone); I have a feeling boost might throw here if we give it funny input. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 199: bool UtcToLocal(); comment on return values (and functions below) http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/fe-support.cc File be/src/service/fe-support.cc: Line 354: JNIEXPORT jbyteArray JNICALL Why even pass and return thrift structs? You can just take a string and return a boolean. Line 365: ABORT_IF_ERROR(TimezoneDatabase::Initialize()); Are you sure we need this? I see catalogd-main.cc calling InitCommonRuntime() which initializes the TZ DB. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 123: "Impala, Hive, and Spark to not apply timestamp timezone adjustments for parquet " Mention that this property also makes Hive write in UTC. http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/util/backend-gflag-util.cc File be/src/util/backend-gflag-util.cc: Line 55: cfg.__set_assign_utc_to_parquet_mr_timezone_prop( let's keep the name consistent across various uses http://gerrit.cloudera.org:8080/#/c/5939/3/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: Line 215: // Specifies a time zone for adjusting timestamp values read from Parquet MR files. This ... from Parquet files written by parquet-mr. Line 216: // is used for a Hive compatibilty fix. Mention that this comes from the table property http://gerrit.cloudera.org:8080/#/c/5939/3/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: Line 316: ("PARQUET_MR_TIMEZONE_INVALID", 102, "Invalid timezone '$0' specified in " Why even mention the file name? Seems better to mention the table name. Line 320: "'$1' for parquet file '$2'."), Parquet (capital P) http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java File fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java: Line 166: public static void analyzeParquetMRTimeZone(Map<String, String> tblProperties) analyzeParquetMrTimeZone() http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java: Line 67: analyzeParquetMRTimeZone(); Seems like the wrong place to validate this. Why would we validate it for every query? http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: Line 184: if (getTblProperties().containsKey(HdfsTable.TBL_PROP_PARQUET_MR_TIME_ZONE)) { This should be an error for non-HDFS tables. Line 189: getTblProperties().put(HdfsTable.TBL_PROP_PARQUET_MR_TIME_ZONE, "UTC"); We should only do this for HDFS tables right? http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: Line 121: public static final String TBL_PROP_PARQUET_MR_TIME_ZONE = TBL_PROP_PARQUET_MR_WRITE_ZONE let's keep the names as similar as possible, otherwise it gets confusing pretty quickly Line 1381: private boolean hasParquetMRTimeZone() { hasParquetMrTimeZone() (and change MR to Mr elsewhere) Line 1382: String key = TBL_PROP_PARQUET_MR_TIME_ZONE; inline into the containsKey() call in L1385 Line 1400: * The caller must ensure that the property is contained in the 'tblProperties' map. If Why even pass the map then? Seems cleaner to pass the TZ value Line 1406: String key = TBL_PROP_PARQUET_MR_TIME_ZONE; inline Line 1409: boolean is_timezone_valid; Java style Line 1413: is_timezone_valid = false; log the exception stack - this is definitely unexpected behavior, and we should not just swallow it http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 184: parquetMRTimeZone_ = tbl_.validateParquetMRTimeZone(error); let's not validate this for every scan node / query http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/service/BackendConfig.java File fe/src/main/java/org/apache/impala/service/BackendConfig.java: Line 57: public boolean isAssignUtcToParquetMRTimeZoneProp() { keep names as consistent as possible (i.e. ideally this should reflect the name of the FLAGS in the BE) -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky <[email protected]> Gerrit-Reviewer: Zoltan Ivanfi <[email protected]> Gerrit-HasComments: Yes
