[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Alex Behm (Code Review) Wed, 01 Mar 2017 18:07:56 -0800

Alex Behm has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
......................................................................



Patch Set 3:

(39 comments)

I've gone through the code. Still need to go through the tests.

http://gerrit.cloudera.org:8080/#/c/5939/3//COMMIT_MSG
Commit Message:

Line 14: values from Parquet files written by Hive, and vice versa.
I don't think the "vice versa" part is correct. My understanding is that Hive 
does not adjust data written by Impala, so it reads the data correctly.


Line 17: Impala reads Parquet MR timestamp data and adjust values using a time
typo: adjusts


Line 20: applied to data written by Impala.
Mention interaction with --convert_legacy_hive_parquet_utc_timestamps


Line 23: global flag --prevent_parquet_mr_zone_adjustment is set to true.
Not sure if the name of this flag was already decided upon, but imo, it's not 
very clear what it does from the description.

As an alternative, I'd propose something like:
--set_parquet_mr_int96_write_zone

It's a complex issue, so I'd prefer to be more explicit about what the flag 
does as opposed to which behavior the flag is supposed to induce across 
components.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 242:           timezone_ = TimezoneDatabase::FindTimezone(
What if timezone_ is NULL here?


Line 244:         } catch (std::exception& from_boost) {
Does FindTimezone() really throw? We're not handling that in other places.


Line 249:     }
DCHECK that is_timestamp_dependent_timezone_ is true xor timezone_ is set


Line 554:   boost::local_time::time_zone_ptr timezone_;
initialize these new members in the constructor


Line 608:   DCHECK(parent_->scan_node_->parquet_mr_time_zone() != "UTC");
DCHECK_NE


Line 611:     Status status;
Avoid creating/destroying this status even if no error happens (a possible perf 
issue). I'd suggest you move the parent_->parse_status_ = status; directly into 
the error branches and return false from them directly.


Line 628:     } else if (LIKELY(timezone_ != NULL)) {
swap this case with the else if above


Line 636:     }
else {
  DCHECK(false) << add error msg here
}


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 194:   for (vector<string>::const_iterator iter = tz_region_list_.begin();
please convert to for-each while you are here


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

Line 37:   /// Converts the name of a timezone to a boost timezone object.
Not your change, but comment on what happens when the timezone was not found in 
the database.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 111:     time_zone_ptr timezone = 
TimezoneDatabase::FindTimezone(timezone_str, *this);
Can timezone be NULL?


Line 113:   } catch (std::exception& from_boost) {
Which function can throw?


Line 125:   local_date_time lt(temp, timezone);
I have a feeling boost might throw here if we give it funny input.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 199:   bool UtcToLocal();
comment on return values (and functions below)


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

Line 354: JNIEXPORT jbyteArray JNICALL
Why even pass and return thrift structs? You can just take a string and return 
a boolean.


Line 365:     ABORT_IF_ERROR(TimezoneDatabase::Initialize());
Are you sure we need this? I see catalogd-main.cc calling InitCommonRuntime() 
which initializes the TZ DB.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 123:     "Impala, Hive, and Spark to not apply timestamp timezone 
adjustments for parquet "
Mention that this property also makes Hive write in UTC.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/util/backend-gflag-util.cc
File be/src/util/backend-gflag-util.cc:

Line 55:   cfg.__set_assign_utc_to_parquet_mr_timezone_prop(
let's keep the name consistent across various uses


http://gerrit.cloudera.org:8080/#/c/5939/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 215:   // Specifies a time zone for adjusting timestamp values read from 
Parquet MR files. This
... from Parquet files written by parquet-mr.


Line 216:   // is used for a Hive compatibilty fix.
Mention that this comes from the table property


http://gerrit.cloudera.org:8080/#/c/5939/3/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

Line 316:   ("PARQUET_MR_TIMEZONE_INVALID", 102, "Invalid timezone '$0' 
specified in "
Why even mention the file name? Seems better to mention the table name.


Line 320:    "'$1' for parquet file '$2'."),
Parquet (capital P)


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
File 
fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java:

Line 166:   public static void analyzeParquetMRTimeZone(Map<String, String> 
tblProperties)
analyzeParquetMrTimeZone()


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java:

Line 67:     analyzeParquetMRTimeZone();
Seems like the wrong place to validate this. Why would we validate it for every 
query?


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

Line 184:     if 
(getTblProperties().containsKey(HdfsTable.TBL_PROP_PARQUET_MR_TIME_ZONE)) {
This should be an error for non-HDFS tables.


Line 189:       getTblProperties().put(HdfsTable.TBL_PROP_PARQUET_MR_TIME_ZONE, 
"UTC");
We should only do this for HDFS tables right?


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 121:   public static final String TBL_PROP_PARQUET_MR_TIME_ZONE =
TBL_PROP_PARQUET_MR_WRITE_ZONE

let's keep the names as similar as possible, otherwise it gets confusing pretty 
quickly


Line 1381:   private boolean hasParquetMRTimeZone() {
hasParquetMrTimeZone() (and change MR to Mr elsewhere)


Line 1382:     String key = TBL_PROP_PARQUET_MR_TIME_ZONE;
inline into the containsKey() call in L1385


Line 1400:    * The caller must ensure that the property is contained in the 
'tblProperties' map. If
Why even pass the map then? Seems cleaner to pass the TZ value


Line 1406:     String key = TBL_PROP_PARQUET_MR_TIME_ZONE;
inline


Line 1409:     boolean is_timezone_valid;
Java style


Line 1413:       is_timezone_valid = false;
log the exception stack - this is definitely unexpected behavior, and we should 
not just swallow it


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 184:     parquetMRTimeZone_ = tbl_.validateParquetMRTimeZone(error);
let's not validate this for every scan node / query


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/service/BackendConfig.java
File fe/src/main/java/org/apache/impala/service/BackendConfig.java:

Line 57:   public boolean isAssignUtcToParquetMRTimeZoneProp() {
keep names as consistent as possible (i.e. ideally this should reflect the name 
of the FLAGS in the BE)


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Reply via email to