Attila Jeges has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in 
Parquet
......................................................................


Patch Set 4:

(40 comments)

http://gerrit.cloudera.org:8080/#/c/5939/3//COMMIT_MSG
Commit Message:

Line 14: values from Parquet files written by Hive.
> I don't think the "vice versa" part is correct. My understanding is that Hi
Removed it.


Line 17: Impala reads Parquet MR timestamp data and adjusts values using a time
> typo: adjusts
Done


Line 20: applied to data written by Impala.
> Mention interaction with --convert_legacy_hive_parquet_utc_timestamps
Done


Line 23: global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
> Not sure if the name of this flag was already decided upon, but imo, it's n
Changed the name of the flag to 
set_parquet_mr_int96_write_zone_to_utc_on_new_tables


Line 23: global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is
> It is also important that it is only set on new tables and that it is set t
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 242:       if (!is_timestamp_dependent_timezone_) {
> What if timezone_ is NULL here?
Added error logging.


Line 244:             parent_->scan_node_->parquet_mr_write_zone());
> Does FindTimezone() really throw? We're not handling that in other places.
You are right, it does not throw an exception. The documentation I found on 
boost::local_time::tz_database::time_zone_from_region() was incorrect: 
http://marc.info/?l=boost-bugs&m=132589759305973


Line 249:         }
> DCHECK that is_timestamp_dependent_timezone_ is true xor timezone_ is set
If parquet_mr_time_zone() returns an invalid timezone string then
(!timestamp_dependent_timezone_ && timezone_ == 0) will be true.

We check for this error condition in ScalarColumnreader::ConvertSlot() and log 
error there as well if it is true.

On the other hand it is true that is_timestamp_dependent_timezone_ and 
timezone_ cannot be set at the same time. Added a DCHECK for that.


Line 554:   /// table property to avoid repeated calls to 
'TimezoneDatabase::FindTimezone()'.
> initialize these new members in the constructor
Done


Line 608:     const TimestampValue* src, TimestampValue* dst, MemPool* pool) {
> DCHECK_NE
Done


Line 611:   DCHECK_NE(parent_->scan_node_->parquet_mr_write_zone(), "UTC");
> Avoid creating/destroying this status even if no error happens (a possible 
Done


Line 628:       // Not a timestamp specific timezone. Convert timestamp to the 
timezone object
> swap this case with the else if above
Done


Line 636:           parent_->parse_status_ = status;
> else {
As explained above, it is possible that is_timestamp_dependent_timezone_ is 
false and timezone_ is NULL. Added logging an error message.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

Line 194:   for (const auto& tz_region: tz_region_list_) {
> please convert to for-each while you are here
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

Line 37:   /// Converts the name of a timezone to a boost timezone object.
> Not your change, but comment on what happens when the timezone was not foun
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

Line 111:   if (timezone == NULL) {
> Can timezone be NULL?
Removed the try-catch and added code to return false when timezone_ == NULL


Line 113:     return false;
> Which function can throw?
I was under the impression that FimezoneDatabase::FindTimezone() can throw, but 
it can't.
Removed the try-catch and added check for timezone_ == NULL


Line 125:   return true;
> I have a feeling boost might throw here if we give it funny input.
What do you mean? The documentation doesn't say anything about an exception 
thrown from local_date_time(posix_time::ptime, timezone_ptr) constructor.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

Line 199:   /// conversion was successfull and false otherwise.
> comment on return values (and functions below)
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

Line 354: JNIEXPORT jboolean JNICALL
> Why even pass and return thrift structs? You can just take a string and ret
Done


Line 365: static JNINativeMethod native_methods[] = {
> Are you sure we need this? I see catalogd-main.cc calling InitCommonRuntime
You are correct, I've removed the timezone db initialization from here.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 123:     "the behavior of recent versions of Hive and Spark as well. The 
writing of timestamps"
> Hive always writes in UTC (that's the problem). It makes Hive write as if i
Done. Also changed the default value of the flag to false to stay 
backwards-compatible with older releases.


http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/util/backend-gflag-util.cc
File be/src/util/backend-gflag-util.cc:

Line 55:   cfg.__set_set_parquet_mr_int96_write_zone_to_utc_on_new_tables(
> let's keep the name consistent across various uses
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 215:   // Specifies a time zone for adjusting timestamp values read from 
Parquet files written
> ... from Parquet files written by parquet-mr.
Done


Line 216:   // by parquet-mr. The actual value comes from 
"parquet.mr.int96.write.zone" table
> Mention that this comes from the table property
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

Line 316:   ("PARQUET_MR_WRITE_ZONE_INVALID", 102, "Invalid timezone '$0' 
specified in "
> Why even mention the file name? Seems better to mention the table name.
Done


Line 320:    "'$1' for a Parquet file in table '$2'."),
> Parquet (capital P)
Done


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java
File 
fe/src/main/java/org/apache/impala/analysis/AlterTableSetTblProperties.java:

Line 166:    * null, then the method ensures that 'parquet.mr.int96.write.zone' 
is supported for its
> analyzeParquetMrTimeZone()
Renamed it to 'analyzeParquetMrWriteZone()' to be more consistent with the 
table property name. Also, there's no need to keep this version of the 
overloaded method, so I removed it.


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java
File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java:

Line 67:   }
> Seems like the wrong place to validate this. Why would we validate it for e
I was copying the code that validates the skip.header.line.count table 
property. But yes, you're right, the timzone validation shouldn't be done here.


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java
File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java:

Line 184:       String timezone = 
getTblProperties().get(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE);
> This should be an error for non-HDFS tables.
Correct. I need some help here. How can I check if the table that will be 
created is HDFS or non-HDFS?


Line 189:       
getTblProperties().put(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE, "UTC");
> We should only do this for HDFS tables right?
Correct. How can I check if the table that will be created is HDFS or non-HDFS?


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 121:       "parquet.mr.int96.write.zone";
> TBL_PROP_PARQUET_MR_WRITE_ZONE
Done


Line 1381:   public String getParquetMrWriteZone() {
> hasParquetMrTimeZone() (and change MR to Mr elsewhere)
Done


Line 1382:     org.apache.hadoop.hive.metastore.api.Table msTbl = 
getMetaStoreTable();
> inline into the containsKey() call in L1385
Done. I also got rid off 'hasParquetMRTimeZone()' as it is only called from one 
place.


Line 1400:       // Look for Avro schema in TBLPROPERTIES and in 
SERDEPROPERTIES, with the latter
> Why even pass the map then? Seems cleaner to pass the TZ value
Removed this method from the HdfsTable class. Timezone validation is only 
needed in CreateTableStmt and AlterTableSetTblProperties classes so I moved 
validation there.


Line 1406: 
> inline
Removed validateParquetMRTimeZone() method


Line 1409:       if (avroSchema_ == null) {
> Java style
Removed validateParquetMRTimeZone() method


Line 1413:             msTbl.getSd().getCols(), getFullName());
> log the exception stack - this is definitely unexpected behavior, and we sh
Removed validateParquetMRTimeZone() method


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 184:   }
> let's not validate this for every scan node / query
Removed the validation from here. The BE validates the table property 
(parquet-column-readers.cc).


http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/service/BackendConfig.java
File fe/src/main/java/org/apache/impala/service/BackendConfig.java:

Line 57:   public boolean isSetParquetMrWriteZoneToUtcOnNewTables() {
> keep names as consistent as possible (i.e. ideally this should reflect the 
Done. Renamed it to isSetParquetMrWriteZoneToUtcOnNewTables().


-- 
To view, visit http://gerrit.cloudera.org:8080/5939
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-Reviewer: Zoltan Ivanfi <zi+ger...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to