Alex Behm has posted comments on this change. Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet ......................................................................
Patch Set 4: (14 comments) Still need to look at tests. Code is getting close. http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: Line 304: // Time zone for adjusting timestamp values read from Parquet files written by single line http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 246: parent_->state_->LogError(ErrorMsg(TErrorCode::PARQUET_MR_WRITE_ZONE_INVALID, I don't think logging an error is the right thing to do here. We will scan all files and just keep appending to the error log, but no useful results can come of a query in this state. This error is not recoverable, so I think we should terminate this scan and the query. Consider moving this code into Reset() or better add a new Open() or Init() function that is called once per column reader from HdfsParquetScanner::CreateColumnReaders(). That way you can return a Status early and not scan any data. Line 653: } else { As mentioned above, we should terminate the query early when in such a bad state. Imo, there should be a DCHECK here. http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: Line 194: for (const auto& tz_region: tz_region_list_) { No need to use auto here. const string& is clearer http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: Line 40: /// If 'tz' is not found in the database, null pointer is returned. null pointer -> nullptr Line 46: /// If 'tz' is not found in the database, null pointer is returned. null pointer -> nullptr http://gerrit.cloudera.org:8080/#/c/5939/3/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 125: return true; > What do you mean? The documentation doesn't say anything about an exception Agree, I could not find any indication in the code either. http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 199: /// conversion was successfull and false otherwise. Comment on what the state of *this is when false is returned. Line 202: /// Converts from UTC to local time in the given timezone. Returns true if conversion Comment on what the state of *this is when false is returned. http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/service/fe-support.cc File be/src/service/fe-support.cc: Line 357: const char *tz = env->GetStringUTFChars(timezone, NULL); const char* http://gerrit.cloudera.org:8080/#/c/5939/4/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 122: "the parquet.mr.int96.write.zone table property to UTC for new tables. This changes " consistent placing of space (at line end or at line beginning) http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java File fe/src/main/java/org/apache/impala/analysis/BaseTableRef.java: Line 67: } > I was copying the code that validates the skip.header.line.count table prop My apologies, but I was wrong on this one. I think it's unfortunate, but better to check this for every query. The alternative is to catch the issue during table loading, but Impala has no way to fix such tables today, because you can only ALTER tables that are fully loaded. That's an independent issue, but for now, it does seem better to check every time like for skip.header.linecount http://gerrit.cloudera.org:8080/#/c/5939/3/fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java File fe/src/main/java/org/apache/impala/analysis/CreateTableStmt.java: Line 184: String timezone = getTblProperties().get(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE); > Correct. I need some help here. How can I check if the table that will be c You can check getFileFormat() here. The only non-HDFS formats we have are HBase and Kudu. http://gerrit.cloudera.org:8080/#/c/5939/4/fe/src/main/java/org/apache/impala/service/FeSupport.java File fe/src/main/java/org/apache/impala/service/FeSupport.java: Line 87: // Returns true if timezone String is valid, false otherwise. valid according to the BE timezone database -- To view, visit http://gerrit.cloudera.org:8080/5939 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Taras Bobrovytsky <[email protected]> Gerrit-Reviewer: Zoltan Ivanfi <[email protected]> Gerrit-HasComments: Yes
